EE290A HW3
A StrongARM Implementation of Viterbi Decoding
Philip Chong
1. Description
We examine a prototype algorithm for implementing Viterbi decoding
on a StrongARM SA-1100 processor.
The implementation is compiled C code, loosely based on
reference code courtesy of Rhett Davis.
2. Architecture Details
The following SA architecture details offer the most significant
optimization opportunities:
- Variable-length bit shifts complete in a single cycle (barrel
shifter); this is a major advantage in bit twiddling. In fact, almost
every optimization used relies on this feature, so migrating to an
architecture without it would be costly.
- Single-cycle issue means multiplies are cheap, provided the result
is not needed right away.
- Conditional execution on all instructions means simple if-then-else
constructs need not incur branching overhead.
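As an illustration of how these features combine, here is a minimal
sketch of an add-compare-select step for one trellis state. The
function name acs(), its arguments, and the convention that a smaller
metric is better are assumptions made for this example, not taken
from the reference code.

```c
#include <stdint.h>

/* Hypothetical add-compare-select for one trellis state.
 * m0, m1: accumulated metrics of the two predecessor paths
 *         (assumed: smaller is better).
 * b0, b1: branch metrics of the two incoming branches.
 * state:  destination state index; only the low 5 bits select the
 *         bit position, so the shift count is always valid.
 * decisions: word collecting one survivor bit per state.
 * Returns the surviving path metric. */
static uint32_t acs(uint32_t m0, uint32_t b0, uint32_t m1, uint32_t b1,
                    unsigned state, uint32_t *decisions)
{
    uint32_t p0 = m0 + b0;
    uint32_t p1 = m1 + b1;
    uint32_t take1 = (p1 < p0);      /* 1 if the second path wins */

    /* Variable-length shift: one operation on a barrel shifter. */
    *decisions |= take1 << (state & 31u);

    /* Simple select; a compiler may use conditional execution here
     * instead of a branch. */
    return take1 ? p1 : p0;
}
```

Whether the select actually becomes a conditional instruction depends
on the compiler and its options, so the generated assembly should be
checked.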
Here are some of the drawbacks:
- The data cache is small (8 KB); if we don't deal
with the traceback buffer in an intelligent way, we could thrash the
cache.
- The 3-cycle multiply latency means we have to order such
instructions carefully to avoid pipeline stalls.
- Similar care is needed with loads (1-cycle latency).
3. Optimizations
We assume the use of fixed-point arithmetic; since the SA word length
is 32 bits, we should not need to worry about overflow in most places.
Note that (x+1)^2 = (x-1)^2 + 4x; since we need to compute both
squares, this is a cheap way to get the second from the first (more so
since 4x is a simple shift in fixed point, cheap on the SA).
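A minimal sketch of the identity in fixed point follows. It assumes
the ideal symbol levels are -AMP and +AMP with AMP a power of two, so
the identity becomes (x+AMP)^2 = (x-AMP)^2 + 4*AMP*x. The names
AMP_SHIFT, AMP, and sq_dists() are hypothetical.

```c
#include <stdint.h>

#define AMP_SHIFT 4                /* assumed: ideal amplitude is 2^4 */
#define AMP       (1 << AMP_SHIFT)

/* Computes the squared distance of sample x from both ideal levels
 * using a single general multiply. */
static void sq_dists(int32_t x, int32_t *d_plus, int32_t *d_minus)
{
    int32_t t = x - AMP;

    *d_plus = t * t;               /* distance to +AMP: (x - AMP)^2 */

    /* Distance to -AMP: (x + AMP)^2 = (x - AMP)^2 + 4*AMP*x.
     * Multiplying by the constant 4*AMP amounts to a shift by
     * AMP_SHIFT + 2; it is written as a multiply because
     * left-shifting a negative int is undefined in C. */
    *d_minus = *d_plus + x * (4 * AMP);
}
```

With AMP_SHIFT set to 0 this reduces to the (x+1)^2 and (x-1)^2 form
given in the text.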
We need to determine the code bits for each branch hypothesis; this
is most easily done using a table lookup on the state being considered.
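One way such a table could be built is sketched below. The code
parameters are assumptions: a rate-1/2 code with constraint length 7
(64 states) and the commonly used generator polynomials 171 and 133
(octal). The table layout and the names code_bits, parity7(), and
init_code_bits() are likewise hypothetical.

```c
#include <stdint.h>

#define K       7                /* assumed constraint length */
#define NSTATES (1 << (K - 1))   /* 64 states */
#define G0      0171             /* assumed generators (octal) */
#define G1      0133

/* Index: (state << 1) | input bit. Value: two code bits, c0 in bit 1
 * and c1 in bit 0. */
static uint8_t code_bits[NSTATES * 2];

/* Parity of a value of at most 8 bits, folded with shifts and XORs. */
static unsigned parity7(unsigned v)
{
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1u;
}

/* Fills the table once at start-up; afterwards each branch
 * hypothesis costs only a single load. */
static void init_code_bits(void)
{
    for (unsigned reg = 0; reg < NSTATES * 2; reg++) {
        unsigned c0 = parity7(reg & G0);
        unsigned c1 = parity7(reg & G1);
        code_bits[reg] = (uint8_t)((c0 << 1) | c1);
    }
}
```

After init_code_bits() has run, code_bits[(state << 1) | bit] gives
the expected code bits for that branch.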
To make the best use of the cache, all large data arrays should be kept
contiguous, one after the other; the cache associativity should then
keep them from evicting each other.
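A minimal sketch of such a layout follows, using the buffer sizes
quoted in the power estimate (4*64 words of traceback, 2*64 words of
metrics). The struct and its field names are hypothetical.

```c
#include <stdint.h>

#define NSTATES  64    /* states, from the 2*64-word metric buffers */
#define TB_DEPTH 128   /* traceback levels */

/* Hypothetical layout: every large array in one block, back to back. */
struct decoder_data {
    uint32_t traceback[TB_DEPTH * NSTATES / 32]; /* 4*64 words of bits */
    uint32_t metric[2][NSTATES];                 /* old and new metrics */
};

/* Size in bytes, for comparison with the 8 KB data cache. */
static unsigned decoder_data_bytes(void)
{
    return (unsigned)sizeof(struct decoder_data);
}
```

Under these assumptions the block occupies 1536 bytes, well under the
size of the data cache.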
The traceback buffer is really only a series of bits, one decision
per state per stage; thus we can store 32 bits' worth of traceback in
a single word. This keeps the data size small and avoids cache misses.
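The packing can be sketched as follows, assuming 64 states (two words
per stage) and a 128-stage buffer. The helper names tb_put() and
tb_get() and the bit ordering within a word are assumptions for this
example.

```c
#include <stdint.h>

#define NSTATES         64
#define TB_DEPTH        128
#define WORDS_PER_STAGE (NSTATES / 32)

static uint32_t tb[TB_DEPTH * WORDS_PER_STAGE];

/* Stores the survivor decision (0 or 1) for one state at one stage. */
static void tb_put(unsigned stage, unsigned state, unsigned bit)
{
    uint32_t *w = &tb[stage * WORDS_PER_STAGE + (state >> 5)];
    unsigned pos = state & 31u;

    *w = (*w & ~((uint32_t)1 << pos)) | ((uint32_t)(bit & 1u) << pos);
}

/* Reads back the decision for one state at one stage. */
static unsigned tb_get(unsigned stage, unsigned state)
{
    uint32_t w = tb[stage * WORDS_PER_STAGE + (state >> 5)];

    return (unsigned)(w >> (state & 31u)) & 1u;
}
```

Both helpers rely on variable-length shifts, which is where the barrel
shifter pays off.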
4. Code
Here is a rough draft of the proposed algorithm, incorporating all the
optimizations discussed above. Variable declarations
and other details are missing,
so this will not compile, and there are probably other serious
bugs as well. However, this is as close as things can get without
a complete implementation.
The main components are there, so we use
this as the basis for our performance estimation.
5. Estimation
Function nextbit() is called with every received I-Q pair. We can assume
this function is inlined, to avoid the calling overhead. The code indicates
the cumulative cycle count (CCC) at strategic points. The count is
straightforward, since each instruction takes a single cycle and
pipeline stalls are avoided completely by the code rearrangements
indicated in the comments. With N=64 states, this function needs
12+32N=12+32(64)=2060 cycles to execute.
Similarly, the cycle count for the traceback() routine is given. We
assume a 128-level traceback is available; after every 32 bits, we
flush out the oldest 32 bits from the buffer and output them.
We estimate this routine will take 2075 cycles to flush out 32 bits.
Finally, the main routine calls nextbit() 32 times and traceback()
once, for every main loop iteration. Assuming the functions are
inlined, we need 68032 cycles to decode 32 bits, for an estimated
2126 cycles per bit.
At a clock frequency of 220 MHz, this gives a bit rate of
220 MHz / 2126, or roughly 103 kbps.
For a power estimate, the data sheets give an average consumption of
550 mW for the SA-1100 at maximum speed (220 MHz). We treat the small
memory requirement
(only 4*64 words for traceback, 2*64 words for metric buffers, plus
miscellaneous storage and instruction storage)
as a negligible contribution to the power of the overall system.
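The arithmetic behind these figures can be checked with a short
sketch. The 68032-cycle total is taken as given; the two routine
counts alone sum to 32(2060)+2075=67995, and the remaining cycles are
not itemized in the text. The energy per bit is a derived figure not
stated above, and the function names are hypothetical.

```c
#include <stdint.h>

#define CLOCK_HZ      220000000u  /* 220 MHz, from the text */
#define CYCLES_PER_32 68032u      /* cycles per 32 decoded bits */
#define POWER_MW      550u        /* average power, from the text */

static uint32_t cycles_per_bit(void)
{
    return CYCLES_PER_32 / 32u;
}

static uint32_t bits_per_second(void)
{
    return CLOCK_HZ / cycles_per_bit();
}

/* Energy per decoded bit in nanojoules:
 * (mW) / (bit/s) = mJ/bit, and 1 mJ = 1e6 nJ. */
static uint32_t nanojoules_per_bit(void)
{
    return (uint32_t)((uint64_t)POWER_MW * 1000000u / bits_per_second());
}
```

With these inputs the sketch gives 2126 cycles per bit, about 103 kbps,
and roughly 5.3 microjoules per decoded bit, assuming the processor
runs at full power for the whole decode.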
6. Possible Improvements
Some loop unrolling of the innermost loop can be done to save
some instructions; as a rough estimate, we can probably increase the
bit rate by about 10%.
The other source of optimization lies in exploiting parallelism in
the inner loop. If the metric computation uses 6-bit arithmetic,
we could theoretically perform multiple metric calculations in a
single cycle by packing them into one 32-bit word. This is somewhat
more complicated, and will require more analysis.
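A minimal sketch of the packed-arithmetic idea follows, assuming four
6-bit metrics per word, each in its own 8-bit lane so that two guard
bits absorb the carry of a single addition. The lane width and the
names pack4(), add4(), and lane() are assumptions; the comparisons
needed for a full add-compare-select are not covered.

```c
#include <stdint.h>

/* Packs four 6-bit metrics (0..63) into the four bytes of a word. */
static uint32_t pack4(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return  (uint32_t)(a & 63u)        |
           ((uint32_t)(b & 63u) << 8)  |
           ((uint32_t)(c & 63u) << 16) |
           ((uint32_t)(d & 63u) << 24);
}

/* Adds all four lanes with one 32-bit add. The sum of two 6-bit
 * values is at most 126, which fits in an 8-bit lane, so no carry
 * crosses into the neighbouring lane. */
static uint32_t add4(uint32_t x, uint32_t y)
{
    return x + y;
}

/* Extracts lane i (0..3) from a packed word. */
static unsigned lane(uint32_t w, unsigned i)
{
    return (unsigned)(w >> (8u * i)) & 0xFFu;
}
```

Repeated accumulation would overflow the guard bits, so the lanes
would need periodic renormalization; that is part of the further
analysis mentioned above.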
pchong@eecs.berkeley.edu
Feb 02 1999