EE290A HW3

A StrongARM Implementation of Viterbi Decoding

Philip Chong

1. Description

We examine a prototype algorithm for implementing Viterbi decoding on a StrongARM SA-1100 processor. The implementation is compiled C code, loosely based on reference code courtesy of Rhett Davis.

2. Architecture Details

The following are the interesting SA architecture details that seem to give us huge optimization opportunities:

Variable length bit shifts are completed in a single cycle (barrel shifter); this is a huge advantage in bit twiddling. In fact, almost every optimization used relies on this feature, so migrating to an architecture without a barrel shifter would forfeit most of the gains.
Single cycle issue ensures multiplication will be quick.
Conditional execution on all instructions means if-then-else constructs have no branching overhead.
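As an illustration of where conditional execution helps, here is a minimal sketch of an add-compare-select step written so that the selection is a simple conditional expression. The names acs and acs_result are hypothetical and are not taken from the reference code; whether a given compiler actually emits conditionally executed instructions for this is not guaranteed.

```c
#include <stdint.h>

/* Result of one add-compare-select step (hypothetical layout). */
typedef struct {
    int32_t  metric;    /* surviving path metric (smaller is better) */
    unsigned decision;  /* 0 if path a survived, 1 if path b survived */
} acs_result;

/* Add the branch metrics to the two competing path metrics and keep the
   smaller sum. The if-then-else is a single conditional expression, the
   kind of construct that maps well onto conditional execution. */
static acs_result acs(int32_t pm_a, int32_t bm_a, int32_t pm_b, int32_t bm_b)
{
    int32_t a = pm_a + bm_a;
    int32_t b = pm_b + bm_b;
    acs_result r;

    r.decision = (b < a);            /* ties keep path a */
    r.metric   = r.decision ? b : a;
    return r;
}
```

The decision bit is what would be recorded in the traceback buffer for the state being updated.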

Here are some of the lossages:

The small 8 KB data cache could be a huge penalty; if we don't deal with the traceback buffer in an intelligent way, we could thrash the cache.
The 3 cycle multiply latency means we have to worry about ordering such instructions to avoid pipeline stalls.
Similar considerations must be kept in mind with loads (1 cycle latency).

3. Optimizations

We assume the use of fixed-point arithmetic; since the SA wordlength is 32 bits, we should not need to worry about overflow in most places.

Note that (x+1)^2 = (x-1)^2 + 4x; since we need to compute both squares, this is a nice and cheap way to do both (more so since 4x is a simple shift in fixed point, cheap on the SA).
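The identity can be sketched in C as follows. The fixed-point format (FRAC_BITS fractional bits) and the names both_squares and sq_dist are assumptions made for illustration, not part of the reference code. Results are scaled by ONE*ONE.

```c
#include <stdint.h>

/* Hypothetical fixed-point format: FRAC_BITS fractional bits, 1.0 == ONE. */
#define FRAC_BITS 4
#define ONE (1 << FRAC_BITS)

/* Squared distances from a received sample x to the symbols +1 and -1. */
typedef struct {
    int32_t to_plus_one;   /* (x - 1)^2 */
    int32_t to_minus_one;  /* (x + 1)^2 */
} sq_dist;

static sq_dist both_squares(int32_t x)
{
    sq_dist d;
    int32_t t = x - ONE;

    d.to_plus_one = t * t;  /* the only true multiply */

    /* (x+1)^2 = (x-1)^2 + 4x. In this fixed-point scaling the 4x term is
       x * 4 * ONE, a multiply by a power of two that a compiler can emit
       as a shift. */
    d.to_minus_one = d.to_plus_one + x * (4 * ONE);
    return d;
}
```

For example, with x = 1.5 (24 in this format) the function returns 64 and 1600, which are 0.25 and 6.25 after dividing by ONE*ONE = 256.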

We need to determine the code bits for each branch hypothesis; this is most easily done using a table lookup on the state being considered.
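A minimal sketch of such a table is given below. The report does not state the code being decoded; this sketch assumes a rate 1/2, 64-state (constraint length 7) code with the commonly used generators 171 and 133 (octal), and assumes the newest input bit sits in the top bit of the shift register. The names code_bits, parity and build_code_table are hypothetical.

```c
#include <stdint.h>

#define NUM_STATES 64
#define G0 0171u  /* assumed generator polynomial, octal */
#define G1 0133u  /* assumed generator polynomial, octal */

/* code_bits[state][input] holds the two code bits for that branch,
   packed as (bit from G0 << 1) | (bit from G1). */
static uint8_t code_bits[NUM_STATES][2];

/* Parity of a value that fits in 8 bits. */
static unsigned parity(unsigned v)
{
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1u;
}

/* Fill the table once at startup; the inner decoding loop then needs
   only an indexed load per branch hypothesis. */
static void build_code_table(void)
{
    for (unsigned s = 0; s < NUM_STATES; s++) {
        for (unsigned b = 0; b < 2; b++) {
            unsigned reg = (b << 6) | s;  /* input bit plus 6 state bits */
            code_bits[s][b] =
                (uint8_t)((parity(reg & G0) << 1) | parity(reg & G1));
        }
    }
}
```

The table occupies 128 bytes, so it adds little to the data cache footprint.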

To keep the working set resident in the cache, all large data arrays should be kept contiguous, one after the other; the cache associativity should then keep them from flushing each other out.
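One way to get a contiguous layout in C is to put the large arrays in a single structure. The sketch below uses the array sizes quoted in the estimation section (2*64 words of metrics, 4*64 words of traceback); the type name decoder_data and the compile-time size check are illustrative assumptions.

```c
#include <stdint.h>

#define NUM_STATES   64
#define TB_WORDS     4            /* 128 traceback levels / 32 bits per word */
#define DCACHE_BYTES (8 * 1024)   /* SA-1100 data cache size */

/* All large arrays in one structure, so they occupy one contiguous
   block of memory instead of wherever the linker places them. */
typedef struct {
    int32_t  metric[2][NUM_STATES];           /* old and new path metrics */
    uint32_t traceback[NUM_STATES][TB_WORDS]; /* packed decision bits */
} decoder_data;

static decoder_data dec;

/* Compile-time check: the array type has negative size, and so fails to
   compile, if the working set ever outgrows the data cache. */
typedef char fits_in_dcache[(sizeof(decoder_data) <= DCACHE_BYTES) ? 1 : -1];
```

With these sizes the structure is 1536 bytes, well under the 8 KB cache.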

The traceback buffer is really only a series of decision bits; thus we can store 32 bits' worth of traceback in a single word. This keeps the data size small and avoids cache misses.
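A minimal sketch of the packing follows, assuming one decision bit per state per received symbol and a 128-level circular buffer. The names tb, tb_store and tb_load are hypothetical. The variable shifts are exactly the operations the barrel shifter makes cheap.

```c
#include <stdint.h>

#define NUM_STATES 64
#define TB_DEPTH   128              /* traceback levels */
#define TB_WORDS   (TB_DEPTH / 32)  /* 32 decisions per word */

/* 4 words per state: 4*64 words in total. */
static uint32_t tb[NUM_STATES][TB_WORDS];

/* Record the decision bit for `state` at time step t (circular in t). */
static void tb_store(unsigned state, unsigned t, unsigned decision)
{
    unsigned  slot = t % TB_DEPTH;
    unsigned  bit  = slot & 31u;
    uint32_t *w    = &tb[state][slot >> 5];

    *w = (*w & ~((uint32_t)1 << bit)) | ((uint32_t)(decision & 1u) << bit);
}

/* Read back the decision bit for `state` at time step t. */
static unsigned tb_load(unsigned state, unsigned t)
{
    unsigned slot = t % TB_DEPTH;

    return (unsigned)((tb[state][slot >> 5] >> (slot & 31u)) & 1u);
}
```

Storing one decision per byte instead would need 8 KB for the same depth, which alone would fill the data cache.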

4. Code

Here is a rough draft of the proposed algorithm, incorporating all the optimizations discussed above. There are variable declarations and other details missing, so this will not compile. There are probably other serious bugs as well. However, this is as close as things can get without having a complete implementation done. The main components are there, and so we use this as a basis for our performance estimation.

Code Here

5. Estimation

Function nextbit() is called with every received I-Q pair. We can assume this function is inlined, to avoid the calling overhead. The code indicates the cumulative cycle count (CCC) at strategic points. This count is fairly straightforward, with instructions using a single cycle each. Pipeline stalls are avoided completely, with the code rearrangements indicated in the comments. This function needs 12 + 32N = 12 + 32(64) = 2060 cycles to execute.

Similarly, the cycle count for the traceback() routine is given. We assume a 128-level traceback is available; after every 32 bits, we flush out the oldest 32 bits from the buffer and output them. We estimate this routine will take 2075 cycles to flush out 32 bits.

Finally, the main routine calls nextbit() 32 times and traceback() once, for every main loop iteration. Assuming the functions are inlined, we need 68032 cycles to decode 32 bits, for an estimated 2126 cycles per bit.

At a clock frequency of 220 MHz, this gives a bit rate of roughly 100 kbps (220 MHz / 2126 cycles per bit is about 103 kbps).
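The arithmetic behind these figures can be reproduced with the short sketch below. All constants come from the estimates in this section; the function names are hypothetical. Note that 32 calls to nextbit() plus one traceback() sum to 67995 cycles, 37 fewer than the 68032 total used above; attributing the difference to main loop overhead is an assumption, since it is not itemized here.

```c
#define NUM_STATES        64
#define NEXTBIT_CYCLES    (12 + 32 * NUM_STATES)  /* per received I-Q pair */
#define TRACEBACK_CYCLES  2075                    /* per 32 decoded bits */
#define BITS_PER_ITER     32
#define TOTAL_CYCLES      68032L                  /* total quoted in the text */
#define CLOCK_HZ          220000000L

/* Cycles accounted for by the two routines alone. */
static long itemized_cycles(void)
{
    return (long)BITS_PER_ITER * NEXTBIT_CYCLES + TRACEBACK_CYCLES;
}

/* Cycles per decoded bit, from the quoted total. */
static long cycles_per_bit(void)
{
    return TOTAL_CYCLES / BITS_PER_ITER;
}

/* Decoded bits per second at the given clock rate (integer division). */
static long bit_rate_bps(void)
{
    return CLOCK_HZ / cycles_per_bit();
}
```

Either total gives a bit rate a little above 100 kbps, so the rounding in the text is not sensitive to the 37-cycle difference.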

As a power estimate, the SA-1100 consumes on average 550 mW at maximum speed (220 MHz) according to the data sheets. We treat the small memory requirement (only 4*64 words for traceback, 2*64 for metric buffers, plus miscellaneous storage and instruction storage) as a negligible contribution to the power of the overall system.
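As a rough derived figure, the power and bit rate estimates can be combined into an energy per decoded bit. The sketch below assumes the full 550 mW is spent on decoding and uses a bit rate of 103480 bps (220 MHz divided by 2126 cycles per bit); the function name is hypothetical.

```c
/* Energy per decoded bit in nanojoules.
   1 uW is 1000 nJ per second, so (uW * 1000) / (bits per second)
   gives nJ per bit. Integer arithmetic, result truncated. */
static long energy_nj_per_bit(long power_uw, long bit_rate_bps)
{
    return power_uw * 1000L / bit_rate_bps;
}
```

Under these assumptions, energy_nj_per_bit(550000L, 103480L) evaluates to 5315, or roughly 5.3 microjoules per decoded bit.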

6. Possible Improvements

Some loop unrolling of the innermost loop can be done to save some instructions; as a rough estimate, we can probably increase the bit rate by about 10%.

The other source of optimization lies in exploiting parallelism in the inner loop. If the metric computation uses 6-bit arithmetic, we could theoretically perform multiple metric calculations in a single cycle using a 32-bit word. This is somewhat more complicated, and will require more analysis.
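To make the idea concrete, here is a minimal sketch of packed addition. It assumes four 8-bit lanes per 32-bit word, each holding an unsigned 6-bit metric with two guard bits so that a sum cannot carry into the neighbouring lane. The lane width and the names pack4, lane and add4 are assumptions; the compare and select steps, which are the harder part, are not shown.

```c
#include <stdint.h>

/* Pack four 6-bit metrics into one word, one metric per 8-bit lane. */
static uint32_t pack4(unsigned m0, unsigned m1, unsigned m2, unsigned m3)
{
    return  (uint32_t)(m0 & 63u)        |
           ((uint32_t)(m1 & 63u) << 8)  |
           ((uint32_t)(m2 & 63u) << 16) |
           ((uint32_t)(m3 & 63u) << 24);
}

/* Extract lane i (0..3) from a packed word. */
static unsigned lane(uint32_t w, unsigned i)
{
    return (unsigned)((w >> (8u * i)) & 0xFFu);
}

/* Four additions with a single 32-bit add. Each lane sum is at most
   63 + 63 = 126, which fits in 8 bits, so no carry crosses a lane. */
static uint32_t add4(uint32_t a, uint32_t b)
{
    return a + b;
}
```

The guard bits only cover one addition of two 6-bit values; accumulating path metrics over many steps would need periodic renormalization, which is part of the further analysis mentioned above.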

pchong@eecs.berkeley.edu
Feb 02 1999