Implementation of a Viterbi Decoder on the ARM Architecture

Philip Chong

Introduction

We describe an implementation of the Viterbi decoding algorithm on the ARM microprocessor architecture. This implementation follows the specification for the overall project [5], so these results can be compared against those of the other groups to draw conclusions about the design of components for use in larger systems.

ARM Overview

The ARM family of microprocessors [2] comprises 32-bit processors featuring a scalar integer unit and no floating-point hardware. The architecture is a fairly standard 5-stage RISC pipeline. Most ALU operations (including a barrel shift, capable of arbitrary shifts of data) complete in one cycle; some instructions can be issued in one cycle but do not deliver their results for several cycles.

We will see in the next section how we take advantage of the idiosyncrasies of the ARM architecture to optimize the Viterbi implementation.

Architecture-Specific Optimizations

Using Lookup Tables

The branch metric computation specified in the project description can be very inefficient if implemented using ALU operations. Two multiplications and an addition are required for each branch computation. Multiplications are especially costly, as their results are only available 3 cycles after the instruction is issued; the intervening pipeline slots must be filled with instructions independent of the multiplication results [1].

In contrast, implementing the branch metric computation as a lookup table yields a significant performance gain. Memory accesses require only a one-cycle delay slot, as opposed to the potential three for multiplication. Thus not only is the number of instructions reduced, but there are also fewer constraints on instruction scheduling.

The only drawback to using a lookup table is the potentially large size of the table. Since the I and Q components of the metric act orthogonally, we require only a 64-element array for each functionally distinct metric computation. There are two such tables (one for (x-1)(x-1) and one for (x+1)(x+1)), so 128 words (512 bytes) are required.
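
As an illustration, a table-driven branch metric might be organized as follows. The 64-entry tables suggest 6-bit quantized I and Q samples; the fixed-point scale, the dequantization mapping, and all identifiers below are assumptions made for this sketch and are not taken from decode.c.

    #include <stdio.h>

    #define QLEVELS 64                     /* 6-bit quantized sample values   */
    #define ONE     (1L << 12)             /* +1.0 in an assumed Q4.12 format */

    static long sq_minus1[QLEVELS];        /* (x - 1)^2 for each quantized x  */
    static long sq_plus1[QLEVELS];         /* (x + 1)^2 for each quantized x  */

    /* Map a 6-bit table index back to a fixed-point value in roughly [-2, +2). */
    static long dequantize(int idx)
    {
        return ((long)idx - QLEVELS / 2) * (4 * ONE) / QLEVELS;
    }

    static void init_metric_tables(void)
    {
        int i;
        for (i = 0; i < QLEVELS; i++) {
            long x = dequantize(i);
            sq_minus1[i] = (x - ONE) * (x - ONE);
            sq_plus1[i]  = (x + ONE) * (x + ONE);
        }
    }

    /* Branch metric for one received (I, Q) pair, given by its quantized
     * indices, against the expected symbol pair (si, sq), each +1 or -1.
     * Two loads and one add replace two multiplies and one add per branch. */
    static long branch_metric(int i_idx, int q_idx, int si, int sq)
    {
        long mi = (si > 0) ? sq_minus1[i_idx] : sq_plus1[i_idx];
        long mq = (sq > 0) ? sq_minus1[q_idx] : sq_plus1[q_idx];
        return mi + mq;
    }

    int main(void)
    {
        init_metric_tables();
        printf("metric = %ld\n", branch_metric(48, 16, +1, -1));
        return 0;
    }

In this sketch the metric is the component-wise squared distance to the expected symbol, which matches the two-table organization described above.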

Additionally, the Viterbi code computation (essentially a parity computation) is efficiently handled with a lookup table as well; this table takes 64 words, so a total of 768 bytes is needed for lookup tables.
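
A 64-word parity table might take the following form; whether decode.c indexes it by raw state bits or by the state ANDed with a generator polynomial is not shown here, so the identifiers and the folding helper are illustrative only.

    static int parity_tab[64];             /* parity of each 6-bit index */

    static void init_parity_table(void)
    {
        int i;
        for (i = 0; i < 64; i++) {
            int v = i, p = 0;
            while (v) {                    /* accumulate one-bits modulo 2 */
                p ^= v & 1;
                v >>= 1;
            }
            parity_tab[i] = p;
        }
    }

    /* Parity of an operand wider than 6 bits, e.g. (state & generator),
     * via one XOR fold and one lookup instead of a bit-by-bit loop. */
    static int parity12(unsigned v)
    {
        return parity_tab[(v ^ (v >> 6)) & 63];
    }

If the operand is already only 6 bits wide, the fold can be omitted and the expected code bit comes from a single load.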

Caching Strategy

Early on, we recognized that poor cache management would result in a large performance degradation; minimizing memory utilization would be essential to achieving good data rates. To this end, we chose to implement the traceback buffer as a packed array of bits: each traceback entry contains only one bit of information, the LSB of the previous state, since the upper bits can be derived from the current state. Thus 32 bits of traceback can be packed into a single word in memory. Extraction of the packed bits is made efficient by the barrel shifter, so the performance lost to unpacking is minimized.
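
A sketch of such a packed traceback buffer follows. The state count, shift direction, and identifiers are assumptions chosen to match the description above (one survivor bit per state per stage, 32 bits per word), not the exact layout in decode.c.

    #define NSTATES   64
    #define TB_DEPTH  128
    #define WORDS_PER_STAGE (NSTATES / 32)

    static unsigned long tb[TB_DEPTH][WORDS_PER_STAGE];

    /* Record the survivor bit (the LSB of the chosen predecessor state)
     * for 'state' at trellis stage 'stage'. */
    static void tb_store(int stage, int state, unsigned bit)
    {
        unsigned long mask = 1UL << (state & 31);
        if (bit)
            tb[stage][state >> 5] |= mask;
        else
            tb[stage][state >> 5] &= ~mask;
    }

    /* During traceback, recover the predecessor of 'state' at 'stage'.
     * Only the predecessor's LSB was stored; its upper bits come from
     * shifting the current state (a shift the ARM barrel shifter can fold
     * into the same ALU operation). */
    static int tb_predecessor(int stage, int state)
    {
        unsigned bit = (unsigned)(tb[stage][state >> 5] >> (state & 31)) & 1U;
        return (int)((((unsigned)state << 1) & (NSTATES - 1)) | bit);
    }

With 64 states, each trellis stage occupies only two words, so the 128-step buffer fits in 1 KB of memory.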

Quality of Results

The bit error rate (BER) achieved using 16-bit fixed-point computation and a 128-step traceback buffer was 1.50e-4 for the reference data set. Applying the SNR degradation estimation provided with the reference data gives a 0.196 dB degradation. These are exactly the results obtained from the reference implementation, so we consider the performance of the ARM implementation acceptable.

We also attempted to adjust the algorithm to obtain better performance; an easy way to do this is to reduce the traceback depth. Reducing the traceback to 64 levels, however, increased the BER to 7.00e-4, with a corresponding 0.725 dB SNR degradation. This is unacceptable given the 0.05 dB SNR budget.

In another experiment we used a single-bit encoding scheme; single bits may offer an easy way to perform many path metric computations in parallel, using the full 32-bit width of the ALU. However, the quality degradation is severe: the corresponding BER is 1.36e-2, with a 1.963 dB SNR degradation.
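
As an illustration of the parallelism argument (this is not code from the implementation, and it assumes the single-bit encoding amounts to a hard decision on each received bit): a hard-decision branch metric is a Hamming distance, so one XOR compares a received bit against the expected code bits of 32 branches at once.

    /* rx_bit_replicated: the received hard-decision bit replicated across
     * the word (all zeros or all ones); expected_bits: the expected code
     * bit for each of 32 branches, one per bit position.  The result has
     * a one wherever a branch disagrees with the received bit.  The
     * packing convention here is hypothetical. */
    static unsigned long branch_errors(unsigned long rx_bit_replicated,
                                       unsigned long expected_bits)
    {
        return rx_bit_replicated ^ expected_bits;
    }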

The final implementation was thus chosen to use 16-bit resolution and a 128-level traceback, in order to meet the quality of results requirements.

Simulation

An instruction-level simulator for the ARM8 core was available from Prof. Rabaey's group. This simulator estimates power consumed for a given program based on profiling of the code.

Power models were also available for the SA-110 microprocessor, but the parameters of the project dictated the use of a standalone core, so the ARM8 was chosen here. The models also include power for an 8k unified I+D cache.

Unfortunately, power characterization was only available for the core running at 125 MHz and 3.3 V. As we will see in the next section, we require higher performance to match the desired bitrate for the project. We will subsequently show how to scale these results to meet the performance requirements.

Results

Raw Results

For decoding 4096 bits, the simulator indicated that 9.67 Mcycles of computation would be required, dissipating 53.14 mW of power. This gives a data rate of 52947 bits/s. Again, the power is characterized for operation at 125 MHz and 3.3 V.
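
For reference, this data rate follows directly from the cycle count at the characterized 125 MHz clock:

    4096 bits / (9.67e6 cycles / 125e6 cycles/s) = 4096 bits / 77.4 ms = 52947 bits/s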

These figures include the power used for the cache. Also, the figures here differ slightly from those given in the presentation slides; the figures in this report are somewhat newer, and reflect some additional optimization of the code.

Result Scaling

We note that the ARM architecture is capable of higher clock speeds than the one characterized by the simulator; 275 MHz SA-110 devices are available in current commercial products [3]. In addition, the SA-110 family from Intel operates at 2.0 V [1]. We therefore take these as achievable targets for our design. If we assume that power varies with the square of the voltage and linearly with the frequency, our results scale to a 116484 bits/s data rate and 42.94 mW power consumption.
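
Explicitly, the scaling works out as follows:

    Power:     53.14 mW x (2.0 V / 3.3 V)^2 x (275 MHz / 125 MHz) = 42.94 mW
    Data rate: 52947 bits/s x (275 MHz / 125 MHz)                 = 116484 bits/s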

Discussion

Design Cost (Area)

The ARM8 core proper occupies an area of 3.75 mm x 1.65 mm, while the 8k cache needs 3.2 mm x 0.4 mm, both in a 0.25 um process. Thus the total area for the design is 7.47 mm^2 [4].
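
That is:

    Core:  3.75 mm x 1.65 mm = 6.19 mm^2
    Cache: 3.20 mm x 0.40 mm = 1.28 mm^2
    Total:                     7.47 mm^2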

Design Effort

An estimated 4 person-days of design work were required for this design, including a small amount of time spent becoming familiar with the ARM architecture for the purpose of optimization.

The software is written entirely in ANSI C [6], so this design should be very portable. Some of the architecture-dependent optimizations may need to be rethought for a different target architecture, however.

Summary

Clock Speed                      275 MHz
Execution Performance            116.5 kb/s
Area                             7.47 mm^2 (0.25 um)
Power Dissipation                42.94 mW
                                 5.68 mW/mm^2

Acknowledgements

Many thanks to Marlene Wan for providing the power estimations, and other information about the ARM simulator.

References

  1. StrongARM(tm) SA-110 Microprocessor Technical Reference Manual, http://developer.intel.com/design/strong/manuals/278058.htm
  2. ARM Architecture Overview, http://www.arm.com/Architecture/ (page was available from previous homeworks, now missing)
  3. Hardware Computing Canada Web Site, http://www.hcc.ca/
  4. Marlene Wan, personal correspondence
  5. Project Overview, http://www-cad.eecs.berkeley.edu/~newton/Classes/EE290sp99/pages/homework.html#Project
  6. Project Source Code, decode.c