LEON SIMD-within-a-Register Arithmetic
Overview
The SIMD-within-a-register (SWAR) arithmetic extensions target fixed-point applications that work with sub-word precision, e.g. 3-bit values (a LEON2 word is 32 bits wide), such as satellite navigation or data encryption. Processor performance is increased by executing SIMD-like operations on several variables packed side by side in one 32-bit word, so that two or more operations share the data-path circuitry and execute in one clock cycle.
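The lane-wise idea can be illustrated in plain C, without any SWAR hardware. The sketch below adds four unsigned 8-bit values packed in one 32-bit word while keeping carries from crossing lane boundaries. It is only a software illustration of the concept: the function name `swar_add_4x8` and the 4x 8-bit partitioning are assumptions chosen for the example, not the LEON instruction set.

```c
#include <stdint.h>

/* Lane-wise addition of four unsigned 8-bit values packed in one 32-bit
 * word. Each lane wraps modulo 256 and never carries into its neighbour.
 * Hypothetical helper for illustration only. */
uint32_t swar_add_4x8(uint32_t a, uint32_t b)
{
    const uint32_t low7 = 0x7F7F7F7Fu; /* lower 7 bits of every lane */
    const uint32_t msb  = 0x80808080u; /* top bit of every lane      */

    /* Adding only the lower 7 bits cannot overflow out of a lane. */
    uint32_t sum = (a & low7) + (b & low7);

    /* Add the top bits with XOR, which discards the carry-out. */
    return sum ^ ((a ^ b) & msb);
}
```

A hardware SWAR module gets the same effect by cutting the carry chain at the lane boundaries, so the masking overhead shown here disappears.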

SWAR unit - internal structure.
The SWAR instruction extensions are implemented as a SWAR unit that is connected in parallel to the integer ALU in the LEON2 integer pipeline (iu.vhd). The SWAR unit contains one or more SWAR modules and an optional module with SWAR accumulators. The actual configuration of the SWAR unit can be selected by the user before LEON2 synthesis using the make xconfig mechanism (see [RD1]).
At present six SWAR modules have been developed. Three application-specific modules are meant to accelerate GNSS processing:
- Correlation of GNSS signals (sum of products for 1-4 bit data words),
- Demodulation of GNSS signals (real and complex vector multiplication for 2-4 bit data words),
- Sine / cosine lookup for GNSS demodulation (lookup of 1-4 bit values for 32 bit arguments).
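The sine / cosine lookup can be pictured as a small table indexed by the top bits of a 32-bit phase argument. The following C sketch is a minimal illustration under stated assumptions, not the module's implementation: the 8-entry table, the four output levels (±1, ±2, which fit in 2 bits) and the names `sin_q` and `cos_q` are all hypothetical.

```c
#include <stdint.h>

/* Assumed quantization: sine sampled at the centre of each of 8 phase
 * sectors and rounded to four levels (+-1, +-2). The real module's
 * table and encoding are not specified in this document. */
static const int8_t sin_tab[8] = { 1, 2, 2, 1, -1, -2, -2, -1 };

/* The 32-bit phase covers one full turn: 0 -> 0, 2^32 -> 2*pi.
 * The top 3 bits select the sector. */
int sin_q(uint32_t phase)
{
    return sin_tab[phase >> 29];
}

/* Cosine is sine advanced by a quarter turn (2^30 phase units);
 * unsigned arithmetic wraps around the full turn automatically. */
int cos_q(uint32_t phase)
{
    return sin_tab[(uint32_t)(phase + 0x40000000u) >> 29];
}
```

Using an unsigned 32-bit phase means a carrier phase accumulator can simply overflow, with no explicit modulo step.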
Three more generic modules are meant to accelerate applications that work with numbers of up to 16 bits and mostly use addition, subtraction and multiplication:
- Audio processing, data partitioned to 2x 16-bit words in one 32-bit register (ADD, SUB, MUL with optional reduction),
- Video processing, data partitioned to 4x 8-bit words in one 32-bit register (ADD, SUB, MUL with optional reduction),
- Generic ALU with user-defined data partitioning (ADD, SUB, MUL with optional reduction).
In addition, accumulators can be used with the audio ALU, video ALU or generic ALU modules. They have to be configured to fit the maximum number of lanes and the maximum slice width of the selected SWAR ALU modules:
- up to 16 independent accumulator registers,
- each accumulator register up to 64 bits wide.
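A software model helps to see how per-lane accumulators relate to the packed operands. The sketch below models a multiply-accumulate over the 2x 16-bit audio partitioning, followed by a reduction over the lanes. The type `swar_acc_t`, the function names, the signed interpretation of the lanes and the meaning of "reduction" as a sum over lanes are assumptions made for illustration. The model uses 64-bit accumulators (the maximum width listed above), whereas a real configuration may be narrower.

```c
#include <stdint.h>

#define LANES 2 /* audio partitioning: 2x 16-bit lanes per 32-bit word */

/* Hypothetical model of the accumulator bank, one accumulator per lane. */
typedef struct { int64_t acc[LANES]; } swar_acc_t;

/* Extract lane i (0 = least significant) as a signed 16-bit value. */
static int32_t lane16(uint32_t w, unsigned i)
{
    int32_t v = (int32_t)((w >> (16u * i)) & 0xFFFFu);
    return (v >= 0x8000) ? v - 0x10000 : v;
}

/* Lane-wise multiply-accumulate: acc[i] += a[i] * b[i]. */
void swar_mac_2x16(swar_acc_t *s, uint32_t a, uint32_t b)
{
    for (unsigned i = 0; i < LANES; ++i)
        s->acc[i] += (int64_t)lane16(a, i) * lane16(b, i);
}

/* Assumed reduction: sum of all lane accumulators. */
int64_t swar_acc_reduce(const swar_acc_t *s)
{
    int64_t sum = 0;
    for (unsigned i = 0; i < LANES; ++i)
        sum += s->acc[i];
    return sum;
}
```

Keeping one wide accumulator per lane is what lets a long sum run without overflowing the narrow lanes of the source registers.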
The user can specify register partitioning at synthesis time. The following picture shows an example of register partitioning for vector correlation.

Correlation - mapping of subwords in source registers.
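As a rough functional model of the correlation (sum of products) over packed sub-words, the following C sketch treats each 32-bit source register as sixteen 2-bit lanes. The lane order (lane 0 in the least significant bits), the two's-complement encoding of the 2-bit values and the names `lane2` and `correl_2b` are assumptions. The actual mapping is defined by the synthesis-time register partitioning.

```c
#include <stdint.h>

/* Sign-extend the 2-bit field at lane index i (0 = least significant).
 * Assumed encoding: two's complement, i.e. 0,1,2,3 -> 0,1,-2,-1. */
static int lane2(uint32_t w, unsigned i)
{
    int v = (int)((w >> (2u * i)) & 0x3u);
    return (v & 0x2) ? v - 4 : v;
}

/* Sum of products over all sixteen 2-bit lanes of a and b. */
int correl_2b(uint32_t a, uint32_t b)
{
    int acc = 0;
    for (unsigned i = 0; i < 16; ++i)
        acc += lane2(a, i) * lane2(b, i);
    return acc;
}
```

The loop runs sixteen iterations in software, which is the work a SWAR correlation module would fold into a single operation on two source registers.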
SWAR configurations can be divided into two principal groups: general-purpose computing and GNSS processing. Common configurations are defined in the following table.
Identifier | correl | demod | sincos | audio | video | ALU | ACC |
---|---|---|---|---|---|---|---|
8b, 10b, 16b ALU w/ ACC | |||||||
swaralu | N | N | N | N | N | 3x10b | 3x14b |
swaraudio | N | N | N | Y | N | N | 2x20b |
swarvideo | N | N | N | N | Y | N | 4x12b |
GNSS | |||||||
swargnss-2b | 2b | 2b | 2b | N | N | N | N |
swargnss-3b | 3b | 3b | 3b | N | N | N | N |
swargnss-4b | 4b | 4b | 4b | N | N | N | N |
Validation
The SWAR unit has been validated in several independent ways:
- By hand-transforming the GNSS tracking loop code to work with 2-, 3- and 4-bit values, and by transforming elementary computation kernels into SWAR modules while verifying that the tracking loop can still decode bits of the navigation message. First the SWAR modules were represented by their functional models in C. The functional models were then replaced by actual SWAR modules implemented in VHDL.
- By validating SWAR modules developed in VHDL against their functional models written in C.
- By validating LEON2 execution of the tracking loop, using the SWAR functional models in C or the actual SWAR modules in VHDL against desktop execution of the tracking loop using the SWAR functional models in C.
Availability
The SWAR unit IP core is provided as synthesizable VHDL code or as an FPGA netlist. It is available either separately or bundled with the LEON2-FT processor. Deliverables include:
- VHDL-RTL code or gate-level netlist,
- testing environment,
- simulation scripts,
- golden reference test vectors,
- synthesis scripts,
- user documentation.
The IP core is guaranteed against defects for ninety days from the date of purchase. Thirty days of technical support over email and phone is included. Additional support and maintenance options are available.
Hardware Compatibility
The SWAR unit is compatible with the following processors:
- LEON2 / LEON2-FT
- LEON3 / LEON3-FT
Software Compatibility
The LEON SWAR extensions are controlled through new user-defined data types in C programs. This requires the SPARCv8 LLVM compiler and binutils with daiteq extensions. Alternatively, the LEON SWAR extensions can be controlled directly with assembler instructions in C programs; this requires a legacy SPARCv8 C compiler together with binutils with daiteq extensions.
Xilinx Virtex7 Implementation Results
Implementation parameters for common SWAR configurations implemented in Xilinx Virtex7 are shown in the following table.
Flavour | Slices | Slice regs | LUTs | LUTRAM | DSP48E1 | Freq [MHz] |
---|---|---|---|---|---|---|
LEON2 | ||||||
no swar | 6703 | 6573 | 15507 | 408 | 15 | 102 |
LEON2 w/ 8b, 10b, 16b ALU w/ ACC | ||||||
swaralu | 5700 | 6747 | 15911 | 408 | 18 | 103 |
swaraudio | 6882 | 6724 | 15780 | 408 | 17 | 106 |
swarvideo | 6977 | 6816 | 15871 | 407 | 19 | 103 |
LEON2 w/ GNSS | ||||||
swargnss-2b | 7390 | 6711 | 16257 | 408 | 15 | 103 |
swargnss-3b | 5759 | 6695 | 16305 | 408 | 15 | 107 |
swargnss-4b | 6500 | 6717 | 16460 | 408 | 15 | 96 |
MicroSemi PolarFire Implementation Results
Implementation parameters for common SWAR configurations implemented in MicroSemi PolarFire are shown in the following table.
Flavour | Fabric 4LUT | Fabric DFF | uSRAM 1K | uSRAM 18K | Math (18x18) | Freq [MHz] |
---|---|---|---|---|---|---|
LEON2 | ||||||
no swar | 25025 | 7014 | 12 | 41 | 13 | 109 |
LEON2 w/ 8b, 10b, 16b ALU w/ ACC | ||||||
swaralu | 25545 | 7401 | 12 | 41 | 16 | 109 |
swaraudio | 25539 | 7353 | 12 | 41 | 15 | 108 |
swarvideo | 25641 | 7523 | 12 | 41 | 17 | 107 |
LEON2 w/ GNSS | ||||||
swargnss-2b | 25614 | 7490 | 12 | 41 | 13 | 108 |
swargnss-3b | 25533 | 7348 | 12 | 41 | 33 | 107 |
swargnss-4b | 25675 | 7345 | 12 | 41 | 29 | 108 |
NanoXplore NG-Medium Implementation Results
Flavour | 4-LUT | DFF | XLUT | RFB | DSP | RAM | Freq [MHz] |
---|---|---|---|---|---|---|---|
LEON2 | |||||||
no swar | 11100 | 3686 | 0 | 0 | 3 | 28 | 25.277 |
LEON2 w/ 8b, 10b, 16b ALU w/ ACC | |||||||
swaralu | 11665 | 3815 | 0 | 0 | 6 | 28 | 25.298 |
swaraudio | 11604 | 3814 | 0 | 0 | 5 | 28 | 24.636 |
swarvideo | 11579 | 3824 | 0 | 0 | 7 | 28 | 22.557 |
LEON2 w/ GNSS | |||||||
swargnss-2b | 11806 | 3799 | 0 | 0 | 3 | 28 | 23.094 |
swargnss-3b | 11808 | 3799 | 0 | 0 | 3 | 28 | 26.466 |
swargnss-4b | 11817 | 3803 | 0 | 0 | 3 | 28 | 23.097 |
GNSS Performance with SWAR Arithmetic
To evaluate the potential benefit of the SWAR extensions, the following configurations of the GNSS tracking loop algorithm have been implemented and profiled:
1. Octave-equivalent, i.e. all samples and computations in floating point, original USRP data file with complex samples coded using 2x 16 bits, buffered execution (store all intermediate results in arrays like in Octave). Computed correlation results are identical to those computed in Octave.
2. Like the previous step, but navigation samples and the carrier wave (sine/cosine) values quantized to 2 bits.
3. Samples quantized to 2b values, integer arguments for spreading code expansion and carrier generation, expand separate Early, Prompt, Late codes, do not use SWAR instructions, buffered execution.
4. Like the previous step, but use SWAR for demodulation.
5. Like the previous step, but use SWAR for both demodulation and correlation.
6. Like the previous step, but use SWAR also for sine/cosine lookup.
7. Samples quantized to 2b values, integer arguments for spreading code expansion and carrier generation, expand separate Early, Prompt, Late codes, use SWAR for sine/cosine lookup, demodulation and correlation, fused carrier generation and demodulation. Navigation samples are read from the FIFO and stored in a buffer at the beginning of the processing.
8. Like the previous step, but all processing steps fused, i.e. carrier generation, demodulation and correlation. Navigation samples are read from the FIFO when needed (without buffering).
9. Like the previous step, but expand just one spreading code and use it for all Early, Prompt and Late codes.
10. Like the previous step, but use HW support for spreading code expansion.
The following table lists execution times for one iteration of the GNSS tracking loop on a LEON2-FT with the 2-bit GNSS SWAR extensions running at 25 MHz, after the indicated optimization steps were applied to the algorithm. The Config column gives the position of the configuration in the list above. The table starts with a configuration that computes all floating-point operations in software and uses no SWAR operations. The last row lists execution times for an implementation that uses the daiFPU and the SWAR GNSS operations.
Config | FIFO | Codes | Carrier | Demodulation | Correlation | Total |
---|---|---|---|---|---|---|
. | [us] | [us] | [us] | [us] | [us] | [us] |
A - Original reference code, 2-bit values, SoftFloat | ||||||
2 | 63‘942 | 4‘874‘978 | 9‘624‘254 | 56‘033 | 108‘136 | 14‘530‘587 |
B - like A plus SWAR instructions and daiFPU | ||||||
5 | 2‘343 | 82‘868 | 105‘190 | 9‘186 | 5‘771 | 208‘005 |
6 | 2‘430 | 82‘689 | 20‘601 | 9‘182 | 6‘204 | 121‘286 |
C - like B plus fused steps | ||||||
7 | 2‘413 | 83‘874 | 20‘753 | 5‘812 | 113‘396 | |
8 | 83‘873 | 20‘129 | 6‘116 | 110‘118 | ||
D - like C plus just one expanded code | ||||||
9 | 27‘772 | 18‘245 | 4‘366 | 50‘802 | ||
10 | 15‘286 | 18‘300 | 4‘501 | 38‘502 |
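The overall gain can be read off the table by dividing total execution times. The short C sketch below does this arithmetic relative to configuration 2. It assumes that the last value in each table row is the Total column. The type `row_t` and the function names are hypothetical helpers, not part of any delivered software.

```c
#include <stddef.h>
#include <stdio.h>

/* Total execution time per configuration, in microseconds, copied from
 * the table (assumption: the last value in each row is the total). */
typedef struct { int config; double total_us; } row_t;

static const row_t rows[] = {
    {  2, 14530587.0 }, {  5, 208005.0 }, {  6, 121286.0 },
    {  7,   113396.0 }, {  8, 110118.0 }, {  9,  50802.0 },
    { 10,    38502.0 },
};

/* Speedup of `config` relative to configuration 2 (the first row).
 * Returns 0.0 if the configuration is not in the table. */
double speedup_vs_config2(int config)
{
    for (size_t i = 0; i < sizeof rows / sizeof rows[0]; ++i)
        if (rows[i].config == config)
            return rows[0].total_us / rows[i].total_us;
    return 0.0;
}

/* Print one line per configuration: total time and speedup. */
void print_speedups(void)
{
    for (size_t i = 0; i < sizeof rows / sizeof rows[0]; ++i)
        printf("config %2d: %10.0f us, %6.1fx\n", rows[i].config,
               rows[i].total_us, speedup_vs_config2(rows[i].config));
}
```

With these totals, configuration 10 comes out roughly 377 times faster than configuration 2, and configuration 5 roughly 70 times faster.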
Additional information is available on request.