LEON SIMD-within-a-Register Arithmetic

Overview

The SIMD-within-a-register (SWAR) arithmetic extensions mainly target fixed-point applications that work with sub-word precision, e.g. 3 bits (a LEON2 word is 32 bits wide), such as satellite navigation or data encryption. Processor performance increases when SIMD-like operations are applied to several variables stored side by side in one 32-bit word, so that the data-path circuitry is shared by two or more operations executed in one clock cycle.
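The idea of sharing one data path between several sub-words can be illustrated in plain C. The following is a minimal software sketch, assuming four unsigned 8-bit lanes per 32-bit word; it is not the hardware implementation and does not use the SWAR instructions, and the function names are hypothetical. The masking keeps carries from crossing a lane boundary, which the SWAR unit would handle in hardware.

```c
#include <stdint.h>

/* Lane-wise addition of four unsigned 8-bit values packed in one 32-bit
   word. Each lane wraps modulo 256; no carry crosses a lane boundary. */
uint32_t swar_add_4x8(uint32_t a, uint32_t b)
{
    const uint32_t msb = 0x80808080u;        /* top bit of every lane     */
    uint32_t low = (a & ~msb) + (b & ~msb);  /* add the low 7 bits/lane   */
    return low ^ ((a ^ b) & msb);            /* fold in the top bits      */
}

/* Reference: the same operation computed one lane at a time. */
uint32_t scalar_add_4x8(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        r |= ((x + y) & 0xFFu) << (8 * lane);
    }
    return r;
}
```

The scalar version needs four iterations for what the packed version does with a handful of word-wide operations; a dedicated SWAR module removes the masking overhead as well.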

Figure: SWAR unit - internal structure.

The SWAR instruction extensions are implemented as a SWAR unit that is connected in parallel to the integer ALU in the LEON2 integer pipeline (iu.vhd). The SWAR unit contains one or more SWAR modules and an optional module with SWAR accumulators. The actual configuration of the SWAR unit can be selected by the user before LEON2 synthesis using the make xconfig mechanism (see [RD1]).

At present six SWAR modules have been developed. Three application-specific modules are meant to accelerate GNSS processing:

  • Correlation of GNSS signals (sum of products for 1-4 bit data words),
  • Demodulation of GNSS signals (real and complex vector multiplication for 2-4 bit data words),
  • Sine / cosine lookup for GNSS demodulation (lookup of 1-4 bit values for 32 bit arguments).
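The correlation module's "sum of products" over packed sub-words can be sketched as a small functional model in C. This is an illustration only, not the vendor's functional model: the function names are hypothetical, and the lane encoding (two's-complement sub-words packed from the least significant bit upwards) is an assumption. The sketch covers 2- to 4-bit lanes; the 1-bit case is left out because its value encoding is not stated in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the lowest `bits` bits of v (two's complement; assumed). */
static int32_t sign_extend(uint32_t v, unsigned bits)
{
    uint32_t sign = 1u << (bits - 1);
    v &= (1u << bits) - 1u;
    return (int32_t)(v ^ sign) - (int32_t)sign;
}

/* Hypothetical model: sum of lane-wise products of two packed words.
   With bits == 3 there are 10 lanes and the top 2 bits are ignored. */
int32_t swar_correlate(uint32_t a, uint32_t b, unsigned bits)
{
    assert(bits >= 2 && bits <= 4);
    int32_t sum = 0;
    unsigned lanes = 32 / bits;
    for (unsigned i = 0; i < lanes; i++) {
        int32_t x = sign_extend(a >> (i * bits), bits);
        int32_t y = sign_extend(b >> (i * bits), bits);
        sum += x * y;
    }
    return sum;
}
```

With 2-bit lanes one call covers 16 sample/code pairs, which is the kind of work a single correlation instruction would replace.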

Three more generic modules are meant to accelerate applications that work with numbers up to 16 bits wide and use mostly addition, subtraction and multiplication:

  • Audio processing, data partitioned to 2x 16-bit words in one 32-bit register (ADD, SUB, MUL with optional reduction),
  • Video processing, data partitioned to 4x 8-bit words in one 32-bit register (ADD, SUB, MUL with optional reduction),
  • Generic ALU with user-defined data partitioning (ADD, SUB, MUL with optional reduction).

In addition, accumulators can be used together with the audio ALU, video ALU or generic ALU modules. They must be configured to fit the maximum number of lanes and the maximum slice width across the selected SWAR ALU modules:

  • up to 16 independent accumulator registers,
  • each accumulator register up to 64 bits wide.
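How per-lane accumulators of a configurable width interact with a partitioned ALU can be sketched as follows. This is a hedged model, not the actual hardware behaviour: it assumes the audio partitioning (2x 16-bit signed lanes), a multiply-accumulate operation, and wrap-around on overflow; whether the real accumulators wrap or saturate is not stated in the text. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SWAR_LANES 2  /* audio partitioning from the text: 2x 16-bit lanes */

typedef struct {
    uint64_t lane[SWAR_LANES]; /* raw accumulator bits, one per lane */
    unsigned width;            /* accumulator width in bits, 1..64   */
} swar_acc_t;

static uint64_t acc_mask(unsigned width)
{
    return width >= 64 ? UINT64_MAX : (UINT64_C(1) << width) - 1;
}

void swar_acc_clear(swar_acc_t *acc, unsigned width)
{
    assert(width >= 1 && width <= 64);
    acc->width = width;
    for (int i = 0; i < SWAR_LANES; i++)
        acc->lane[i] = 0;
}

/* Interpret one 16-bit lane of w as a signed value. */
static int32_t lane16(uint32_t w, int i)
{
    int32_t v = (int32_t)((w >> (16 * i)) & 0xFFFFu);
    return v >= 0x8000 ? v - 0x10000 : v;
}

/* acc[i] += a[i] * b[i] for both lanes; wraps at the accumulator width
   (assumed behaviour). */
void swar_mac_2x16(swar_acc_t *acc, uint32_t a, uint32_t b)
{
    uint64_t mask = acc_mask(acc->width);
    for (int i = 0; i < SWAR_LANES; i++) {
        int64_t p = (int64_t)lane16(a, i) * lane16(b, i);
        acc->lane[i] = (acc->lane[i] + (uint64_t)p) & mask;
    }
}

/* Read a lane back as a signed number of `width` bits. */
int64_t swar_acc_read(const swar_acc_t *acc, int i)
{
    uint64_t mask = acc_mask(acc->width);
    uint64_t v = acc->lane[i];
    if ((v >> (acc->width - 1)) & 1u)   /* negative: v - 2^width */
        return -(int64_t)(mask - v) - 1;
    return (int64_t)v;
}
```

The sketch shows why the accumulator width has to be chosen with the lane width in mind: a product of two 16-bit lanes already needs up to 32 bits, so a narrower accumulator can only hold such products after scaling or with overflow.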

The user can specify register partitioning at synthesis time. The following picture shows an example of register partitioning for vector correlation.

Figure: Correlation - mapping of subwords in source registers.

SWAR configurations fall into two principal groups: general-purpose computing and GNSS processing. The following table defines the common configurations.

Common SWAR configurations - supported sub-32b word operations.
Identifier   correl  demod  sincos  audio  video  ALU    ACC
8b, 10b, 16b ALU w/ ACC
swaralu      N       N      N       N      N      3x10b  3x14b
swaraudio    N       N      N       Y      N      N      2x20b
swarvideo    N       N      N       N      Y      N      4x12b
GNSS
swargnss-2b  2b      2b     2b      N      N      N      N
swargnss-3b  3b      3b     3b      N      N      N      N
swargnss-4b  4b      4b     4b      N      N      N      N

Validation

The SWAR unit has been validated in several independent ways:

  1. By hand-transforming the GNSS tracking loop code to work with 2-, 3- and 4-bit values, and by transforming elementary computation kernels into SWAR modules while verifying that the tracking loop can still decode bits of the navigation message. First the SWAR modules were represented by their functional models in C. The functional models were then replaced by actual SWAR modules implemented in VHDL.
  2. By validating SWAR modules developed in VHDL against their functional models written in C.
  3. By validating LEON2 execution of the tracking loop, using the SWAR functional models in C or the actual SWAR modules in VHDL against desktop execution of the tracking loop using the SWAR functional models in C.

Availability

The SWAR unit IP core is provided as synthesizable VHDL code or as an FPGA netlist. It is available either separately or bundled with the LEON2-FT processor. Deliverables include:

  • VHDL-RTL code or gate-level netlist,
  • testing environment,
  • simulation scripts,
  • golden reference test vectors,
  • synthesis scripts,
  • user documentation.

The IP core is guaranteed against defects for ninety days from the date of purchase. Thirty days of technical support over email and phone is included. Additional support and maintenance options are available.

Hardware Compatibility

The SWAR unit is compatible with the following processors:

  • LEON2 / LEON2-FT
  • LEON3 / LEON3-FT

Software Compatibility

The LEON SWAR extensions are controlled through new user-defined data types in C programs. This requires the SPARCv8 LLVM compiler and binutils with daiteq extensions. Alternatively, the extensions can be controlled directly with assembler instructions embedded in C programs; this requires a legacy SPARCv8 C compiler together with binutils with daiteq extensions.

Xilinx Virtex7 Implementation Results

Implementation parameters for common SWAR configurations implemented in Xilinx Virtex7 are shown in the following table.

Resource requirements for the LEON2 processor core with common SWAR configurations implemented in Xilinx Virtex7.
Flavour      Slices  Slice regs   LUTs  LUTRAM  DSP48E1  Freq [MHz]
LEON2
no swar        6703        6573  15507     408       15         102
LEON2 w/ 8b, 10b, 16b ALU w/ ACC
swaralu        5700        6747  15911     408       18         103
swaraudio      6882        6724  15780     408       17         106
swarvideo      6977        6816  15871     407       19         103
LEON2 w/ GNSS
swargnss-2b    7390        6711  16257     408       15         103
swargnss-3b    5759        6695  16305     408       15         107
swargnss-4b    6500        6717  16460     408       15          96

MicroSemi PolarFire Implementation Results

Implementation parameters for common SWAR configurations implemented in MicroSemi PolarFire are shown in the following table.

Resource requirements for the LEON2 processor core with common SWAR configurations implemented in MicroSemi PolarFire.
Flavour      Fabric 4LUT  Fabric DFF  uSRAM 1K  uSRAM 18K  Math (18x18)  Freq [MHz]
LEON2
no swar            25025        7014        12         41            13         109
LEON2 w/ 8b, 10b, 16b ALU w/ ACC
swaralu            25545        7401        12         41            16         109
swaraudio          25539        7353        12         41            15         108
swarvideo          25641        7523        12         41            17         107
LEON2 w/ GNSS
swargnss-2b        25614        7490        12         41            13         108
swargnss-3b        25533        7348        12         41            33         107
swargnss-4b        25675        7345        12         41            29         108

NanoXplore NG-Medium Implementation Results

Resource requirements for the LEON2 processor core with common SWAR configurations implemented in NanoXplore NG-Medium.
Flavour      4-LUT   DFF  XLUT  RFB  DSP  RAM  Freq [MHz]
LEON2
no swar      11100  3686     0    0    3   28      25.277
LEON2 w/ 8b, 10b, 16b ALU w/ ACC
swaralu      11665  3815     0    0    6   28      25.298
swaraudio    11604  3814     0    0    5   28      24.636
swarvideo    11579  3824     0    0    7   28      22.557
LEON2 w/ GNSS
swargnss-2b  11806  3799     0    0    3   28      23.094
swargnss-3b  11808  3799     0    0    3   28      26.466
swargnss-4b  11817  3803     0    0    3   28      23.097

GNSS Performance with SWAR Arithmetic

To evaluate the potential benefit of the SWAR extensions, the following configurations of the GNSS tracking loop algorithm have been implemented and profiled:


  1. Octave-equivalent, i.e. all samples and computations in floating point, original USRP data file with complex samples coded using 2x 16 bits, buffered execution (all intermediate results are stored in arrays, as in Octave). Computed correlation results are identical to those computed in Octave.
  2. Like the previous step, but navigation samples and the carrier wave (sine/cosine) values are quantized to 2 bits.
  3. Samples quantized to 2b values, integer arguments for spreading code expansion and carrier generation, expand separate Early, Prompt, Late codes, do not use SWAR instructions, buffered execution.
  4. Like the previous step, but use SWAR for demodulation.
  5. Like the previous step, but use SWAR for both demodulation and correlation.
  6. Like the previous step, but use SWAR also for sine/cosine lookup.
  7. Samples quantized to 2b values, integer arguments for spreading code expansion and carrier generation, expand separate Early, Prompt, Late codes, use SWAR for sine/cosine lookup, demodulation and correlation, fused carrier generation and demodulation. Navigation samples are read from the FIFO and stored in a buffer at the beginning of the processing.
  8. Like the previous step, but all processing steps fused, i.e. carrier generation, demodulation and correlation. Navigation samples are read from the FIFO when needed (without buffering).
  9. Like the previous step, but expand just one spreading code and use it for all Early, Prompt and Late codes.
  10. Like the previous step, but use HW support for spreading code expansion.
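The difference between buffered execution and fused processing steps can be sketched in scalar C. This is a simplified illustration under stated assumptions, not the profiled tracking loop: it uses real-valued samples and a single code instead of complex samples with Early, Prompt and Late codes, the 4-entry carrier table is hypothetical, and no SWAR instructions are involved. It only shows that fusing the three steps removes the intermediate arrays while producing the same correlation result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SAMPLES 1024  /* buffer size for the buffered variant (arbitrary) */

/* Hypothetical carrier table indexed by the top two bits of a 32-bit phase. */
static int carrier_lookup(uint32_t phase)
{
    static const int8_t table[4] = { 1, 1, -1, -1 };
    return table[phase >> 30];
}

/* Buffered: every step writes a full intermediate array. */
int32_t track_buffered(const int8_t *samples, const int8_t *code, size_t n,
                       uint32_t phase, uint32_t step)
{
    int8_t carrier[MAX_SAMPLES];
    int16_t demod[MAX_SAMPLES];
    int32_t sum = 0;
    assert(n <= MAX_SAMPLES);
    for (size_t i = 0; i < n; i++, phase += step)   /* carrier generation */
        carrier[i] = (int8_t)carrier_lookup(phase);
    for (size_t i = 0; i < n; i++)                  /* demodulation       */
        demod[i] = (int16_t)(samples[i] * carrier[i]);
    for (size_t i = 0; i < n; i++)                  /* correlation        */
        sum += demod[i] * code[i];
    return sum;
}

/* Fused: one pass over the samples, no intermediate arrays. */
int32_t track_fused(const int8_t *samples, const int8_t *code, size_t n,
                    uint32_t phase, uint32_t step)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++, phase += step)
        sum += samples[i] * carrier_lookup(phase) * code[i];
    return sum;
}
```

Both functions return the same value for the same inputs; the fused variant trades the memory traffic of the intermediate buffers for a single loop body.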

The following table lists execution times for computing one iteration of the GNSS tracking loop on a LEON2-FT with the 2-bit GNSS SWAR extensions, running at 25 MHz, after the indicated optimization steps were implemented in the algorithm. The first row is a configuration that computed all floating-point operations in software and did not use any SWAR operations. The last row lists execution times for an implementation that used the daiFPU and the SWAR GNSS operations.

GNSS tracking loop - execution times for the main processing steps. A dash marks a step for which no separate time is reported.
Config  FIFO [us]  Codes [us]  Carrier [us]  Demodulation [us]  Correlation [us]  Total [us]
A - Original reference code, 2-bit values, SoftFloat
2          63,942   4,874,978     9,624,254             56,033           108,136  14,530,587
B - like A plus SWAR instructions and daiFPU
5           2,343      82,868       105,190              9,186             5,771     208,005
6           2,430      82,689        20,601              9,182             6,204     121,286
C - like B plus fused steps
7           2,413      83,874             -             20,753             5,812     113,396
8               -      83,873             -             20,129             6,116     110,118
D - like C plus just one expanded code
9               -      27,772             -             18,245             4,366      50,802
10              -      15,286             -             18,300             4,501      38,502
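The overall gain can be read off the Total column by dividing the baseline time by the time of each later configuration. The short C sketch below does this arithmetic; the totals are copied from the table, while the type and function names are hypothetical.

```c
#include <stddef.h>

/* Total execution time per configuration, in microseconds (from the table). */
typedef struct {
    int config;
    double total_us;
} loop_time_t;

static const loop_time_t LOOP_TIMES[] = {
    {  2, 14530587.0 },  /* baseline: SoftFloat, no SWAR */
    {  5,   208005.0 },
    {  6,   121286.0 },
    {  7,   113396.0 },
    {  8,   110118.0 },
    {  9,    50802.0 },
    { 10,    38502.0 },
};

/* Speedup of `config` relative to configuration 2; 0.0 if unknown. */
double speedup_vs_baseline(int config)
{
    const double baseline = LOOP_TIMES[0].total_us;
    for (size_t i = 0; i < sizeof LOOP_TIMES / sizeof LOOP_TIMES[0]; i++)
        if (LOOP_TIMES[i].config == config)
            return baseline / LOOP_TIMES[i].total_us;
    return 0.0;
}
```

For configuration 10 the ratio is roughly 377. Since the baseline computes floating point in software, this figure reflects the combined effect of the daiFPU, the SWAR operations and the algorithmic changes, not of SWAR alone.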

Additional information is available on request.