Micro-threading denotes a group of in-order processing architectures that schedule operations in hardware with the goal of masking long-latency operations, such as memory accesses, pipelined arithmetic, or hardware-accelerated operations.
Micro-threading is based on a combination of two well-known concepts – von Neumann execution and data-flow synchronization – that together guarantee efficient scheduling of operations while maintaining the traditional sequential description of computation commonly used to program general-purpose processors. Since the two concepts are orthogonal, micro-threading can be seen as a way to organize computation in processors that can be used with almost any instruction set architecture, and for single-core as well as multi-core setups.
Key features of the micro-threaded architecture are:
- Computation organized in families of threads. One micro-thread can be seen as one iteration of a for-loop, with a specific execution context defined by the program counter and a set of thread-specific registers.
- Fast context switch. A new context is introduced in the execute stage within 0 to 1 clock cycles, with at most one bubble introduced in the processing pipeline on a context switch.
- “Asynchronous” (delayed) register updates. Registers are updated by the hardware with actual values, independently of the processing pipeline, as soon as the producing operations complete. Typical examples are data fetches from external memory and other long-latency operations.
- Hardware-managed context switching on unsatisfied data dependencies or instruction-cache misses. In the optimal case, that is when the processing pipeline is not busy executing other threads, threads follow an as-soon-as-possible (ASAP) schedule.
- Hardware-managed thread creation and destruction. Thread management is controlled at the level of the ISA through new machine-level instructions.
- Software-controlled allocation and deallocation of thread entries. The complexity of the family-creation process is offloaded to software, where it is implemented as a sequence of thread-management instructions, analogous to the RISC philosophy of moving complexity from hardware to software.
Thread execution context is defined by:
- The program counter and a set of registers allocated to the thread. The program counter defines the sequence of instructions that form the thread. The allocated registers are a mix of family-global and thread-local registers; the first thread-local register is loaded with a unique thread ID on thread creation.
- Thread-local storage and family-global storage. Since the number of addressable registers is limited by the instruction encoding, thread-local storage and family-global storage are used for larger sets of data.
Key blocks that form the micro-threaded processor are:
- Processing pipeline, covering both the integer pipeline and the floating-point pipeline.
- Thread management, responsible for register and thread allocations and deallocations, and for thread creation, scheduling, reuse and cleanup.
- Self-synchronizing register file, with each register having its unique state that is used for data-flow synchronization of threads.
- Loosely-coupled caches that support thread state update on cache-line fetch completion independent of the processing pipeline.
- Long-latency operation modules that implement multi-cycle operations, such as integer multiplication and division or floating-point operations, with delayed register updates on operation completion.
The micro-threaded processor provides hardware acceleration for the execution and management of fine-grained logical threads, called “microthreads”.
The execution in each micro-thread is driven by the code of a “thread program”, declared with a special syntax close to the “kernel” syntax of CUDA or OpenCL. As in CUDA or OpenCL, micro-threads can be activated in batches, called “families”, where each individual thread has access to a special index variable that uniquely identifies it within the family. When the platform is equipped with multiple hardware threads or connected cores, families are automatically spread and multiplexed over the available hardware thread entries or parallel cores.
Each thread program can be, but need not be, as short as a few machine instructions. It is the batching of many micro-threads at once, combined with dedicated thread-management control in hardware, that provides the characteristic performance speed-ups of the architecture compared to a pure software approach with explicit control flow.
The main difference from other contemporary micro-threaded accelerators is that micro-threads in the micro-threaded processor are fully MIMD: each individual thread can use any platform service, including calling arbitrary C functions and creating and synchronizing with its own sub-threads.