Monday, April 1, 2024

How does hyperthreading work?

Introduction

In the last few months, there were two occasions where people were discussing the implementation of hyperthreading, the Intel implementation of Simultaneous Multi-Threading (SMT). In both discussions, people assumed that hyperthreading is effectively a very fast context-switch mechanism on the core, so that when one thread is waiting, e.g. due to a cache miss, the other thread can run. This is not how hyperthreading works.

TLDR

A modern core is pipelined, superscalar, and executes instructions out of order. Hyperthreading is nothing more than a mechanism to help the core increase parallelism. There is no fast context switch; the hyper siblings (the 2 threads on the same core) can truly run in parallel.

Short CPU architecture recap

Each CPU core has a frontend, responsible for fetching and decoding instructions; a backend, responsible for scheduling and executing instructions; and a memory subsystem that includes the load buffer, the store buffer, and the coherent cache.

To improve performance, modern CPUs are pipelined so that different instructions can be at different stages of the pipeline at the same time. Fetching and decoding are some of those stages (frontend), but there are more, such as execution and retirement (backend).

Modern CPUs are also superscalar: there are multiple execution units, e.g. ALUs for addition/subtraction, branch execution units, etc., so that independent instructions can run in parallel. Not only can a branch instruction run in parallel with an add; two adds can run in parallel as well. Being superscalar allows a CPU (core) to retire more than 1 instruction per cycle.
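
To make this concrete, here is a minimal sketch in C++ (the function and variable names are made up for illustration). None of the three operations depends on another, so the backend is free to issue them to different execution units in the same cycle; which unit runs what is up to the scheduler.

    // Three independent operations in one basic block. A superscalar
    // core can execute the add, the subtract, and the compare in the
    // same cycle, because none of them uses the result of another.
    int f(int a, int b, int c, int d) {
        int x = a + b;    // can issue immediately
        int y = c - d;    // independent of x: can issue in the same cycle
        int z = (a < d);  // also independent
        return x + y + z; // must wait until x, y, and z are all ready
    }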

Instructions from the frontend are placed in the reorder buffer so they can be scheduled by the backend. Instructions are issued into the reorder buffer in program order and they will retire in program order. The scheduler slides over the reorder buffer and puts instructions in reservation stations, where they wait for their operands to become ready. Once the operands are ready, an instruction can be sent to the appropriate execution unit. The scheduler schedules the instructions dynamically using the Tomasulo algorithm, which makes it possible for independent instructions to execute out of order.
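
A minimal sketch of why this matters, again with made-up names: the load below may miss in the cache, but the scheduler does not have to sit idle, because the multiply's operands are already available.

    // The load of *p may miss in the cache and take hundreds of cycles.
    // The multiply is independent, so the scheduler can dispatch it
    // while the load is still outstanding; only the final add has a
    // true dependency on both results and has to wait.
    int overlap(const int* p, int a, int b) {
        int loaded = *p;      // may stall on a cache miss
        int product = a * b;  // executes out of order, during the miss
        return loaded + product;
    }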

If a core could only use the architectural registers, there would be little room for parallel execution of instructions because there are simply too few registers. So apart from the 16 general-purpose architectural registers (32 with Intel APX), there are hundreds of physical registers. One of the tasks of the backend is register renaming: the mapping of architectural registers to physical registers. Renaming removes the false dependencies (write-after-write and write-after-read) and preserves only the true dependencies (read-after-write).
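
A sketch of the difference between false and true dependencies, using a deliberately reused local variable (names are illustrative):

    void g(int a, int b, int* out1, int* out2) {
        int t = a + 1;  // write t
        *out1 = t;      // read t: true read-after-write dependency
        t = b + 2;      // reuse of t: write-after-read and write-after-write
        *out2 = t;      // read t: depends only on the second write
    }
    // After renaming, the two writes to t end up in different physical
    // registers, so the two halves of the function can execute in
    // parallel; only the read-after-write edges remain.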

Every write to memory ends up in the store buffer. For every store issued to the reorder buffer, a slot is allocated in the store buffer. Stores can execute out of order, but they are committed to the coherent cache in program order.

Every load from memory is performed using the load buffer. For every load issued to the reorder buffer, a slot is allocated in the load buffer. Loads can be performed out of order, but if the core detects that there could be a violation of the memory order, the pipeline is nuked (building on top of the machinery for speculative execution) and restarted.

If a load/store misses in the coherent cache (the L1D), it is placed in a line fill buffer. The line fill buffer triggers the appropriate cache-coherence traffic, and once the cache line has been returned in the right state, the load can complete and the store can be committed to the coherent cache.

The desire for parallelism

As we can see from the descriptions above, instructions are executed out of order and preferably in parallel. To increase efficiency, as many instructions as possible should be executed every clock cycle.

ILP

A big task of the scheduler is to find instructions within a single thread of execution that can be performed in parallel. This form of parallelism is called instruction-level parallelism (ILP).
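
A classic way to make ILP visible in source code is a reduction; a sketch with made-up function names. The single-accumulator loop is one long dependency chain, while the four-accumulator version hands the scheduler four independent chains:

    // One accumulator: every add depends on the previous one, so the
    // loop runs at the latency of the add chain, no matter how wide
    // the core is.
    long sum1(const long* a, int n) {
        long s = 0;
        for (int i = 0; i < n; i++) s += a[i];
        return s;
    }

    // Four accumulators: four independent chains, so up to four adds
    // can be in flight at the same time.
    long sum4(const long* a, int n) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++) s0 += a[i]; // leftover elements
        return (s0 + s1) + (s2 + s3);
    }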

TLP

It can be very difficult to find sufficient parallelism inside a single instruction stream, e.g. when there are data dependencies. So instead of only looking at the instruction stream of a single thread, if the scheduler could draw from the instruction streams of multiple threads, it would have a bigger chance of finding instructions that can be performed in parallel.

And that is exactly what hyperthreading does: it enables the scheduler to look at the instruction streams of 2 threads to keep the resources within the core busy, and at any given moment, instructions from both hyper siblings can be running in parallel. This is a form of thread-level parallelism (TLP). The Sun Niagara II could draw from 8 instruction streams per core. Another form of TLP is multicore, where a single chip contains multiple full cores (a form of Symmetric Multi-Processing, SMP).
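
On Linux you can see which logical CPUs are hyper siblings of the same core; a minimal sketch, assuming the usual sysfs topology files (Linux-specific paths; the output format varies, e.g. "0,8" or "0-1"):

    #include <fstream>
    #include <iostream>
    #include <string>

    // Print the logical CPUs that share one physical core with cpu0.
    // Those logical CPUs are the hyper siblings that compete for the
    // resources of that core.
    int main() {
        std::ifstream f("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list");
        std::string siblings;
        if (std::getline(f, siblings)) {
            std::cout << "hyper siblings of cpu0: " << siblings << "\n";
        } else {
            std::cout << "topology information not available\n";
        }
    }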

How does hyperthreading impact the resources?

  1. architectural state: copied (each thread has its own registers)
  2. frontend: interleaving (fetching and decoding alternate between the threads)
  3. reorder buffer, load buffer, and store buffer: static partitioning (each thread gets a fixed share)
  4. reservation stations, execution units, and line fill buffers: competitive sharing (the threads compete dynamically for the same pool)

On the x86, a thread is not allowed to look into the store buffer of its hyper sibling. This would violate Total Store Order (TSO), the memory model of the x86, because the threads could then disagree on the order of stores. On the PowerPC, however, this is allowed.
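
The litmus test behind this is known as IRIW (independent reads of independent writes). Below is a sketch in C++ with relaxed atomics; read it as a hardware-level thought experiment (the C++ memory model itself also permits the weak outcome for relaxed loads, so this only illustrates what the hardware does with plain loads and stores):

    #include <atomic>

    std::atomic<int> x{0}, y{0};

    void writer_x() { x.store(1, std::memory_order_relaxed); }
    void writer_y() { y.store(1, std::memory_order_relaxed); }

    int r1, r2, r3, r4;
    void reader_1() {  // all four functions run concurrently
        r1 = x.load(std::memory_order_relaxed);
        r2 = y.load(std::memory_order_relaxed);
    }
    void reader_2() {
        r3 = y.load(std::memory_order_relaxed);
        r4 = x.load(std::memory_order_relaxed);
    }
    // The outcome r1==1, r2==0, r3==1, r4==0 means reader_1 saw the
    // store to x before the store to y, while reader_2 saw the opposite
    // order. TSO forbids this: a store becomes visible to all other
    // threads at once, when it leaves the store buffer. If a thread
    // could read from its hyper sibling's store buffer, it would see
    // that store before everyone else, and the outcome above would
    // become possible. On the PowerPC this outcome is allowed.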

If you disable hyperthreading in the BIOS, all of these resources are given to a single thread, so it is more likely that you get better performance for a single thread. This can be useful when there is a limited number of threads and you want to get more performance out of each of them.
