Tuesday, July 20, 2021

TSO and IBM System/370

[Document is currently being updated due to new insights] The IBM System/370 (IBM 370) is a mainframe from the 1970s.


You might wonder why I would discuss a machine that few people have heard of and even fewer will ever need to work with. To understand memory models like those of Java and C++ and to understand their performance implications, I find it useful to have some insight into hardware memory models, how they are implemented, and especially how they are related. The reason why the IBM 370 is so interesting is that it shows a very nice transition from sequential consistency (SC) to total store order (TSO).

TSO is the memory model used on the X86, but it isn't unique to the X86: ISAs like SPARC v8/v9 offer TSO as well. To see the performance advantage of TSO over SC, consider what happens under SC when a CPU does a store and there is a write miss: before it can write to the cache line, the CPU needs to wait until the cache line has been invalidated on all other CPUs that have a copy of it, or wait for main memory. In the meantime the CPU can't execute other instructions, since that could lead to loads/stores being performed out of order in the memory order. Effectively it isn't able to hide the memory latency.

Instead of blocking, the stores are written to a store buffer that sits between the CPU and the cache; since non-speculative stores are going to be written to the cache anyway, there is no need to wait for that to happen. The stores in the store buffer are kept in program order and are committed to the cache in program order.
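To make the mechanism more tangible, below is a toy, single-threaded model of a store buffer in Java. It is only a sketch and all names (StoreBufferModel, drainOne, and so on) are my own invention; a real store buffer is hardware, not a queue of objects.

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

public class StoreBufferModel {
    // One buffered store: which address gets which value.
    record Store(long address, long value) {}

    // FIFO queue: stores enter in program order and leave in program order.
    private final ArrayDeque<Store> buffer = new ArrayDeque<>();
    // Stand-in for the cache/memory the stores eventually commit to.
    private final Map<Long, Long> cache = new HashMap<>();

    // The CPU issues a store: it is only buffered, not yet globally visible.
    void store(long address, long value) {
        buffer.addLast(new Store(address, value));
    }

    // Loads read from the cache; looking into the buffer first is the
    // store to load forwarding discussed later in this post.
    long load(long address) {
        return cache.getOrDefault(address, 0L);
    }

    // Commit the oldest buffered store to the cache, preserving program order.
    void drainOne() {
        Store oldest = buffer.pollFirst();
        if (oldest != null) {
            cache.put(oldest.address(), oldest.value());
        }
    }

    public static void main(String[] args) {
        StoreBufferModel cpu = new StoreBufferModel();
        cpu.store(0x10, 1);
        System.out.println(cpu.load(0x10)); // 0: the store is still buffered
        cpu.drainOne();
        System.out.println(cpu.load(0x10)); // 1: now committed to the cache
    }
}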

Store/Load reordering

Because stores are buffered, an older store can be reordered, in the memory order, with a newer load to a different address. This is demonstrated with the store buffering litmus test.
Initial:
   X=0, Y=0

CPU1:
   X=1
   r1=Y

CPU2:
   Y=1
   r2=X
Can it be that r1=0, r2=0?

With SC this can't happen because no reordering is allowed (PO needs to be preserved in the memory order), so an older store can't be reordered with a newer load to a different address. But with TSO this can happen due to store buffers: even though the store retires before the load retires, the store is placed in the memory order after the load.

So with SC, the memory order will respect the PO of each thread; in other words it will be consistent with all 4 fences [LoadLoad][LoadStore][StoreLoad][StoreStore]. But since with TSO an older store can be reordered with a newer load to a different address, the memory order will only be consistent with the preserved program order of each CPU, that is [LoadLoad][LoadStore][StoreStore], since the [StoreLoad] fence is dropped.
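To see what this means for a language like Java, here is a minimal sketch of the store buffering test (the harness below is my own, not part of any tutorial). With plain fields both the JIT and a TSO CPU are allowed to order the store after the later load, so r1 == 0 && r2 == 0 can show up; declaring x and y volatile restores the [StoreLoad] fence (on x86 the JIT typically emits a lock-prefixed instruction after a volatile store) and forbids that outcome.

public class StoreBufferingTest {
    // Plain fields: both the JIT and a TSO CPU may order the store after the
    // later load. Declaring these volatile forbids the 0/0 outcome.
    static int x, y;

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 100_000; i++) {
            x = 0;
            y = 0;
            final int[] r = new int[2];

            Thread cpu1 = new Thread(() -> { x = 1; r[0] = y; });
            Thread cpu2 = new Thread(() -> { y = 1; r[1] = x; });
            cpu1.start(); cpu2.start();
            cpu1.join();  cpu2.join();

            if (r[0] == 0 && r[1] == 0) {
                System.out.println("store/load reordering observed at iteration " + i);
                return;
            }
        }
        System.out.println("no reordering observed");
    }
}

A naive harness like this rarely hits the window; on the JVM a tool such as jcstress is normally used to run litmus tests.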

Load/Store reordering

Is it allowed for an earlier load to be reordered with a later store to a different address? This can be expressed using the load buffering litmus test:
Initial:
   X=0, Y=0

CPU1:
  r1=X
  Y=1
  
CPU2:
   r2=Y
   X=1
Is the outcome r1=1 and r2=1 possible?

With TSO it isn't allowed to reorder an older load with a newer store to a different address. Think about it: the load is globally performed (read from the cache) before it retires. The store retires after the load, and the store is only globally performed (written to the cache) after it retires. So an earlier load will be performed in the memory order before a later store to a different address.
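A hedged Java sketch of the load buffering test (again my own harness): a TSO CPU will not perform the store before the earlier load, but the JIT is still free to reorder plain accesses, so at the Java level the fields need to be volatile (or accessed with acquire/release semantics) before the r1 == 1 && r2 == 1 outcome is really ruled out.

public class LoadBufferingTest {
    // volatile rules out r1 == 1 && r2 == 1 at the Java level as well; with
    // plain fields the compiler could in theory still reorder the load and
    // the store, even though a TSO CPU will not.
    static volatile int x, y;

    public static void main(String[] args) throws InterruptedException {
        final int[] r = new int[2];

        Thread cpu1 = new Thread(() -> { r[0] = x; y = 1; });
        Thread cpu2 = new Thread(() -> { r[1] = y; x = 1; });
        cpu1.start(); cpu2.start();
        cpu1.join();  cpu2.join();

        // r1 == 1 && r2 == 1 would require both loads to be reordered with
        // the later stores to a different address.
        System.out.println("r1=" + r[0] + ", r2=" + r[1]);
    }
}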

Load/Load and Store/Store reordering

Is it allowed for an earlier load to be reordered with a later load to a different address? And does the same go for two stores? This is expressed using the message passing litmus test:
Initial:
   X=0, Y=0

CPU1:
  X=1
  Y=1

CPU2:
  r1=Y
  r2=X
Can it be that r1=1 and r2=0?

With TSO this isn't allowed because it will preserve [LoadLoad] and [StoreStore]. The Intel X86 does make use of speculative out-of-order execution of loads, but if it detects a potential reordering, it will flush the pipeline and try again. This situation is called a memory order violation.
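In Java the message passing pattern looks like the sketch below (my own example). On a TSO machine the CPU won't break it, but the JIT may still reorder plain accesses, so the flag is declared volatile to keep the [StoreStore] and [LoadLoad] ordering at the language level as well.

public class MessagePassingTest {
    static int data;              // the payload; a plain field is fine here
    static volatile boolean flag; // volatile keeps store/store and load/load order in the JMM

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            data = 42;   // X=1
            flag = true; // Y=1
        });
        Thread reader = new Thread(() -> {
            if (flag) {  // r1=Y
                // Seeing flag == true guarantees data == 42 is visible too,
                // so the r1=1, r2=0 outcome can't happen.
                System.out.println("data=" + data); // r2=X
            }
        });
        writer.start(); reader.start();
        writer.join();  reader.join();
    }
}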

Store to Load forwarding (STLF)

One difference between SC and TSO is that TSO drops the [StoreLoad] fence. But there is another very important difference: with SC every load/store is atomic (such a store is called single-copy atomic) and you get the guarantee that at least one total order exists.

TSO doesn't provide single-copy atomic stores due to the store buffer. So imagine the following program:
CPU1:
   A=1
   r1=A
If the load would not look in the store buffer, the CPU would not see its own store. So with TSO, the CPU can look into the store buffer before the store is committed to the cache. This is called store to load forwarding (STLF). The problem is that the CPU can now see its own store before other CPUs can. Such a store is called multi-copy atomic since there are 2 copies: one in the store buffer and one in the cache.
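Extending the toy store-buffer model from earlier in this post (the names are again my own invention), store to load forwarding boils down to letting the load scan the store buffer from newest to oldest entry before falling back to the cache:

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class StlfModel {
    record Store(long address, long value) {}

    private final ArrayDeque<Store> buffer = new ArrayDeque<>();
    private final Map<Long, Long> cache = new HashMap<>();

    void store(long address, long value) {
        buffer.addLast(new Store(address, value));
    }

    // Store to load forwarding: the CPU sees its own buffered store before
    // any other CPU can, because the value never left the store buffer.
    long load(long address) {
        Iterator<Store> it = buffer.descendingIterator();
        while (it.hasNext()) {
            Store s = it.next();
            if (s.address() == address) {
                return s.value();
            }
        }
        return cache.getOrDefault(address, 0L);
    }

    public static void main(String[] args) {
        StlfModel cpu = new StlfModel();
        cpu.store(0xA, 1);                 // A=1 goes into the store buffer
        System.out.println(cpu.load(0xA)); // 1: forwarded from the store buffer
    }
}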

This can lead to some weird behavior as is demonstrated with the following test:
Initial:
   X=0, Y=0

CPU1:
   X=1
   r1=X
   r2=Y

CPU2:
   Y=1
   r3=Y
   r4=X
Can it be that r1=1, r3=1, r2=0 and r4=0? So could it be that r2=Y jumps before r1=X? And could r4=X jump before r3=Y? With TSO this is allowed because X=1 could be globally performed after r2=Y due to store buffers. r1=X can only be globally performed after X=1 is globally performed, so this means that r1=X is globally performed after r2=Y is globally performed. So effectively the store of X in the store buffer carries the load of X in the global memory order after the load of Y.

Keep in mind that globally performed isn't the same as executed: r1=X could very well be executed before r2=Y.

The consequence is that in the memory order, X=1 happened before r1=X. But the only way to get this outcome is if r1=X is performed before X=1. This means that the store happened before the load and the load happened before the store, so we have a cycle on the same address. The consequence is that the memory order can't always order loads/stores to the same address, and therefore the memory order doesn't provide a total order over all loads/stores. This is a direct consequence of multi-copy store atomicity.
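On the JVM, litmus tests like this one are normally run with jcstress. Below is a sketch of the test written for that harness; the outcome labels and descriptions are my own, and at the Java level the JIT may of course contribute to the reordering as well as the store buffer.

import org.openjdk.jcstress.annotations.Actor;
import org.openjdk.jcstress.annotations.JCStressTest;
import org.openjdk.jcstress.annotations.Outcome;
import org.openjdk.jcstress.annotations.State;
import org.openjdk.jcstress.infra.results.IIII_Result;

import static org.openjdk.jcstress.annotations.Expect.ACCEPTABLE;
import static org.openjdk.jcstress.annotations.Expect.ACCEPTABLE_INTERESTING;

@JCStressTest
@Outcome(id = "1, 0, 1, 0", expect = ACCEPTABLE_INTERESTING,
         desc = "Both CPUs forwarded their own store and read the other variable early")
@Outcome(expect = ACCEPTABLE, desc = "All other interleavings")
@State
public class StlfLitmusTest {
    int x, y;

    @Actor
    public void cpu1(IIII_Result r) {
        x = 1;    // X=1 goes into the store buffer
        r.r1 = x; // r1=X, in hardware typically satisfied by store to load forwarding
        r.r2 = y; // r2=Y, may still read 0 while X=1 sits in the store buffer
    }

    @Actor
    public void cpu2(IIII_Result r) {
        y = 1;    // Y=1
        r.r3 = y; // r3=Y
        r.r4 = x; // r4=X
    }
}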

Total order over stores

Luckily, even with multi-copy store atomicity there is still a total order over the stores; there is still a single moment in time where a store becomes globally visible (when it commits to the cache). That is how TSO gets its name: it guarantees that at least one total order exists over the stores issued by different CPUs to different addresses. This can be demonstrated with the independent reads of independent writes (IRIW) litmus test:
Initial:
  X=0, Y=0
  
CPU1:
  X=1
  
CPU2:
  Y=1
  
CPU3:
  r1=X
  r2=Y
  
CPU4:
  r3=Y
  r4=X
Can it be that r1=1, r2=0, r3=1, r4=0? So could it be that the CPUs see the stores to different addresses, issued by different CPUs, in different orders? With TSO this isn't allowed because of the total order over the stores.
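A hedged Java sketch of IRIW (my own harness): for volatile fields the JMM gives the same guarantee as TSO here, a single order over the independent stores that all observers agree on, so the r1=1, r2=0, r3=1, r4=0 outcome is forbidden.

public class IriwTest {
    // volatile keeps the Java-level guarantee in line with TSO: all threads
    // agree on a single order of the two independent stores.
    static volatile int x, y;

    public static void main(String[] args) throws InterruptedException {
        final int[] r = new int[4];

        Thread cpu1 = new Thread(() -> x = 1);
        Thread cpu2 = new Thread(() -> y = 1);
        Thread cpu3 = new Thread(() -> { r[0] = x; r[1] = y; });
        Thread cpu4 = new Thread(() -> { r[2] = y; r[3] = x; });

        cpu1.start(); cpu2.start(); cpu3.start(); cpu4.start();
        cpu1.join();  cpu2.join();  cpu3.join();  cpu4.join();

        // r1=1, r2=0, r3=1, r4=0 would mean the two readers disagree on the
        // order in which the stores to X and Y became visible.
        System.out.println("r1=" + r[0] + " r2=" + r[1] + " r3=" + r[2] + " r4=" + r[3]);
    }
}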

IBM 370

So how does the IBM 370 fit into the picture? The IBM 370 sits exactly between SC and TSO. As with SC, the IBM 370 requires a total order over all loads/stores. And as with TSO, older stores can be reordered with newer loads to a different address; so the memory order will preserve [LoadLoad][LoadStore][StoreStore].

If there is a store followed by a load to the same address, the IBM 370 requires that the store becomes globally visible before the load can be performed. What is the point, if the load returns the same value as it would under TSO? It prevents the reordering of an earlier load that could be satisfied using STLF with a later load that needs to come from the cache, and this ensures that the store is single-copy atomic instead of multi-copy atomic.

I'll explain it using the previous example.
Initial:
   X=0, Y=0

CPU1:
   X=1
   r1=X
   r2=Y

CPU2:
   Y=1
   r3=Y
   r4=X
With TSO, r1=X and r2=Y can be reordered due to STLF. But with the IBM 370, the load r1=X can only be globally performed after X=1 is globally performed, and the load r2=Y can only be globally performed after r1=X is globally performed, so this forces the loads to be performed in order. As a consequence we can't end up with r1=1, r2=0, r3=1, r4=0.

For more information check Shared Memory Consistency Models: A Tutorial.
