maandag 22 november 2021

Will temporary assignment to local variable impact performance?

This post is written as a consequence of the following stack overflow question. The question is if temporary assigning the results of some expression to local variables has any performance impact.

So is there any performance difference between the following 2 snipits:

int a = triple.getA();
int b = triple.getB();
int c = triple.getC();
return a + b + c;

and:

return triple.getA() + triple.getC() + triple.getC();

The JIT is pretty clever. If there would be any performance gain by removing temporary assignment to local variables, it will happily get rid of it. However, there is one caveat; the inline threshold for frequently executed methods depends on the bytecode size of the method; that is the original size of the bytecode. So if you make the bytecode of a method larger than needed, it can prevent inlining. And inlining is one of the most crucial optimizations because it provides the ability for further optimizations as we'll see.

Benchmark

The following JMH benchmark will determine if there is any performance impact.

package org.sample;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.CompilerControl;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OperationsPerInvocation;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

import java.util.concurrent.TimeUnit;

import static org.openjdk.jmh.annotations.CompilerControl.Mode.DONT_INLINE;


@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 1)
@Measurement(iterations = 1)
@Fork(warmups = 1, value = 1)
@OperationsPerInvocation(InlineBenchmark.OPERATIONS)
public class InlineBenchmark {

    public static final int OPERATIONS = 1000 * 1000;

    @State(Scope.Benchmark)
    static public class Triple {
        int a;
        int b;
        int c;

        int getA() {
            return a;
        }

        int getB() {
            return b;
        }

        int getC() {
            return c;
        }
    }

    static int large(Triple triple) {
        int a = triple.getA();
        int b = triple.getB();
        int c = triple.getC();
        return a + b + c;
    }

    static int small(Triple triple) {
        return triple.getA() + triple.getB() + triple.getC();
    }

    @CompilerControl(DONT_INLINE)
    @Benchmark
    @Fork(jvmArgs = "-XX:FreqInlineSize=20")
    public int benchmark_small(Triple triple) {
        int v = 0;
        for (int k = 0; k < OPERATIONS; k++) {
            v = small(triple);
        }
        return v;
    }

    @CompilerControl(DONT_INLINE)
    @Benchmark
    @Fork(jvmArgs = "-XX:FreqInlineSize=20")
    public long benchmark_large_with_low_inline_size(Triple triple) {
        int v = 0;
        for (int k = 0; k < OPERATIONS; k++) {
            v = large(triple);
        }
        return v;
    }

    @CompilerControl(DONT_INLINE)
    @Benchmark
    @Fork(jvmArgs = "-XX:FreqInlineSize=21")
    public long benchmark_large_with_high_inline_size(Triple triple) {
        int v = 0;
        for (int k = 0; k < OPERATIONS; k++) {
            v = large(triple);
        }
        return v;
    }
}

There are 2 important methods. The 'small' method that does not use temporary local variables:

    static int small(Triple triple) {
        return triple.getA() + triple.getB() + triple.getC();
    }

And the 'large' method that does use temporary local variables:

    static int large(Triple triple) {
        int a = triple.getA();
        int b = triple.getB();
        int c = triple.getC();
        return a + b + c;
    }

There are 3 benchmark methods:

benchmark_short_sum: this will call the 'small' method in a loop. It is configured with the same inline size as the benchmark_long_sum_with_low_inline_size.
benchmark_long_sum_with_low_inline_size: this will call the 'large' method in a loop. The inline size has been set just above the bytecode size of the method. This will disable the inlining of the 'large' method.
benchmark_long_sum_with_high_inline_size: this will call the 'large' method in a loop. The inline size has been set to the bytecode size of the method. This will enable the inlining of the 'large' method.

If we look at the bytecode for the 'large' and 'small' method, we can see that the size of the bytecode for the 'large' method is 21 bytes and for the 'small' method it is 15.

  static int large(org.sample.InlineBenchmark$Triple);
    Code:
       0: aload_0
       1: invokevirtual #2                  // Method org/sample/InlineBenchmark$Triple.getA:()I
       4: istore_1
       5: aload_0
       6: invokevirtual #3                  // Method org/sample/InlineBenchmark$Triple.getB:()I
       9: istore_2
      10: aload_0
      11: invokevirtual #4                  // Method org/sample/InlineBenchmark$Triple.getC:()I
      14: istore_3
      15: iload_1
      16: iload_2
      17: iadd
      18: iload_3
      19: iadd
      20: ireturn

  static int small(org.sample.InlineBenchmark$Triple);
    Code:
       0: aload_0
       1: invokevirtual #2                  // Method org/sample/InlineBenchmark$Triple.getA:()I
       4: aload_0
       5: invokevirtual #3                  // Method org/sample/InlineBenchmark$Triple.getB:()I
       8: iadd
       9: aload_0
      10: invokevirtual #4                  // Method org/sample/InlineBenchmark$Triple.getC:()I
      13: iadd
      14: ireturn

So there is definitely a difference because the javac did not optimize the code (this is the task of the JIT).

In the above benchmark, the benchmark_large_with_high_inline_size has been configured with FreqInlineSize=21, which allows the 'large' method to be inlined and the benchmark_large_with_low_inline_size has been configured with FreqInlineSize=20 which prevents the 'large' method from being inlined.

Results:

Benchmark                                              Mode  Cnt   Score   Error  Units
InlineBenchmark.benchmark_large_with_high_inline_size  avgt       ≈ 10⁻⁶          ns/op
InlineBenchmark.benchmark_large_with_low_inline_size   avgt        1.306          ns/op
InlineBenchmark.benchmark_small                        avgt       ≈ 10⁻⁶          ns/op

As can be seen, benchmark_large_with_high_inline_size is a lot faster than the benchmark_large_with_low_inline_siz. We can also see that the 'small' method could still be inlined with a FreqInlineSize=20, where the 'large' method could not.

The reason why benchmark_large_with_high_inline_size and benchmark_small are so fast is because of inlining. The JIT could see what is happening inside the method in the loop. This prodived the opportunity for further optimization. In this case, there is no reason to execute the logic of the inlined method 1M times since the result will be the same for every iteration; just 1 time is sufficient.

Conclusions

Temporary assigning parameters to local variables can reduce performance when the inline threshold of the bytecode size of the method is exceeded. If the inline threshold isn't exceeded, there doesn't seem to be a performance impact.

This is not an excuse to remove all temporary assignments to local variables because local variables can increase clarity. Only when you have determined that a method is a bottleneck, you should optimize. Otherwise, clean code is preferred above premature optimizations.

dinsdag 9 november 2021

C++ atomic doesn't imply multi-copy atomicity

One thing that recently confused me, is that a C++ atomic with its atomicity guarantee does not guarantee multi-copy atomicity. In this post, I want to shed some light on the situation. I'm not a C++ expert, but I'm interested in hardware and software-level memory models.

Multi-copy atomicity

An access is multi-copy atomic (also called store/write atomicity) when there is a single logical moment where a store issued by some CPU will become visible to all other CPUs. It is allowed that the issuing CPU can see the store early e.g. due to store to load forwarding and therefore the name multi-copy atomic is a bit misleading.

The consequence of multi-copy atomicity is that stores issued by different CPUs to different addresses will be seen by other CPUs in the same order. Therefore multi-copy atomicity guarantees that some total order over the stores exists. Examples of multi-copy atomic systems are X86 with its TSO (Total Store Order) memory model and ARMv8.

When a single logical moment does not exist, the access is called non-multi-copy atomic. Some causes of accesses to become non-multi-copy atomic:

the store buffers are shared with a strict subset of the CPUs
a CPU that commits a store to the cache, doesn't wait for the cache line to be invalidated on all other CPUs

An example of a non-multi-copy atomic system is the PowerPC. The ARMv7 specification allows it as well, but all ARMv7 implementations are multi-copy atomic and to simplify the memory model, the specification was updated in ARMv8 to multi-copy atomic.

IRIW

The litmus test for multi-copy atomicity is the IRIW litmus test: independent reads of independent writes. The test is shown below:

CPU1:
   A=1
    
CP2:
   B=1
   
CPU3:
   r1=A
   [LoadLoad]
   r2=B
   
 CPU4:
   r3=B
   [LoadLoad]
   r4=A

Could it be that r1=1,r2=0,r3=1,r4=0? So could it be that the stores to A and B are seen in different orders? If accesses are multi-copy atomic, then this can't happen because stores to different addresses issued by different CPUs need to be seen in the same order. But when accesses are non-multi-copy atomic this can happen because stores issued by different CPUs to different addresses can be seen in different orders.

C++ atomic

With C++11 a new memory model was introduced including atomics. For this post, I'll only look at the loads/stores operations on the atomic and ignore RMW operations like compare_exchange_weak/strong. A load/store provides the following guarantees:

atomicity: the operations are indivisible and you don't end up with a value that is the combination of multiple reads/writes
synchronization: depending in the memory_order it will order surrounding loads/stores

By default all loads/stores use memory_order_seq_cst.

The problem in my understanding was that I assumed that multi-copy atomicity was part of the atomicity guarantee of the atomic independent of the memory_order. This is incorrect.

Only in the case of memory_order_seq_cst, multi-copy atomicity is implied because sequential consistency requires that some total order over all loads/stores exists and hence a total order over the stores exists. Therefore if the IRIW example would be implemented using memory_order_seq_cst, then the observation isn't possible.

But when a memory_order below memory_order_seq_cst is used, multi-copy atomicity is not implied. And as a consequence, if the IRIW example would be implemented using memory_order_release stores and memory_order_acquire loads, the given observation is possible.

Conclusion

Multi-copy atomicity and atomicity are different concepts and atomicity doesn't imply multi-copy atomicity. They are overloaded terms and it is very easy to get confused.

Update

It seems there are conflicting single-copy atomicity definitions as can be seen in this Twitter discussion. In some academic papers, single-copy atomic is a stricter version of multi-copy atomic whereby even the issuing CPU needs to see the store in the same order. But in the ARM and PowerPC reference manuals, it effectively means that when you load/store some value, you don't end up with a value that is a mix of multiple loads/stores. It doesn't say anything about the order of stores issued by different CPUs to different addresses.

The original version of this post was based on the stricter version, but it seems that the ARM/PowerPC definition is more widely used and therefore I updated the post. I would like to thank Alexey Shipilev and Travis Downs to point out the conflict.

dinsdag 20 juli 2021

TSO and IBM System/370

[Document is currently being updated due to new insights] The IBM System/370 (IBM 370) is mainframe from the 70s.

You would probably wonder why I would discuss a machine a few people have heard of and even fewer people will ever need to work with. To understand memory models like those from Java and C++ and to understand performance implications, I find it useful to have some insights into hardware memory models, how they are implemented, and especially how they are related. The reason why the IBM 370 is so interesting is that it shows a very nice transistion from sequential consistency (SC) to total store order (TSO).

TSO is the memory model that is used on the X86, but there are ISA's like the SPARC V8/v9 that offer TSO. A performance advantage of TSO over sequential consistency (SC) is that when a CPU does a store and there is a write miss, the CPU needs to wait for the cache line to be invalidated on all CPUs that use this cache line before it can write to the cache-line or it needs to wait for main memory. As a consequence, the performance of the CPU is reduced because it can't execute other instructions since this could lead to loads/stores being performed out of order in the memory order. Effectively it isn't able to hide the memory latency.

Instead of blocking, since non-speculative stores are going to be written to the cache anyway, the stores are written to a store buffer that sits between the CPU and the cache. The stores in the store buffer are in program order and will be committed to the cache in program order.

Store/Load reordering

Because stores are buffered, this can lead to store being reordered, in the memory order, with a load to a different address. This is demonstrated with the load buffering litmus test.

Initial:
   X=0, Y=0

CPU1:
   X=1
   r1=Y

CPU2:
   Y=1
   r2=X

Can it be that r1=0, r2=0?

With SC this can't happen because no reordering is allowed (PO needs to be preserved in the memory order) and therefore an earlier load can't be reordered with a later store to a different address. But with TSO this can happen due to store buffers; even though the store retires before the load retires, the store is placed in the memory order after the load was placed in the memory order.

So with SC, the memory order will respect the PO of each thread; in other words it will be consistent with all 4 fences [LoadLoad][LoadStore][StoreLoad][StoreStore]. But since with TSO an older store can be reordered with a newer load to a different address, the memory order will only be consistent with preserved program order of each CPU: that is [LoadLoad][LoadStore][StoreStore] since the [StoreLoad] fence is be dropped.

Load/Store reordering

Is it allowed for an earlier load to be reordered with a later store to a different address? This can be expressed using the load buffering litmus test:

Initial:
   X=0, Y=0

CPU1:
  r1=X
  Y=1
  
CPU2:
   r2=Y
   X=1

Is the outcome r1=1 and r2=1 possible?

With TSO this isn't allowed to reorder an older load with newer store to a different address. Think about it; the load is globally performed (read from cache) before it retires. The store will retire after the load and the store will only be globally performed (written to cache) after it retires. So an earlier load to a different address will be performed in the memory order before a later store to a different address.

Load/Load and Store/Store reordering

Is it allowed for an earlier load to be reordered with a later load to a different address? And the same goes for 2 stores? This is expressed using the message passing litmus test:

Initial:
   X=0, Y=0

CPU1:
  X=1
  Y=1

CPU2:
  r1=Y
  r2=X

Can it be that r1=1 and r2=0?

With TSO this isn't allowed because it will preserve [LoadLoad] and [StoreStore]. The Intel X86 does make use of speculative out-of-order execution of loads, but if it detects a potential reordering, it will flush the pipeline and try again. This situation is called a memory order violation.

Store to Load forwarding (STLF)

One difference between SC and TSO is that TSO drops the [StoreLoad] fence. But there is another very important difference, with SC every load/store is atomic (such a store is called single copy atomic) and you get the guarantee that at least one total order exist.

TSO doesn't provide single copy atomic stores due to the store buffer. So imagine the following program:

CPU1:
   A=1
   r1=A

If the load would not look in the store buffer for the store, the CPU would not see its own store. So with TSO, the CPU can look into the store buffer before the CPU is committed to the cache. This is called store to load forwarding (STLF). The problem is that the CPU can now see its own store before other CPUs can. Such a store is called a multi-copy atomic since there are 2 copies; one in the store buffer and one in the cache.

This can lead to some weird behavior as is demonstrated with the following test:

Initial:
   X=0, Y=0

CPU1:
   X=1
   r1=X
   r2=Y

CPU2:
   Y=1
   r3=Y
   r4=X

Can it be that r1=1, r3=1, and r2=0 and r4=0? So could it be that the r2=Y jumps before r1=X? And could r4=X jump before r3=Y? With TSO this is allowed because X=1 could be globally performed after r2=Y due to store buffers. r1=X can only be globally performed after X=1 is globally performed, so this means that r1=X is globally performed after r2=Y is globally performed. So effectively the store of X in the store buffer carries the load of X in the global memory order after the load of Y.

Keep in mind that globally performed isn't the same as executed because the r1=X could very well be executed before r2=Y.

The consequence is that in the memory order, X=1 happened before r1=X. But the only way to get this outcome is if r1=X is performed before X=1. This means that the store happened before the load and the load happened before the store; so we have a cycle on the same address. The consequence is that the memory order can't always order loads/stores to the same address and the memory order doesn't provide a total order over all loads/stores. This is a direct consequence of mult-copy store atomicity.

Total order over stores

Luckily, multi-copy store atomicity will still define a total order over the stores; so there is still a single moment in time where the store comes globally visible (when it commits to the cache). That is how TSO gets its name. So it will guarantee that at least 1 total order exist over stores issued by different CPU's to different addresses. This can be demonstrated with the Independent read of independent writes (IRIW) litmus test:

Initial:
  X=0, Y=0
  
CPU1:
  X=1
  
CPU2:
  Y=1
  
CPU3:
  r1=X
  r2=Y
  
CPU4:
  r3=Y
  r4=X

Can it be that r1=1, r2=0, r3=1, r4=0; so could it be that the CPUs see the changes to different addresses issued by different CPUs in different orders? With TSO this isn't allowed because of the total order over the stores.

IBM 370

So how does the IBM 370 fit into the picture? The IBM 370 sits exactly between SC and TSO. As with SC, IBM 370 requires a total order over all loads/stores. And as with TSO, older stores can be reordered to newer loads to a different address; so the memory order will preserve [LoadLoad][LoadStore][StoreStore].

If there is a store followed by a load to the same address, it requires that the store becomes globally visible before the load can be performed. What is the point if the same value is returned for the load as with TSO? It will prevent the reordering of an earlier load that can be satisfied using STLF and a later load that need to come from the cache and this will ensure that the store is single-copy store atomic instead of multi-copy store atomic.

I'll explain it using the previous example.

Initial:
   X=0, Y=0

CPU1:
   X=1
   r1=X
   r2=Y

CPU2:
   Y=1
   r3=Y
   r4=X

With TSO, the r1=X and r2=Y can be reordered due to STLF. But with IBM 370, the load of r1=X can only be globally performed after the X=1 is globally performed and the load of r2=Y can only be globally performed after r1=X is globally performed, so this forces the loads to be performed in order. And as a consequence we can't end up with r1=1, r2=0, r3=1,r4=0.

For more information check Shared Memory Consistency Models: A Tutorial.

Simple thoughts & complex stuff