woensdag 21 juli 2021

What is coherence?

The goal of this post is to explain what coherence means.

Coherence is a memory model just like sequential consistency (SC) or total store order (TSO). The big difference is that regular memory models (also known as memory consistency models) define an ordering over loads/stores to different locations, while coherence only defines an ordering per location; it doesn't provide any ordering between different locations.

Coherence is one of the most fundamental memory models because quite a few other memory models require coherence, e.g. SC/TSO. In other words, every SC/TSO execution is always coherent, but not all coherent executions are SC/TSO. So since SC/TSO is built on top of coherence, to understand them, you need to understand coherence.

Definition

For an execution to be coherent, the results of that execution should be indistinguishable from a different execution that has a memory order with the following properties:
  1. total order per location
  2. a load returns the value of the most recent store before it in the memory order
  3. consistent with the program order (PO) of each CPU.
So the actual memory order isn't relevant; we just need to be able to pretend an acceptable memory order exists that explains the result of the execution. This way of defining the memory order is very similar to that of sequential consistency.
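The second property can be sketched as a small checker. This is a hypothetical illustration (the class and encoding are mine, not from any real API): operations on a single location are encoded as int arrays {isStore (0/1), value}, and we verify that every load in a candidate memory order returns the most recent store before it.

```java
import java.util.List;

class PerLocationCheck {
    // Checks property 2 for one location: every load must return the value
    // of the most recent store before it in the given memory order.
    static boolean loadsSeeMostRecentStore(List<int[]> memoryOrder, int initialValue) {
        int mostRecent = initialValue;
        for (int[] op : memoryOrder) {
            boolean isStore = op[0] == 1;
            if (isStore) {
                mostRecent = op[1];   // a store updates the currently visible value
            } else if (op[1] != mostRecent) {
                return false;         // a load saw something other than the latest store
            }
        }
        return true;
    }
}
```

For example, the order [store 1, load 1] passes the check, while [store 1, load 0] fails: the load returns a value that is not the most recent store.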

A different definition is using the coherence invariants:
  1. Single-Writer, Multiple-Reader (SWMR) invariant: for any location at any given time, either a single core can write (and read) it, or one or more cores can read it. This creates a series of read-write and read-only epochs.
  2. Data-Value invariant: the value of a location at the start of an epoch is the same as the value at the end of its most recent read-write epoch.

Total order per location

A total order over all loads and stores per address means, in simple terms, that everyone agrees on the order in which the loads/stores to that address happened (nobody disagrees).

This is demonstrated with the following coherence litmus test:
initial:
   A=0
CPU1:
   A=1
CPU2:
   A=2
CPU3:
   r1=A
   r2=A
CPU4:
   r3=A
   r4=A
Is it possible to end up with r1=1, r2=2, r3=2, r4=1? In simpler terms, can it be that CPUs see the changes in different orders? No, because then there would be no total order: CPU3 believes that store A=1 happened before store A=2 and CPU4 believes exactly the opposite. In mathematical terms it would mean there is a cycle in the order, and cycles lead to problems due to causality loops.
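The "no disagreement" argument can be brute-forced for this litmus test. The following is a hypothetical sketch (class and method names are mine): a candidate memory order for location A is one of the two permutations of the stores, and each reading CPU's loads must be consistent with a single forward scan of that order.

```java
class LitmusA {
    // Can a CPU observe (r1, r2) by scanning values [initial 0, then stores in order]?
    static boolean cpuOk(int[] storeOrder, int r1, int r2) {
        int[] vals = {0, storeOrder[0], storeOrder[1]}; // values visible over time
        for (int i = 0; i < vals.length; i++)
            for (int j = i; j < vals.length; j++)       // second load at or after the first
                if (vals[i] == r1 && vals[j] == r2) return true;
        return false;
    }

    // Is there a single total order of the two stores that explains both CPUs?
    static boolean explainable(int r1, int r2, int r3, int r4) {
        int[][] orders = {{1, 2}, {2, 1}};
        for (int[] o : orders)
            if (cpuOk(o, r1, r2) && cpuOk(o, r3, r4)) return true;
        return false;
    }
}
```

Running this for the outcome above: (1, 2, 2, 1) is not explainable by either store order, while (1, 2, 1, 2) is.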

A load will return the value...

A load needs to return the value of the most recent store before it in the memory order. To demonstrate this behavior, I'll use the following coherence litmus test:
initial:
   A=0
CPU1:
   A=1
CPU2:
   r1=A
   r2=A
Can it be that we end up with r1=1 and r2=0? In other words, could we first read a newer value and later read an older one? With coherence this isn't allowed: load r1=A has seen the store A=1, and since load r1=A is ordered before load r2=A, r2=A also needs to see store A=1.

If you have a close look at the definition of coherence, you will notice that there are no real-time requirements, so coherence doesn't need to respect the real-time order of loads/stores, similar to SC and unlike linearizability. In simple terms: a load doesn't need to see the most recently stored value. E.g. if we measure that the store A=1 happened 1 minute before the load r1=A (on a different CPU), the load r1=A is still allowed to see the value 0. Even though the load is ordered after the store in real time, in the memory order it is ordered before the store. For the sake of coherence, loads/stores issued by a single CPU can be skewed in time as long as they are not reordered with other loads/stores to the same address. Whether loads/stores can be reordered with loads/stores to different addresses depends on the consistency model.

On a hardware level, there is a bound on the amount of skewing that is possible. Store buffers have a finite capacity so eventually, a change needs to become visible. On a compiler level, things are more tricky. The canonical example is a thread that stops when a flag is set.
boolean stop = false;

void run(){
    while(!stop){
        ...
    }
}
The compiler could hoist the stop condition out of the loop because in the loop it doesn't change. From a coherence perspective, this isn't a problem because reads are just skewed to an earlier time.
if(!stop){
   while(true){
      ...
   }
}
This is where the compiler needs help: we must indicate that the stop variable should become visible instead of being skewed indefinitely. In Java this can be fixed by making stop volatile; in C++ you would use an atomic.
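In Java the fix looks like this (a minimal sketch; the Worker class and requestStop method are my own illustration):

```java
class Worker implements Runnable {
    // volatile forces every iteration to re-read the flag, so the compiler
    // cannot hoist the check out of the loop and skew the load indefinitely.
    private volatile boolean stop = false;

    public void requestStop() { stop = true; }

    @Override
    public void run() {
        while (!stop) {
            // ... do work ...
        }
    }
}
```

With a plain (non-volatile) field, the JIT is allowed to compile this loop into the hoisted form shown above, and the thread may never observe the update.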

Coherence is more than cache-coherence

Modern CPUs have coherent caches, as most of you will know. But coherence can't be solved at the level of caches alone, because either the CPU or the compiler could mess things up. In short: CPUs and compilers need to guarantee that loads/stores to the same address don't get reordered (or that nobody can observe the reordering).

Modern Intel processors allow for out of order execution using the Tomasulo algorithm. But what prevents the following 2 stores to commit to the cache out of order?
CPU1:
   A=1
   A=2
Instructions are issued to the reorder buffer (ROB) in PO. If the instruction is a store, a slot is allocated in the store-buffer and this will also be done in PO. And stores from the store-buffer will commit to the cache in order, so stores will end up in the cache in PO.

Modern Intel processors also make use of load-buffers and lockup-free caches; so instead of blocking the CPU in case of a read miss, the CPU will just place the load-request in a load buffer and keep executing other instructions from the ROB that are ready. Once the load completes, the CPU will continue with the load instruction. So loads are speculatively executed out of order. But what prevents 2 loads to the same address from being reordered if they can be executed out of order?
CPU1:
   r1=A
   r2=A
Just as with the store buffer, a load is assigned a slot in the load buffer in PO. It could be that load r2=A completes first, and as long as load r1=A eventually sees the same value, all is good. But if load r2=A completes while the earlier load r1=A is still pending and a store to A is detected, the pipeline is flushed and the instructions are retried. This is called a memory order violation. A pipeline flush is not only triggered for a coherence violation; it is also triggered for a consistency violation. If there is an earlier pending load of X and a later completed load of Y, and a store to Y is detected, a pipeline flush is also triggered. This is because TSO doesn't allow loads to be reordered.

Some of you might know the Java Thread.onSpinWait() method that has been added by Gil Tene. It is useful when a thread needs to wait by spinning on some condition, e.g.:
volatile boolean eventNotReceived = true;

void run(){
   while(eventNotReceived);
   processEvent()
}
What happens is that the load buffer will be filled with speculatively executed loads of 'eventNotReceived'. Probably there will be a later load that has already completed by reading the value from the cache while an earlier one is still pending. But as soon as eventNotReceived is set to false, we end up with a memory order violation. So there is a high chance that every time you exit the loop, you run into this performance problem!

The Thread.onSpinWait solves this problem:
while(eventNotReceived){
   Thread.onSpinWait();
}
processEvent()
On the X86 it translates to the PAUSE instruction, which causes the pipeline to delay the execution of the next instruction and thereby prevents the pipeline from being filled with speculative loads. It also has other nice side effects, like lowering power consumption and giving more resources to the hyper-sibling. ARM has something similar, called YIELD.

And of course, a compiler should not violate coherence. Compilers can move loads and stores: perhaps they merge multiple stores into a single store, or merge multiple loads into a single load by keeping the value in a register. Other cores can't see that, so nobody will disagree with what has been done. But compilers can't reorder loads and stores to the same location (unless it can't be observed). And coherence doesn't require that you access cache/memory for every load/store, because the real-time order doesn't need to be preserved. So coherence itself doesn't rule out compiler optimizations; the restrictions come from consistency, which needs to prevent certain orderings between loads/stores to different addresses.

Can a system with a cache be coherent without cache-coherence? It can't, because if the cache weren't coherent, consistency would be violated (assuming the consistency model requires coherence). So for people who believe that the JMM can be implemented correctly on top of an incoherent cache: let me burst that bubble. If you have the JMM and your system uses a cache, the cache needs to be coherent. I believe the key misunderstanding is that people think coherence requires real-time guarantees and therefore rules out many compiler optimizations. Another thinking error is believing that coherence requires a 'flush' to main memory; this is incorrect because caches on modern CPUs are always coherent.

Cache-coherence implementation

Caches on modern CPUs are always coherent. I'm not going to dig too much into the details of cache-coherence implementations; enough has been written about that. But generally the idea is that before a store can be placed in the cache, all copies of the cache-line need to be revoked on all other CPUs using some MESI-like protocol. Before the store can be committed, it needs to wait until the other CPUs have acknowledged the invalidation. Generally there are 2 approaches to cache-coherence: snooping for a relatively small number of cores, and directory-based for a larger number of cores. Due to the way cache-coherence is implemented, it is possible that cache implementations do respect the real-time order of requests.
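The SWMR idea behind such a protocol can be illustrated with a toy MSI-like state machine for a single cache line. This is a hypothetical sketch of my own, not a real protocol implementation: before a core may write the line, every other core's copy is invalidated.

```java
import java.util.HashMap;
import java.util.Map;

class ToyCoherence {
    enum State { MODIFIED, SHARED, INVALID }
    // Per-core state of one cache line.
    final Map<Integer, State> lineState = new HashMap<>();

    ToyCoherence(int cores) {
        for (int c = 0; c < cores; c++) lineState.put(c, State.INVALID);
    }

    void read(int core) {
        if (lineState.get(core) == State.INVALID) lineState.put(core, State.SHARED);
    }

    void write(int core) {
        // Invalidate all other copies first; only then does the writer hold the
        // line exclusively. This enforces the single-writer epoch.
        lineState.replaceAll((c, s) -> c == core ? State.MODIFIED : State.INVALID);
    }

    long writers() {
        return lineState.values().stream().filter(s -> s == State.MODIFIED).count();
    }

    long coresWithCopy() {
        return lineState.values().stream().filter(s -> s != State.INVALID).count();
    }
}
```

After any write there is at most one core with a writable copy, matching the SWMR invariant from the definition section.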

Coherence vs SC

For an execution to be SC, the results of that execution should be indistinguishable from a different execution that has a memory order with the following properties:
  1. A total order over all loads and stores over all locations
  2. A load will return the value of the most recent store before it in the memory order
  3. Consistent with the PO of each CPU

As you can see the difference is that coherence is per location and SC is over multiple locations. Coherence can be seen as SC per location. The difference between SC and coherence is demonstrated with the following message passing litmus test:
initial:
   A=0,B=0
CPU1:
   A=1
   B=1
CPU2:
   r1=B
   r2=A
Can it be that we end up with r1=1 and r2=0? In other words, could either the loads or the stores have been reordered? With SC this isn't allowed because loads/stores need to be placed in the memory order in PO. But coherence only orders loads and stores per location, so it allows this outcome. This is a good example showing that consistency restricts reordering of loads and stores by the CPU/compiler much more than coherence does.
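In Java, the cross-location ordering for this message-passing pattern can be restored with a volatile flag. A minimal hypothetical sketch (class and method names are mine):

```java
class MessagePassing {
    int data = 0;                   // plain field: only coherence applies, per location
    volatile boolean ready = false; // volatile: restores ordering across locations

    void produce() {
        data = 42;    // (1)
        ready = true; // (2) volatile store: (1) cannot be reordered after (2)
    }

    // Returns the observed data, or -1 if the flag has not been seen yet.
    int consume() {
        if (ready) {     // volatile load: the later load cannot float above it
            return data; // guaranteed to see 42 once ready == true
        }
        return -1;
    }
}
```

If both fields were plain, a consumer on another thread could legally observe ready == true and data == 0, exactly the r1=1, r2=0 outcome of the litmus test.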

Literature

For a great read on this topic, and more details about the coherence definition, I can't recommend A Primer on Memory Consistency and Cache Coherence (second edition) enough. And it is offered for free!

Update

The JMM offers SC-DRF and not SC; only when a program is correctly synchronized will it offer SC behavior. I want to confirm that this doesn't allow for more relaxed behavior regarding coherence than plain SC. I think that even for an incorrectly synchronized program, coherence needs to be maintained; otherwise there would not be a causal order due to a causality loop, and the JMM would be violated, since the JMM is defined in terms of coherence and causality, and causality needs to be preserved independent of the correct synchronization of a program. I hope to find time to dig into this in the coming days.

dinsdag 20 juli 2021

TSO and IBM System/370

The IBM System/370 (IBM 370) is a mainframe from the 1970s.
You might wonder why I would discuss a machine few people have heard of and even fewer will ever need to work with. To understand abstract memory models like those of Java and C++ and to understand performance implications, I find it useful to have some insight into hardware memory models, how they are implemented, and especially how they relate to each other.

Total Store Order (TSO) is the memory model used on the X86, but there are other ISAs, like SPARC V8/V9, that offer TSO as well. The problem TSO solves with respect to sequential consistency (SC) is that when a CPU does a write and there is a write miss, it needs to wait for the cache line to be invalidated on all CPUs that share it before it can write to the cache line. As a consequence, the performance of the CPU is reduced because it can't execute other instructions, since that could lead to loads/stores being performed out of order in the global memory order.

So instead of blocking, since non-speculative stores are going to be written to the cache anyway, the stores are written to a store buffer that sits between the CPU and the cache. The stores in the store buffer are in program order and will be committed to the cache in program order.

Store/Load reordering

Because stores are buffered, loads/stores can be applied to the global memory order in a different order than the program order. One reordering that becomes possible is an older store being reordered with a newer load to a different address, as demonstrated by the store buffering litmus test.
Initial:
   X=0, Y=0

CPU1:
   X=1
   r1=Y

CPU2:
   Y=1
   r2=X
Can it be that r1=0, r2=0?

With SC this can't happen because no reordering is allowed and therefore an earlier load can't be reordered with a later store to a different address. But with TSO this can happen due to store buffers; even though the store retires before the load retires, the store could be placed in the memory order after the load was placed in the memory order.
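The mechanism can be shown with a single-threaded simulation of the store buffers. This is a hypothetical sketch of my own: each CPU's store lands in a private buffer while its load reads the still-stale shared memory.

```java
class StoreBufferingDemo {
    // Returns {r1, r2} for the interleaving where both stores are still sitting
    // in their store buffers when the loads execute.
    static int[] run() {
        int x = 0, y = 0; // shared memory (cache)
        int cpu1BufX;     // CPU1's store buffer entry for X
        int cpu2BufY;     // CPU2's store buffer entry for Y

        cpu1BufX = 1;     // CPU1: X=1 enters its store buffer (not globally visible)
        cpu2BufY = 1;     // CPU2: Y=1 enters its store buffer (not globally visible)
        int r1 = y;       // CPU1: r1=Y reads shared memory; CPU2's store not visible yet
        int r2 = x;       // CPU2: r2=X reads shared memory; CPU1's store not visible yet

        x = cpu1BufX;     // the store buffers drain to memory afterwards,
        y = cpu2BufY;     // in program order per CPU
        return new int[]{r1, r2};
    }
}
```

The simulation yields r1=0 and r2=0: both stores retire before the loads in program order, yet both end up after the loads in the global memory order.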

With SC you get a total order over all loads/stores. But because loads and stores to different addresses can be reordered, TSO only provides a total order over all stores, because stores are atomic and not reordered with other stores. The total order over all stores can be demonstrated with the Independent Reads of Independent Writes (IRIW) litmus test:
Initial:
  X=0, Y=0
  
CPU1:
  X=1
  
CPU2:
  Y=1
  
CPU3:
  r1=X
  r2=Y
  
CPU4:
  r3=Y
  r4=X
Can it be that r1=1, r2=0, r3=1, r4=0? In other words, could the CPUs see the changes to different addresses issued by different CPUs in different orders? With TSO this isn't allowed, and this is achieved by making stores atomic. In a future post, I'll explain why this gives a total order over the stores.

Load/Store reordering

Is it allowed for an earlier load to be reordered with a later store to a different address? This can be expressed using the load buffering litmus test:
Initial:
   X=0, Y=0

CPU1:
  r1=X
  Y=1
  
CPU2:
   r2=Y
   X=1
Is the outcome r1=1 and r2=1 possible on TSO? No; an older load can't be reordered with a newer store to a different address. Think about it: the load is globally performed (reads from the cache) before it retires. The store retires after the load, and the store is only globally performed (written to the cache) after it retires. So an earlier load to a different address is globally performed before a later store.

Load/Load and Store/Store reordering

Is it allowed for an earlier load to be reordered with a later load to a different address? And the same goes for 2 stores? This is expressed using the message passing litmus test:
Initial:
   X=0, Y=0

CPU1:
  X=1
  Y=1

CPU2:
  r1=Y
  r2=X
Can it be that r1=1 and r2=0?
With TSO this can't happen because stores can't be reordered with other stores and loads can't be reordered with other loads. The Intel X86 does make use of speculative out-of-order execution of loads, but if it detects a potential reordering, it will flush the pipeline and try again. This situation is called a memory order violation.

Store to Load forwarding (STLF)

In the "Load/Load and Store/Store reordering" section, I explain that 2 loads can't be reordered. Well, I lied because there is a very important edge case.

What happens when there is a store to A, followed by a load from A, followed by a load from B? Once the store to A is added to the store buffer, the load of A has a few options. The first option would be to ignore the store buffer and load from the cache. The big problem is that the load would not see the previous store, and we would end up with a violation of the program order. Since this isn't acceptable, a better approach is to let the load look in the store buffer for a matching store. This is called store-to-load forwarding.
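This buffer-lookup behavior can be modeled directly. The following is a hypothetical sketch of my own (not a real CPU model): a FIFO store buffer per CPU in front of a shared cache, where a load first searches the buffer from youngest to oldest before falling back to the cache.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.Map;

class StoreBufferCpu {
    private final Deque<Map.Entry<String, Integer>> buffer = new ArrayDeque<>();
    private final Map<String, Integer> cache; // shared "memory"

    StoreBufferCpu(Map<String, Integer> cache) { this.cache = cache; }

    void store(String addr, int value) {
        buffer.addLast(Map.entry(addr, value)); // buffered, not yet globally visible
    }

    int load(String addr) {
        // Store-to-load forwarding: take the youngest matching buffered store.
        Iterator<Map.Entry<String, Integer>> it = buffer.descendingIterator();
        while (it.hasNext()) {
            Map.Entry<String, Integer> e = it.next();
            if (e.getKey().equals(addr)) return e.getValue();
        }
        return cache.getOrDefault(addr, 0); // otherwise read the shared cache
    }

    void drainOne() { // commit the oldest store to the cache, in program order
        Map.Entry<String, Integer> e = buffer.removeFirst();
        cache.put(e.getKey(), e.getValue());
    }
}
```

With two CPUs sharing one cache, a CPU that stores X=1 immediately sees its own store via forwarding, while the other CPU still reads the old value until the buffer drains.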

But this leads to an ordering problem because the load of A and B can now be reordered in the global memory order which is demonstrated with the following test:
Initial:
   X=0, Y=0

CPU1:
   X=1
   r1=X
   r2=Y

CPU2:
   Y=1
   r3=Y
   r4=X
Can it be that r1=1, r3=1, and r2=0 and r4=0? So could it be that r2=Y jumps before r1=X? And could r4=X jump before r3=Y? With TSO this is allowed, because X=1 could be globally performed after r2=Y, and r1=X can only be globally performed after X=1 is globally performed, so r1=X could be globally performed after r2=Y. A load can only be globally performed after the store it reads from is globally performed. So effectively the store of X in the store buffer carries the load of X in the global memory order past the load of Y.

Keep in mind that globally performed isn't the same as executed because the r1=X could very well be executed before r2=Y.

IBM 370

So how does the IBM 370 fit into the picture? The IBM 370 provides TSO with one addition: if a store is followed by a load to the same address, the store must become globally visible before the load can be performed. What is the point if the load returns the same value as with TSO? It prevents those 2 loads from being reordered!

I'll explain it using the previous example.
Initial:
   X=0, Y=0

CPU1:
   X=1
   r1=X
   r2=Y

CPU2:
   Y=1
   r3=Y
   r4=X
With TSO, the r1=X and r2=Y can be reordered due to STLF. But with IBM 370, the load of r1=X can only be globally performed after the X=1 is globally performed and the load of r2=Y can only be globally performed after r1=X is globally performed, so this forces the loads to be performed in order. And as a consequence we can't end up with r1=1, r2=0, r3=1,r4=0.

TSO gives a total order over all stores because stores are atomic and can't be reordered with other stores.

With IBM System/370, we also get a total order over all loads because loads are atomic and can't be reordered with other loads. So IBM 370 gives a total order over all loads and a total order over all stores. So it sits nicely between SC and TSO.

For more information check Shared Memory Consistency Models: A Tutorial.

vrijdag 4 augustus 2017

Percentiles and mean

For my work as a performance engineer at Hazelcast, I need to deal with quite a lot of statistics. Unfortunately, statistics has never been my strongest topic; I really wish I had paid more attention in high school. Some basic understanding is critical if you want to do a proper analysis of test data, and in this post I'll zoom into the problem of combining percentiles with means.

For this post I'll focus on latency tests: executing requests at a fixed rate and recording the response times. For latency tests, mean latency and standard deviation are not very useful; you want to look at the percentiles, e.g. p99 or p99.9. In this post we'll focus on the p99.

The test data

For this test we assume the following dataset: there are 200 requests (2 per second) for 100 seconds and the following latency distribution:
  • 190 measurements are 1ms
  • 10 measurements are 400ms: these are equally spread over the 200 requests

p99

The 99th percentile (p99) is 400ms, since 99% of all measured values are equal to or smaller than 400ms. This is easy to see: just sort all items from small to large and jump to the 198th measurement, which is 400ms, since measurements 191 to 200 are 400ms.


p99 per 1 second interval

Having a p99 is very useful, but it won't help you if there are fluctuations (e.g. gc) during the test and you want to understand when this happened. For this you need to know how percentiles behave over time. In this case it is useful to determine the latency distribution on e.g. a 1 second interval instead of the whole distribution of 100 seconds.

To calculate p99 per interval of 1 second for our test we'll get:

  • 10 intervals with 2 measurements; 1ms + 400ms.
  • 90 intervals with 2 measurements; 1ms + 1ms.

If we calculate the p99 for each of these intervals, we get 400ms for the 10 bad intervals and 1ms for the 90 good ones.

Each of these intervals is coupled to a certain time, so if the first bad interval happens 24 seconds after the test starts, you can determine the absolute time and check, for example, the GC log.

Where it goes wrong: combining mean with percentiles

When you look at an interval p99 plot, it feels kind of natural to determine the overall p99 based on the mean of these intervals. In this case the mean p99 would be:
(90*1+10*400)/100=40.9ms
This is an order of magnitude less than the actual p99 of 400ms!

The correct way to determine the p99 based on these intervals, is to merge each of the latency distributions into a single latency distribution, and then determine the p99.
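Both approaches can be sketched on the dataset above: 100 one-second intervals of 2 samples each, where 10 "bad" intervals contain a 400ms sample. This is a minimal illustration of my own, using the nearest-rank method for the percentile:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PercentileDemo {
    // Nearest-rank p99 of a list of latency samples (in ms).
    static double p99(List<Double> samples) {
        List<Double> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int idx = (int) Math.ceil(0.99 * sorted.size()) - 1;
        return sorted.get(idx);
    }

    // The WRONG way: average the per-interval p99 values.
    static double meanOfIntervalP99s(List<List<Double>> intervals) {
        return intervals.stream().mapToDouble(PercentileDemo::p99).average().orElse(0);
    }

    // The correct way: merge all interval distributions, then take the p99.
    static double mergedP99(List<List<Double>> intervals) {
        List<Double> all = new ArrayList<>();
        intervals.forEach(all::addAll);
        return p99(all);
    }
}
```

On this dataset the merged p99 is 400ms, while the mean of the interval p99s is (90*1 + 10*400)/100 = 40.9ms, reproducing the order-of-magnitude error.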

In the above example we played with the mean in combination with interval p99s, but the same error appears if, for example, you have multiple machines generating load and you want to combine the data of these machines. If you determine the p99 using the mean of the p99 of each machine, you make the same calculation error.

I hope that after reading this post, you start to become worried as soon as you see percentiles being combined with mean.

Gil Tene

Gil Tene has an excellent presentation any self-respecting performance engineer should have watched, where he zooms into typical problems including coordinated omission, percentiles, and all kinds of statistics errors, including combining means with percentiles. Gil has also written the excellent HdrHistogram tool to track, for example, latency.

zondag 23 april 2017

JIT and instanceof

One of the things I always wondered about is how expensive an instanceof check is. In most cases this is never a concern, but out of professional curiosity I wondered how well the JIT is able to optimize it. Let's start with the simplest case: an instanceof check using a final class as the second argument. For this we use the following example:
public class InstanceOfCheck {

    public static void main(String[] args) {
        for (int k = 0; k < 100000; k++) {
            doSomething("");
        }
        System.out.println(doSomething(""));
    }

    public static String doSomething(Object o) {
        if (o instanceof String) {
            return "string";
        } else {
            return "nostring";
        }
    }
}
For this post we use the following Java version:
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
I'm too lazy to update the JDK for my MacBook which I hardly use. And the following JVM parameters:
-XX:+UnlockDiagnosticVMOptions 
-XX:PrintAssemblyOptions=intel 
-XX:-TieredCompilation 
-XX:-Inline 
-XX:-BackgroundCompilation 
-XX:CompileCommand=print,*InstanceOfCheck.doSomething
-XX:+PrintCompilation 
This will give us the Assembly of the 'doSomething' method. When I run the above example, I get the following output:
    125    1    b        java.lang.String::indexOf (70 bytes)
    128    2 %  b        java.lang.String::indexOf @ 37 (70 bytes)
    137    3    b        java.lang.String::hashCode (55 bytes)
    192    4    b        sun.nio.cs.UTF_8$Encoder::encode (359 bytes)
    213    5    b        java.lang.String::charAt (29 bytes)
    216    6    b        java.io.UnixFileSystem::normalize (75 bytes)
    220    7    b        java.lang.String::lastIndexOf (52 bytes)
    243    8    b        com.InstanceOfCheck::doSomething (13 bytes)
Compiled method (c2)     244    8             com.InstanceOfCheck::doSomething (13 bytes)
 total in heap  [0x0000000106479e10,0x000000010647a090] = 640
 relocation     [0x0000000106479f30,0x0000000106479f48] = 24
 main code      [0x0000000106479f60,0x0000000106479fc0] = 96
 stub code      [0x0000000106479fc0,0x0000000106479fd8] = 24
 oops           [0x0000000106479fd8,0x0000000106479fe0] = 8
 metadata       [0x0000000106479fe0,0x0000000106479ff0] = 16
 scopes data    [0x0000000106479ff0,0x000000010647a018] = 40
 scopes pcs     [0x000000010647a018,0x000000010647a078] = 96
 dependencies   [0x000000010647a078,0x000000010647a080] = 8
 nul chk table  [0x000000010647a080,0x000000010647a090] = 16
Loaded disassembler from /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/hsdis-amd64.dylib
Decoding compiled method 0x0000000106479e10:
Code:
[Disassembling for mach='i386:x86-64']
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x000000010e77a480} 'doSomething' '(Ljava/lang/Object;)Ljava/lang/String;' in 'com/InstanceOfCheck'
  # parm0:    rsi:rsi   = 'java/lang/Object'
  #           [sp+0x20]  (sp of caller)
  0x0000000106479f60: mov    DWORD PTR [rsp-0x14000],eax
  0x0000000106479f67: push   rbp
  0x0000000106479f68: sub    rsp,0x10           ;*synchronization entry
                                                ; - com.InstanceOfCheck::doSomething@-1 (line 14)

  0x0000000106479f6c: mov    r11d,DWORD PTR [rsi+0x8]  ; implicit exception: dispatches to 0x0000000106479fa1
  0x0000000106479f70: cmp    r11d,0xf80002da    ;   {metadata('java/lang/String')}
  0x0000000106479f77: jne    0x0000000106479f8f
  0x0000000106479f79: movabs rax,0x7955dfce0    ;   {oop("string")}
  0x0000000106479f83: add    rsp,0x10
  0x0000000106479f87: pop    rbp
  0x0000000106479f88: test   DWORD PTR [rip+0xffffffffff73f072],eax        # 0x0000000105bb9000
                                                ;   {poll_return}
  0x0000000106479f8e: ret    
  0x0000000106479f8f: mov    rbp,rsi
  0x0000000106479f92: mov    esi,0xffffffde
  0x0000000106479f97: call   0x0000000106447120  ; OopMap{rbp=Oop off=60}
                                                ;*instanceof
                                                ; - com.InstanceOfCheck::doSomething@1 (line 14)
                                                ;   {runtime_call}
  0x0000000106479f9c: call   0x0000000104fdfe44  ;   {runtime_call}
  0x0000000106479fa1: mov    esi,0xfffffff4
  0x0000000106479fa6: nop
  0x0000000106479fa7: call   0x0000000106447120  ; OopMap{off=76}
                                                ;*instanceof
                                                ; - com.InstanceOfCheck::doSomething@1 (line 14)
                                                ;   {runtime_call}
  0x0000000106479fac: call   0x0000000104fdfe44  ;*instanceof
                                                ; - com.InstanceOfCheck::doSomething@1 (line 14)
                                                ;   {runtime_call}
  0x0000000106479fb1: hlt    
  0x0000000106479fb2: hlt    
  0x0000000106479fb3: hlt    
  0x0000000106479fb4: hlt    
  0x0000000106479fb5: hlt    
  0x0000000106479fb6: hlt    
  0x0000000106479fb7: hlt    
  0x0000000106479fb8: hlt    
  0x0000000106479fb9: hlt    
  0x0000000106479fba: hlt    
  0x0000000106479fbb: hlt    
  0x0000000106479fbc: hlt    
  0x0000000106479fbd: hlt    
  0x0000000106479fbe: hlt    
  0x0000000106479fbf: hlt    
[Exception Handler]
[Stub Code]
  0x0000000106479fc0: jmp    0x000000010646bf60  ;   {no_reloc}
[Deopt Handler Code]
  0x0000000106479fc5: call   0x0000000106479fca
  0x0000000106479fca: sub    QWORD PTR [rsp],0x5
  0x0000000106479fcf: jmp    0x0000000106446d00  ;   {runtime_call}
  0x0000000106479fd4: hlt    
  0x0000000106479fd5: hlt    
  0x0000000106479fd6: hlt    
  0x0000000106479fd7: hlt    
OopMapSet contains 2 OopMaps

#0 
OopMap{rbp=Oop off=60}
#1 
OopMap{off=76}
    247    9 %  b        com.InstanceOfCheck::main @ 2 (32 bytes)
    248    9 %           com.InstanceOfCheck::main @ -2 (32 bytes)   made not entrant
string
After filtering out the non relevant parts, the following remains:
  0x0000000106479f6c: mov    r11d,DWORD PTR [rsi+0x8]  ; implicit exception: dispatches to 0x0000000106479fa1
  0x0000000106479f70: cmp    r11d,0xf80002da    ;   {metadata('java/lang/String')}
  0x0000000106479f77: jne    0x0000000106479f8f
  0x0000000106479f79: movabs rax,0x7955dfce0    ;   {oop("string")}
  0x0000000106479f83: add    rsp,0x10
  0x0000000106479f87: pop    rbp
  0x0000000106479f88: test   DWORD PTR [rip+0xffffffffff73f072],eax        # 0x0000000105bb9000
                                                ;   {poll_return}
  0x0000000106479f8e: ret    
The first instruction loads the class word of the 'o' argument into register r11d, and the second instruction compares it with the String class address. If the address is not equal, we jump to the uncommon trap, since we have only passed String arguments to the 'doSomething' method. So it seems that an instanceof check using a final class is very cheap. And we get exactly the same Assembly as a direct class comparison in the Java code:
    public static String doSomething(Object o) {
        if (o.getClass() == String.class) {
            return "string";
        } else {
            return "nostring";
        }
    }
Which gives the following relevant Assembly:
 0x000000010e0b296c: mov    r11d,DWORD PTR [rsi+0x8]  ;*invokevirtual getClass
                                                ; - com.InstanceOfCheck::doSomething@1 (line 14)
                                                ; implicit exception: dispatches to 0x000000010e0b29b1
  0x000000010e0b2970: cmp    r11d,0xf80002da    ;   {metadata('java/lang/String')}
  0x000000010e0b2977: jne    0x000000010e0b298f  ;*if_acmpne
                                                ; - com.InstanceOfCheck::doSomething@6 (line 14)

  0x000000010e0b2979: movabs rax,0x7955db818    ;   {oop("string")}
  0x000000010e0b2983: add    rsp,0x10
  0x000000010e0b2987: pop    rbp
  0x000000010e0b2988: test   DWORD PTR [rip+0xffffffffff72a672],eax        # 0x000000010d7dd000
                                                ;   {poll_return}
  0x000000010e0b298e: ret    

Non final but concrete class

In the previous example we saw that the JIT perfectly optimized an instanceof check against a final class. But what about a non-final concrete class like LinkedList? Will the JIT be able to optimize that? For this we modify the example:
public class InstanceOfCheck {
    
    public static void main(String[] args) {
        List l = new LinkedList();
        for (int k = 0; k < 100000; k++) {
            doSomething(l);
        }
        System.out.println(doSomething(l));
    }

    public static String doSomething(Object o) {
        if (o.getClass() == LinkedList.class) {
            return "linkedlist";
        } else {
            return "nolinkedlist";
        }
    }
}
If we extract only the relevant Assembly:
  0x000000010307ccec: mov    r11d,DWORD PTR [rsi+0x8]  ;*invokevirtual getClass
                                                ; - com.InstanceOfCheck::doSomething@1 (line 15)
                                                ; implicit exception: dispatches to 0x000000010307cd31
  0x000000010307ccf0: cmp    r11d,0xf8008c83    ;   {metadata('java/util/LinkedList')}
  0x000000010307ccf7: je     0x000000010307cd0f  ;*if_acmpne
                                                ; - com.InstanceOfCheck::doSomething@6 (line 15)

  0x000000010307ccf9: movabs rax,0x7955dd8d0    ;   {oop("nostring")}
  0x000000010307cd03: add    rsp,0x10
  0x000000010307cd07: pop    rbp
  0x000000010307cd08: test   DWORD PTR [rip+0xffffffffff6642f2],eax        # 0x00000001026e1000
                                                ;   {poll_return}
  0x000000010307cd0e: ret    
This Assembly looks very similar to the String version. The JIT sees that we have only passed arguments of type LinkedList, and therefore it optimizes this with a cheap class-address comparison and adds an uncommon trap in case a different argument is passed.

Non concrete class check?

What would happen if we check instanceof using a non-concrete class such as List? In this case it isn't possible to do a cheap class comparison:
public class InstanceOfCheck {
    
    public static void main(String[] args) {
        List l = new LinkedList();
        for (int k = 0; k < 100000; k++) {
            doSomething(l);
        }
        System.out.println(doSomething(l));
    }

    public static String doSomething(Object o) {
        if (o instanceof List) {
            return "list";
        } else {
            return "nolist";
        }
    }
}
When we extract the relevant Assembly we get the following:
  0x000000010807e66c: mov    r11d,DWORD PTR [rsi+0x8]  ; implicit exception: dispatches to 0x000000010807e6a1
  0x000000010807e670: cmp    r11d,0xf8008c83    ;   {metadata('java/util/LinkedList')}
  0x000000010807e677: jne    0x000000010807e68f
  0x000000010807e679: movabs rax,0x7955ddf58    ;   {oop("list")}
  0x000000010807e683: add    rsp,0x10
  0x000000010807e687: pop    rbp
  0x000000010807e688: test   DWORD PTR [rip+0xffffffffff61f972],eax        # 0x000000010769e000
                                                ;   {poll_return}
  0x000000010807e68e: ret    
This is a nice surprise! Because we have only passed LinkedLists as arguments, the instanceof with List is implemented as a cheap pointer comparison against the LinkedList class address. If anything other than a LinkedList is passed, an uncommon trap is executed. So even in this case the JIT is able to optimize!
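To make the semantics concrete: the guard the JIT emits is only valid because LinkedList is the sole type observed so far, while 'o instanceof List' must of course match every List implementation. The following minimal sketch (class and method names are mine, not JIT output) shows what the compiled code must still guarantee:

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class InstanceOfDemo {

    // mirrors the doSomething method from the example above
    static String doSomething(Object o) {
        return (o instanceof List) ? "list" : "nolist";
    }

    public static void main(String[] args) {
        // The compiled guard compares against the LinkedList class only because
        // that is the sole type seen so far; semantically any List must match.
        System.out.println(doSomething(new LinkedList<>())); // list
        System.out.println(doSomething(new ArrayList<>()));  // list (compiled code would hit the uncommon trap)
        System.out.println(doSomething("hello"));            // nolist
    }
}
```

Once the trap fires for a new type, the JIT typically recompiles the method using the richer profile, so correctness is preserved.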

maandag 30 januari 2017

JIT: Field of interface type and single implementation.

In my previous post the topic was devirtualization; an optimization applied by the JIT to reduce the overhead of virtual method calls.

In this post I'll analyze the implications for virtual method calls on a field with an interface-based type and a single loaded implementation of this interface. This is very common in a lot of applications and considered good practice, e.g. to simplify testing by stubbing/mocking the dependency. In production, however, there will often be only a single loaded implementation of the interface.

For this post I'll use the following interface. It has a single method 'size' that returns an arbitrary value.
interface Service {
    int size();
}
There is also a very basic implementation with a 'size' method that can easily be inlined:
class ServiceImpl implements Service {
    private int size = 1;

    @Override
    public int size() {
        return size;
    }
}
And we have the following program:
public class SingleImplementation {

    private final Service service;

    public SingleImplementation(Service service) {
        this.service = service;
    }

    public int sizePlusOne() {
        return service.size() + 1;
    }

    public static void main(String[] args) {
        SingleImplementation f = new SingleImplementation(new ServiceImpl());

        int result = 0;

        for (int k = 0; k < 100_000; k++) {
            result += f.sizePlusOne();
        }

        System.out.println("result:" + result);
    }
}
The focus will be on the 'sizePlusOne' method. The logic isn't very exciting, but it will deliver easy-to-understand Assembly.

For this post we have the following assumptions:

  • we only care about the output of the C2 compiler
  • we are using Java hotspot 1.8.0_91

Assembly

The question is whether there is any price to pay for using an interface as the field type.

To determine this we'll be outputting the Assembly using the following JVM settings:

-XX:+UnlockDiagnosticVMOptions
-XX:PrintAssemblyOptions=intel
-XX:-TieredCompilation
-XX:-Inline
-XX:CompileCommand=print,*SingleImplementation.sizePlusOne
Tiered compilation is disabled since we only care about the C2 Assembly. Inlining is disabled so we can focus on the 'sizePlusOne' method, instead of it being inlined into the 'main' method and the JIT eliminating the whole loop and precalculating the end value.

After running the program the following Assembly is emitted:

Compiled method (c2)     169    8             com.liskov.SingleImplementation::sizePlusOne (12 bytes)
 total in heap  [0x000000010304d0d0,0x000000010304d3a8] = 728
 relocation     [0x000000010304d1f0,0x000000010304d208] = 24
 main code      [0x000000010304d220,0x000000010304d2a0] = 128
 stub code      [0x000000010304d2a0,0x000000010304d2b8] = 24
 oops           [0x000000010304d2b8,0x000000010304d2c0] = 8
 metadata       [0x000000010304d2c0,0x000000010304d2d0] = 16
 scopes data    [0x000000010304d2d0,0x000000010304d300] = 48
 scopes pcs     [0x000000010304d300,0x000000010304d390] = 144
 dependencies   [0x000000010304d390,0x000000010304d398] = 8
 nul chk table  [0x000000010304d398,0x000000010304d3a8] = 16
Loaded disassembler from /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/hsdis-amd64.dylib
Decoding compiled method 0x000000010304d0d0:
Code:
[Disassembling for mach='i386:x86-64']
[Entry Point]
[Constants]
  # {method} {0x00000001cb5b7478} 'sizePlusOne' '()I' in 'com/liskov/SingleImplementation'
  #           [sp+0x20]  (sp of caller)
  0x000000010304d220: mov    r10d,DWORD PTR [rsi+0x8]
  0x000000010304d224: shl    r10,0x3
  0x000000010304d228: cmp    rax,r10
  0x000000010304d22b: jne    0x0000000103019b60  ;   {runtime_call}
  0x000000010304d231: data32 xchg ax,ax
  0x000000010304d234: nop    DWORD PTR [rax+rax*1+0x0]
  0x000000010304d23c: data32 data32 xchg ax,ax
[Verified Entry Point]
  0x000000010304d240: mov    DWORD PTR [rsp-0x14000],eax
  0x000000010304d247: push   rbp
  0x000000010304d248: sub    rsp,0x10           ;*synchronization entry
                                                ; - com.liskov.SingleImplementation::sizePlusOne@-1 (line 12)

  0x000000010304d24c: mov    r11d,DWORD PTR [rsi+0xc]  ;*getfield service
                                                ; - com.liskov.SingleImplementation::sizePlusOne@1 (line 12)

  0x000000010304d250: mov    r10d,DWORD PTR [r12+r11*8+0x8]
                                                ; implicit exception: dispatches to 0x000000010304d289
  0x000000010304d255: cmp    r10d,0x31642b42    ;   {metadata('com/liskov/ServiceImpl')}
  0x000000010304d25c: jne    0x000000010304d274
  0x000000010304d25e: lea    r10,[r12+r11*8]    ;*invokeinterface size
                                                ; - com.liskov.SingleImplementation::sizePlusOne@4 (line 12)

  0x000000010304d262: mov    eax,DWORD PTR [r10+0xc]
  0x000000010304d266: inc    eax                ;*iadd
                                                ; - com.liskov.SingleImplementation::sizePlusOne@10 (line 12)

  0x000000010304d268: add    rsp,0x10
  0x000000010304d26c: pop    rbp
  0x000000010304d26d: test   DWORD PTR [rip+0xfffffffffe7b1d8d],eax        # 0x00000001017ff000
                                                ;   {poll_return}
  0x000000010304d273: ret    
  0x000000010304d274: mov    esi,0xffffffde
  0x000000010304d279: mov    ebp,r11d
  0x000000010304d27c: data32 xchg ax,ax
  0x000000010304d27f: call   0x000000010301b120  ; OopMap{rbp=NarrowOop off=100}
                                                ;*invokeinterface size
                                                ; - com.liskov.SingleImplementation::sizePlusOne@4 (line 12)
                                                ;   {runtime_call}
  0x000000010304d284: call   0x0000000102440e44  ;   {runtime_call}
  0x000000010304d289: mov    esi,0xfffffff6
  0x000000010304d28e: nop
  0x000000010304d28f: call   0x000000010301b120  ; OopMap{off=116}
                                                ;*invokeinterface size
                                                ; - com.liskov.SingleImplementation::sizePlusOne@4 (line 12)
                                                ;   {runtime_call}
  0x000000010304d294: call   0x0000000102440e44  ;*invokeinterface size
                                                ; - com.liskov.SingleImplementation::sizePlusOne@4 (line 12)
                                                ;   {runtime_call}
  0x000000010304d299: hlt    
  0x000000010304d29a: hlt    
  0x000000010304d29b: hlt    
  0x000000010304d29c: hlt    
  0x000000010304d29d: hlt    
  0x000000010304d29e: hlt    
  0x000000010304d29f: hlt    
[Exception Handler]
[Stub Code]
  0x000000010304d2a0: jmp    0x000000010303ff60  ;   {no_reloc}
[Deopt Handler Code]
  0x000000010304d2a5: call   0x000000010304d2aa
  0x000000010304d2aa: sub    QWORD PTR [rsp],0x5
  0x000000010304d2af: jmp    0x000000010301ad00  ;   {runtime_call}
  0x000000010304d2b4: hlt    
  0x000000010304d2b5: hlt    
  0x000000010304d2b6: hlt    
  0x000000010304d2b7: hlt    
OopMapSet contains 2 OopMaps

#0 
OopMap{rbp=NarrowOop off=100}
#1 
OopMap{off=116}

The following code is the relevant Assembly:

  0x000000010304d24c: mov    r11d,DWORD PTR [rsi+0xc]  
    ; loads the 'service' field into r11d.
  0x000000010304d250: mov    r10d,DWORD PTR [r12+r11*8+0x8]
    ; loads the address of the class of the service object into r10d
  0x000000010304d255: cmp    r10d,0x31642b42   
    ; compare the address of that class with the address of the ServiceImpl class
  0x000000010304d25c: jne    0x000000010304d274
    ; if it isn't of type ServiceImpl, jump to the uncommon trap
  0x000000010304d25e: lea    r10,[r12+r11*8]    ;*invokeinterface size
                                                ; - com.liskov.SingleImplementation::sizePlusOne@4 (line 12)

  0x000000010304d262: mov    eax,DWORD PTR [r10+0xc]
    ; copy the size field into eax
  0x000000010304d266: inc    eax      
    ; add 1 to eax
We can conclude that the JIT has removed the virtual call and has even inlined the ServiceImpl.size method. However, we can also see a typeguard. This is a bit surprising because there is only a single Service implementation loaded. If a conflicting class were loaded in the future, class hierarchy analysis should spot this and trigger a deoptimization of the code. Therefore there should be no need for a guard and an uncommon trap.

Abstract class to the rescue

I posted the question why this typeguard was added on the hotspot-compiler-dev mailing list, and luckily Aleksey Shipilev pointed out the cause of the problem: class hierarchy analysis in Hotspot doesn't deal with interfaces correctly.

Therefore I switched to an abstract class to see if the JIT is able to get rid of this typeguard.

abstract class Service {
   abstract int size();
}

class ServiceImpl extends Service {
    private int size = 1;

    @Override
    public int size() {
        return size;
    }
}
If the program is rerun, the following Assembly is emitted:
Compiled method (c2)     186    8             com.liskov.SingleImplementation::sizePlusOne (10 bytes)
 total in heap  [0x000000010563c250,0x000000010563c4d0] = 640
 relocation     [0x000000010563c370,0x000000010563c380] = 16
 main code      [0x000000010563c380,0x000000010563c3e0] = 96
 stub code      [0x000000010563c3e0,0x000000010563c3f8] = 24
 oops           [0x000000010563c3f8,0x000000010563c400] = 8
 metadata       [0x000000010563c400,0x000000010563c420] = 32
 scopes data    [0x000000010563c420,0x000000010563c448] = 40
 scopes pcs     [0x000000010563c448,0x000000010563c4b8] = 112
 dependencies   [0x000000010563c4b8,0x000000010563c4c0] = 8
 nul chk table  [0x000000010563c4c0,0x000000010563c4d0] = 16
Loaded disassembler from /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/hsdis-amd64.dylib
Decoding compiled method 0x000000010563c250:
Code:
[Disassembling for mach='i386:x86-64']
[Entry Point]
[Constants]
  # {method} {0x000000010d964478} 'sizePlusOne' '()I' in 'com/liskov/SingleImplementation'
  #           [sp+0x20]  (sp of caller)
  0x000000010563c380: mov    r10d,DWORD PTR [rsi+0x8]
  0x000000010563c384: shl    r10,0x3
  0x000000010563c388: cmp    rax,r10
  0x000000010563c38b: jne    0x0000000105607b60  ;   {runtime_call}
  0x000000010563c391: data32 xchg ax,ax
  0x000000010563c394: nop    DWORD PTR [rax+rax*1+0x0]
  0x000000010563c39c: data32 data32 xchg ax,ax
[Verified Entry Point]
  0x000000010563c3a0: mov    DWORD PTR [rsp-0x14000],eax
  0x000000010563c3a7: push   rbp
  0x000000010563c3a8: sub    rsp,0x10           ;*synchronization entry
                                                ; - com.liskov.SingleImplementation::sizePlusOne@-1 (line 12)

  0x000000010563c3ac: mov    r11d,DWORD PTR [rsi+0xc]  ;*getfield service
                                                ; - com.liskov.SingleImplementation::sizePlusOne@1 (line 12)

  0x000000010563c3b0: mov    eax,DWORD PTR [r12+r11*8+0xc]
                                                ; implicit exception: dispatches to 0x000000010563c3c3
  0x000000010563c3b5: inc    eax                ;*iadd
                                                ; - com.liskov.SingleImplementation::sizePlusOne@8 (line 12)

  0x000000010563c3b7: add    rsp,0x10
  0x000000010563c3bb: pop    rbp
  0x000000010563c3bc: test   DWORD PTR [rip+0xfffffffffe790c3e],eax        # 0x0000000103dcd000
                                                ;   {poll_return}
  0x000000010563c3c2: ret    
  0x000000010563c3c3: mov    esi,0xfffffff6
  0x000000010563c3c8: data32 xchg ax,ax
  0x000000010563c3cb: call   0x0000000105609120  ; OopMap{off=80}
                                                ;*invokevirtual size
                                                ; - com.liskov.SingleImplementation::sizePlusOne@4 (line 12)
                                                ;   {runtime_call}
  0x000000010563c3d0: call   0x0000000104a40e44  ;*invokevirtual size
                                                ; - com.liskov.SingleImplementation::sizePlusOne@4 (line 12)
                                                ;   {runtime_call}
  0x000000010563c3d5: hlt    
  0x000000010563c3d6: hlt    
  0x000000010563c3d7: hlt    
  0x000000010563c3d8: hlt    
  0x000000010563c3d9: hlt    
  0x000000010563c3da: hlt    
  0x000000010563c3db: hlt    
  0x000000010563c3dc: hlt    
  0x000000010563c3dd: hlt    
  0x000000010563c3de: hlt    
  0x000000010563c3df: hlt    
[Exception Handler]
[Stub Code]
  0x000000010563c3e0: jmp    0x000000010562df60  ;   {no_reloc}
[Deopt Handler Code]
  0x000000010563c3e5: call   0x000000010563c3ea
  0x000000010563c3ea: sub    QWORD PTR [rsp],0x5
  0x000000010563c3ef: jmp    0x0000000105608d00  ;   {runtime_call}
  0x000000010563c3f4: hlt    
  0x000000010563c3f5: hlt    
  0x000000010563c3f6: hlt    
  0x000000010563c3f7: hlt    
OopMapSet contains 1 OopMaps

#0 
OopMap{off=80}
After removing all noise, the following Assembly remains:
  0x000000010563c3ac: mov    r11d,DWORD PTR [rsi+0xc]
    ;; load the 'service' field into register r11d
  0x000000010563c3b0: mov    eax,DWORD PTR [r12+r11*8+0xc]
    ;; load the size field of the service into register eax
  0x000000010563c3b5: inc    eax
    ;; add one to eax
We can immediately see that the typeguard is gone.

Conclusions

If only a single implementation is loaded and the field uses an:
  1. interface-based type, the method gets inlined but a typeguard is added.
  2. abstract-class-based type, the method gets inlined and there is no typeguard.
So in the case of an abstract-class-based field, there is no penalty to pay for applying the Liskov Substitution Principle. In the case of an interface-based field, there is an extra unwanted typeguard.

Note

Please don't go and replace all your interfaces with abstract classes. In most cases code will be slow for other reasons than a simple typeguard, and if the check gets executed frequently, the branch predictor will make the right prediction. Hopefully in the near future the problems in the class hierarchy analysis will be resolved.

woensdag 25 januari 2017

final static boolean & JIT

For this post we are going to look at the cost of having a final static boolean in the code. Such flags can be very useful to enable or disable certain behavior, e.g. tracing, logging, etc. The question is what kind of performance implications they have.

The reason for making this post is that I didn't know the implications and I asked the question on the Mechanical Sympathy Mailinglist. So I would like to thank the people on this mailing list for answering my question.

For this post we have the following assumptions:

  • we only care about the output of the C2 compiler
  • we are using Java hotspot 1.8.0_91

Constant expression

Let's start with the most basic case where the final static field is initialized using a constant expression:
public class StaticFinal_ConstantExpression {

    public static void main(String[] args) {
        int result = 0;
        for (int k = 0; k < 100_000; k++) {
            result += doMath(k);
        }
        System.out.println(result);
    }

    final static boolean ENABLED = true;

    public static int doMath(int a) {
        if (ENABLED) {
            return a + 1;
        } else {
            return a - 1;
        }
    }
}
The actual logic in 'doMath' isn't terribly exciting. Its main purpose is to provide easy-to-understand bytecode and Assembly.

When we check the bytecode for the 'doMath' method using 'javap -c StaticFinal_ConstantExpression.class' we get the following:

  public static int doMath(int);
    Code:
       0: iload_0
       1: iconst_1
       2: iadd
       3: ireturn
If we converted this back to Java, we would get:
public static int doMath(int a) {
 return a + 1;
}
Javac has propagated the ENABLED constant and completely removed the dead code. We don't even need to look at the Assembly.

Be careful with final statics and constant expressions; if the value is changed and one or more classes that read this value are not recompiled, they will not see the new value.
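If this pitfall is a concern, one common workaround (my suggestion, not from the post) is to keep the field final but initialize it with a method call. Per the JLS, that makes it a non-constant expression, so reading classes load the field at runtime instead of baking the literal into their own bytecode:

```java
public class StaticFinal_NoFolding {

    // Boolean.parseBoolean is a method call, so this is NOT a compile-time
    // constant expression: javac emits a getstatic in reading classes instead
    // of inlining the literal value at compile time.
    static final boolean ENABLED = Boolean.parseBoolean("true");

    public static void main(String[] args) {
        System.out.println(ENABLED); // true
    }
}
```

As the next section shows, the JIT will still fold such a field into the compiled code, so this flexibility costs nothing at runtime.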

Non constant expression

In the previous example there was a hard-coded constant value for ENABLED. In practice you often want something more flexible, e.g. using some kind of system property. So let's change the ENABLED initialization so that it gets its value from the system property 'enabled'.
public class StaticFinal_NonConstantExpression {

    public static void main(String[] args) {
        int result = 0;
        for (int k = 0; k < 100_000; k++) {
            result += doMath(k);
        }
        System.out.println(result);
    }

    final static boolean ENABLED = Boolean.getBoolean("enabled");

    public static int doMath(int a) {
        if (ENABLED) {
            return a + 1;
        } else {
            return a - 1;
        }
    }
}
And if we display the relevant bytecode using 'javap -c StaticFinal_NonConstantExpression.class', we get the following.
  static final boolean ENABLED;
 
  public static int doMath(int);
    Code:
       0: getstatic     #6                  // Field ENABLED:Z
       3: ifeq          10
       6: iload_0
       7: iconst_1
       8: iadd
       9: ireturn
      10: iload_0
      11: iconst_1
      12: isub
      13: ireturn

  static {};
    Code:
       0: ldc           #7                  // String enabled
       2: invokestatic  #8                  // Method java/lang/Boolean.getBoolean:(Ljava/lang/String;)Z
       5: putstatic     #6                  // Field ENABLED:Z
       8: return
We can see that 'doMath' still contains the check and the logic for both branches. Javac has not made any optimizations, since it doesn't know what value ENABLED will have at runtime.

Let's go a level deeper and see what kind of Assembly we get. To display the Assembly, we'll use the following parameters:

-XX:+UnlockDiagnosticVMOptions
-XX:PrintAssemblyOptions=intel
-XX:-TieredCompilation
-XX:-Inline
-XX:CompileCommand=print,*.doMath
-Denabled=true
Tiered compilation is disabled since we are only interested in the C2 output. Inlining is disabled to prevent the 'doMath' method from getting inlined into the main loop. We also set the 'enabled' system property to true.

When we run we get the following Assembly

Compiled method (c2)     248    8             com.constant_folding.StaticFinal_NonConstantExpression::doMath (14 bytes)
 total in heap  [0x00000001083a7a90,0x00000001083a7c60] = 464
 relocation     [0x00000001083a7bb0,0x00000001083a7bb8] = 8
 main code      [0x00000001083a7bc0,0x00000001083a7be0] = 32
 stub code      [0x00000001083a7be0,0x00000001083a7bf8] = 24
 oops           [0x00000001083a7bf8,0x00000001083a7c00] = 8
 metadata       [0x00000001083a7c00,0x00000001083a7c08] = 8
 scopes data    [0x00000001083a7c08,0x00000001083a7c18] = 16
 scopes pcs     [0x00000001083a7c18,0x00000001083a7c58] = 64
 dependencies   [0x00000001083a7c58,0x00000001083a7c60] = 8
Loaded disassembler from /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/hsdis-amd64.dylib
Decoding compiled method 0x00000001083a7a90:
Code:
[Disassembling for mach='i386:x86-64']
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x00000001d08e24b8} 'doMath' '(I)I' in 'com/constant_folding/StaticFinal_NonConstantExpression'
  # parm0:    rsi       = int
  #           [sp+0x20]  (sp of caller)
  0x00000001083a7bc0: sub    rsp,0x18
  0x00000001083a7bc7: mov    QWORD PTR [rsp+0x10],rbp  ;*synchronization entry
                                                ; - com.constant_folding.StaticFinal_NonConstantExpression::doMath@-1 (line 16)

  0x00000001083a7bcc: mov    eax,esi
  0x00000001083a7bce: inc    eax                ;*iadd
                                                ; - com.constant_folding.StaticFinal_NonConstantExpression::doMath@8 (line 17)

  0x00000001083a7bd0: add    rsp,0x10
  0x00000001083a7bd4: pop    rbp
  0x00000001083a7bd5: test   DWORD PTR [rip+0xfffffffffff74425],eax        # 0x000000010831c000
                                                ;   {poll_return}
  0x00000001083a7bdb: ret    
  0x00000001083a7bdc: hlt    
  0x00000001083a7bdd: hlt    
  0x00000001083a7bde: hlt    
  0x00000001083a7bdf: hlt    
[Exception Handler]
[Stub Code]
  0x00000001083a7be0: jmp    0x000000010839af60  ;   {no_reloc}
[Deopt Handler Code]
  0x00000001083a7be5: call   0x00000001083a7bea
  0x00000001083a7bea: sub    QWORD PTR [rsp],0x5
  0x00000001083a7bef: jmp    0x0000000108375d00  ;   {runtime_call}
  0x00000001083a7bf4: hlt    
  0x00000001083a7bf5: hlt    
  0x00000001083a7bf6: hlt    
  0x00000001083a7bf7: hlt    
OopMapSet contains 0 OopMaps
That's a lot of output. Let's remove everything that isn't relevant:
  0x00000001083a7bcc: mov    eax,esi
   ;; copy the content of 'a' into eax
  0x00000001083a7bce: inc    eax        
   ;; increase eax by one
The JIT has propagated the ENABLED constant and removed the dead code.

If we run with '-Denabled=false', we'll get similar Assembly:

  0x000000010b4eb7cc: mov    eax,esi
  0x000000010b4eb7ce: dec    eax                ;*isub
                                                ; - com.constant_folding.StaticFinal_NonConstantExpression::doMath@12 (line 19)
So also in this case the JIT has propagated the constant and removed the dead code.

Original size of bytecode matters

So it seems that we can use a static final with a non-constant expression to enable or disable certain behavior for free. Unfortunately this isn't entirely true: inlining can still be prevented, because the decision to inline is based on the original bytecode size. To demonstrate this we'll use the following code:
public class StaticFinal_OriginalSizeMatters {

    public static void main(String[] args) {
        int result = 0;
        for (int k = 0; k < 1_000_000; k++) {
            result += doMath(k);
        }
        System.out.println(result);
    }

    final static boolean ENABLED = Boolean.getBoolean("enabled");

    public static int doMath(int a) {
        if (ENABLED) {
            System.out.print("n");
            System.out.print("e");
            System.out.print("v");
            System.out.print("e");
            System.out.print("r");
            return a + 1;
        } else {
            return a - 1;
        }
    }
}
When we run using:
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintInlining
-XX:FreqInlineSize=50
-Denabled=false
We'll see the following output:
@ 12   com.constant_folding.StaticFinal_OriginalSizeMatters::doMath (54 bytes)   callee is too large
@ 27  java/io/PrintStream::println (not loaded)   not inlineable
@ 12   com.constant_folding.StaticFinal_OriginalSizeMatters::doMath (54 bytes)   callee is too large
@ 27  java/io/PrintStream::println (not loaded)   not inlineable
@ 12   com.constant_folding.StaticFinal_OriginalSizeMatters::doMath (54 bytes)   hot method too big
So even though ENABLED is false, the method is still too fat to get inlined, because the original bytecode size is used.
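A common workaround (my sketch, not from the post) is to move the bulky, rarely-taken branch into its own method, so the hot method's original bytecode stays small enough for the size-based inlining heuristic:

```java
public class StaticFinal_SmallHotMethod {

    static final boolean ENABLED = Boolean.getBoolean("enabled");

    // The hot path is only a few bytecodes, so the size-based inlining
    // heuristic still accepts it even though the slow path is bulky.
    public static int doMath(int a) {
        if (ENABLED) {
            return slowPath(a);
        }
        return a - 1;
    }

    // The rarely-taken code lives in a separate method; its size no longer
    // counts against doMath's own bytecode size.
    private static int slowPath(int a) {
        System.out.print("never");
        return a + 1;
    }

    public static void main(String[] args) {
        System.out.println(doMath(5)); // 4 when -Denabled is not set
    }
}
```

The same pattern is used inside the JDK itself to keep hot paths inlinable; whether it helps in a particular case can be checked with -XX:+PrintInlining.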

Conclusion

A final static boolean with a constant expression is completely free: Javac does the constant propagation and dead code elimination, and there is no price to pay.

A final static boolean with a non-constant expression will be fully optimized by the JIT. However, inlining can still be prevented, because the original size of the bytecode determines whether something gets inlined, not what the JIT made out of it.

maandag 23 januari 2017

Devirtualization

When a non-static method is called on an interface or class, an `invokeinterface` or `invokevirtual` bytecode is emitted. This makes polymorphic calls possible: at runtime it is determined which method implementation should be executed.

The following useless method is used in this post.

    public static int sizePlusOne(Collection c) {
        return c.size() + 1;
    }

The sizePlusOne method calls the size method on the passed collection and increments the result by one.

If we view the bytecode using 'javap -c SomeClass' then we get the following output:

  public static int sizePlusOne(java.util.Collection);
    Code:
       0: aload_0
       1: invokeinterface #18,  1           // InterfaceMethod java/util/Collection.size:()I
       6: iconst_1
       7: iadd
       8: ireturn
The problem is that an `invokeinterface` or `invokevirtual` is a dynamic dispatch instead of a static dispatch. A dynamic dispatch involves more complexity because a vtable lookup is required, and it prevents inlining since the concrete method implementation isn't known.

What the JIT does is speculate that there will be at most 1 or 2 types; instead of doing a dynamic dispatch, it emits some cheap guards and static dispatches. To test this we'll be using Hotspot 1.8.0_91. Keep in mind that in the future these optimizations could be implemented differently.
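In Java terms, the guarded call site for a monomorphic profile looks roughly like the following hand-written approximation (my sketch, not actual JIT output):

```java
import java.util.ArrayList;
import java.util.Collection;

public class GuardSketch {

    // Hand-written approximation of the devirtualized call site C2 emits
    // when only ArrayList has been observed at this call site.
    static int sizePlusOneDevirtualized(Collection c) {
        if (c.getClass() == ArrayList.class) {
            // Guard passed: static dispatch, so the concrete size() method
            // is known and can be inlined into the caller.
            return ((ArrayList) c).size() + 1;
        }
        // In real compiled code this branch is an uncommon trap that
        // deoptimizes back to the interpreter; here we simply fall back
        // to the virtual call.
        return c.size() + 1;
    }

    public static void main(String[] args) {
        System.out.println(sizePlusOneDevirtualized(new ArrayList<>())); // 1
    }
}
```

The class-pointer comparison is a single load plus compare, which is far cheaper than an itable/vtable lookup and, more importantly, unlocks inlining.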

For this post we assume that:

  • the sizePlusOne method is called equally often with the different types; there is no dominant type.
  • there are many implementations loaded for the interface. Having only a single loaded implementation deserves a post of its own.
  • only the C2 code is relevant

Monomorphic callsite

The simplest version of the optimization is a monomorphic call site: so only a single type being used.

To demonstrate the monomorphic callsite, we use the following example:
public class Devirtualization_MonoMorphic {

    public static void main(String[] args) {
        ArrayList arrayList = new ArrayList();

        int result = 0;

        for (int k = 0; k < 100_000; k++) {
            result += sizePlusOne(arrayList);
        }

        System.out.println("result:" + result);
    }

    public static int sizePlusOne(Collection c) {
        return c.size() + 1;
    }
}
So we loop 100k times and execute the 'sizePlusOne' method with an ArrayList.

And we run it with the following settings to get the Assembly:

-XX:+UnlockDiagnosticVMOptions
-XX:PrintAssemblyOptions=intel
-XX:-TieredCompilation
-XX:-Inline
-XX:-BackgroundCompilation
-XX:CompileCommand=print,*Devirtualization_MonoMorphic.sizePlusOne
Inlining is disabled so we can focus on the content of the sizePlusOne method; this will not prevent inlining of simple accessors like ArrayList/LinkedList.size (for more information see '-XX:-InlineAccessors'; credits go to Andrew Haley for pointing this out). Background compilation is disabled so that compilation isn't done asynchronously, and tiered compilation is disabled since for this blogpost only the C2-emitted Assembly will be discussed.

When the program is run, the following Assembly is printed.

CompilerOracle: print *Devirtualization_MonoMorphic.sizePlusOne
Compiled method (c2)     247    9             com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne (9 bytes)
 total in heap  [0x000000010be7e3d0,0x000000010be7e678] = 680
 relocation     [0x000000010be7e4f0,0x000000010be7e508] = 24
 main code      [0x000000010be7e520,0x000000010be7e580] = 96
 stub code      [0x000000010be7e580,0x000000010be7e598] = 24
 oops           [0x000000010be7e598,0x000000010be7e5a0] = 8
 metadata       [0x000000010be7e5a0,0x000000010be7e5b0] = 16
 scopes data    [0x000000010be7e5b0,0x000000010be7e5e0] = 48
 scopes pcs     [0x000000010be7e5e0,0x000000010be7e660] = 128
 dependencies   [0x000000010be7e660,0x000000010be7e668] = 8
 nul chk table  [0x000000010be7e668,0x000000010be7e678] = 16
Loaded disassembler from /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/hsdis-amd64.dylib
Decoding compiled method 0x000000010be7e3d0:
Code:
[Disassembling for mach='i386:x86-64']
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x000000011417a560} 'sizePlusOne' '(Ljava/util/Collection;)I' in 'com/devirtualization/Devirtualization_MonoMorphic'
  # parm0:    rsi:rsi   = 'java/util/Collection'
  #           [sp+0x20]  (sp of caller)
  0x000000010be7e520: mov    DWORD PTR [rsp-0x14000],eax
  0x000000010be7e527: push   rbp
  0x000000010be7e528: sub    rsp,0x10           ;*synchronization entry
                                                ; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@-1 (line 21)

  0x000000010be7e52c: mov    r11d,DWORD PTR [rsi+0x8]  ; implicit exception: dispatches to 0x000000010be7e55d
  0x000000010be7e530: cmp    r11d,0xf8003231    ;   {metadata('java/util/ArrayList')}
  0x000000010be7e537: jne    0x000000010be7e54a  ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@1 (line 21)

  0x000000010be7e539: mov    eax,DWORD PTR [rsi+0x10]
  0x000000010be7e53c: inc    eax                ;*iadd
                                                ; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@7 (line 21)

  0x000000010be7e53e: add    rsp,0x10
  0x000000010be7e542: pop    rbp
  0x000000010be7e543: test   DWORD PTR [rip+0xffffffffff6fdab7],eax        # 0x000000010b57c000
                                                ;   {poll_return}
  0x000000010be7e549: ret    
  0x000000010be7e54a: mov    rbp,rsi
  0x000000010be7e54d: mov    esi,0xffffffde
  0x000000010be7e552: nop
  0x000000010be7e553: call   0x000000010be47120  ; OopMap{rbp=Oop off=56}
                                                ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@1 (line 21)
                                                ;   {runtime_call}
  0x000000010be7e558: call   0x000000010a9a2e44  ;   {runtime_call}
  0x000000010be7e55d: mov    esi,0xfffffff6
  0x000000010be7e562: nop
  0x000000010be7e563: call   0x000000010be47120  ; OopMap{off=72}
                                                ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@1 (line 21)
                                                ;   {runtime_call}
  0x000000010be7e568: call   0x000000010a9a2e44  ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@1 (line 21)
                                                ;   {runtime_call}
  0x000000010be7e56d: hlt    
  0x000000010be7e56e: hlt    
  0x000000010be7e56f: hlt    
  0x000000010be7e570: hlt    
  0x000000010be7e571: hlt    
  0x000000010be7e572: hlt    
  0x000000010be7e573: hlt    
  0x000000010be7e574: hlt    
  0x000000010be7e575: hlt    
  0x000000010be7e576: hlt    
  0x000000010be7e577: hlt    
  0x000000010be7e578: hlt    
  0x000000010be7e579: hlt    
  0x000000010be7e57a: hlt    
  0x000000010be7e57b: hlt    
  0x000000010be7e57c: hlt    
  0x000000010be7e57d: hlt    
  0x000000010be7e57e: hlt    
  0x000000010be7e57f: hlt    
[Exception Handler]
[Stub Code]
  0x000000010be7e580: jmp    0x000000010be6bf60  ;   {no_reloc}
[Deopt Handler Code]
  0x000000010be7e585: call   0x000000010be7e58a
  0x000000010be7e58a: sub    QWORD PTR [rsp],0x5
  0x000000010be7e58f: jmp    0x000000010be46d00  ;   {runtime_call}
  0x000000010be7e594: hlt    
  0x000000010be7e595: hlt    
  0x000000010be7e596: hlt    
  0x000000010be7e597: hlt    
OopMapSet contains 2 OopMaps

#0 
OopMap{rbp=Oop off=56}
#1 
OopMap{off=72}
That is a lot of output. Let's remove everything that isn't relevant and add some comments:
  0x000000010be7e52c: mov    r11d,DWORD PTR [rsi+0x8] 
    ;; Loads the address of the class of 'c' into register r11d
  0x000000010be7e530: cmp    r11d,0xf8003231 
    ;; Compare that address with the address of the ArrayList class
  0x000000010be7e537: jne    0x000000010be7e54a 
    ;; if the classes are different, jump to an uncommon trap 
  0x000000010be7e539: mov    eax,DWORD PTR [rsi+0x10]
    ;; copies the size field of the ArrayList object into eax
  0x000000010be7e53c: inc    eax                ;*iadd   
    ;; Add one to eax. 
If this were converted to pseudo Java code, we would get something like this:
  if (c.class != ArrayList.class) {
    uncommonTrap();
  }

  int result = c.size;
  result++;
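The guard-then-field-load shape above can also be written as compilable Java. The helper name and the exception standing in for the uncommon trap are mine, purely for illustration; the real uncommon trap deoptimizes back to the interpreter instead of throwing:

```java
import java.util.ArrayList;
import java.util.Collection;

class MonoMorphicSketch {
    // Sketch of what C2 effectively emits: a type-guard followed by a
    // static, inlinable call. The exception models the uncommon trap.
    static int sizePlusOneDevirtualized(Collection<?> c) {
        if (c.getClass() != ArrayList.class) {
            throw new IllegalStateException("uncommon trap");
        }
        // With the concrete type proven, ArrayList.size() inlines to a field read.
        return ((ArrayList<?>) c).size() + 1;
    }
}
```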

Conclusion

With only a single type at the callsite, the Assembly has just a simple type-guard followed by a static dispatch. So the dynamic dispatch is replaced by a static dispatch, and inlining of the method becomes possible.

Bimorphic callsite

The previous example has only one concrete type at the callsite. But what happens when a second type is added?
public class Devirtualization_BiMorphic {

    public static void main(String[] args) {
        ArrayList arrayList = new ArrayList();
        LinkedList linkedList = new LinkedList();

        int result = 0;

        for (int k = 0; k < 100_000; k++) {
            result += sizePlusOne(arrayList);
        }

        for (int k = 0; k < 100_000; k++) {
            result += sizePlusOne(linkedList);
        }

        System.out.println("result:" + result);
    }

    public static int sizePlusOne(Collection c) {
        return c.size() + 1;
    }
}
And we run with the following settings:
-XX:+UnlockDiagnosticVMOptions
-XX:PrintAssemblyOptions=intel
-XX:-TieredCompilation
-XX:-Inline
-XX:-BackgroundCompilation
-XX:CompileCommand=print,*Devirtualization_BiMorphic.sizePlusOne
First the Assembly of the monomorphic version is emitted; as soon as the LinkedList is encountered, the uncommon trap fires and eventually new Assembly is emitted:
Compiled method (c2)     291   11             com.devirtualization.Devirtualization_BiMorphic::sizePlusOne (9 bytes)
 total in heap  [0x000000010c339510,0x000000010c3397f8] = 744
 relocation     [0x000000010c339630,0x000000010c339648] = 24
 main code      [0x000000010c339660,0x000000010c3396c0] = 96
 stub code      [0x000000010c3396c0,0x000000010c3396d8] = 24
 oops           [0x000000010c3396d8,0x000000010c3396e0] = 8
 metadata       [0x000000010c3396e0,0x000000010c339700] = 32
 scopes data    [0x000000010c339700,0x000000010c339730] = 48
 scopes pcs     [0x000000010c339730,0x000000010c3397e0] = 176
 dependencies   [0x000000010c3397e0,0x000000010c3397e8] = 8
 nul chk table  [0x000000010c3397e8,0x000000010c3397f8] = 16
Decoding compiled method 0x000000010c339510:
Code:
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x00000001146395f0} 'sizePlusOne' '(Ljava/util/Collection;)I' in 'com/devirtualization/Devirtualization_BiMorphic'
  # parm0:    rsi:rsi   = 'java/util/Collection'
  #           [sp+0x20]  (sp of caller)
  0x000000010c339660: mov    DWORD PTR [rsp-0x14000],eax
  0x000000010c339667: push   rbp
  0x000000010c339668: sub    rsp,0x10           ;*synchronization entry
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@-1 (line 28)

  0x000000010c33966c: mov    r10d,DWORD PTR [rsi+0x8]  ; implicit exception: dispatches to 0x000000010c3396ad
  0x000000010c339670: cmp    r10d,0xf8003231    ;   {metadata('java/util/ArrayList')}
  0x000000010c339677: je     0x000000010c339687
  0x000000010c339679: cmp    r10d,0xf8008c83    ;   {metadata('java/util/LinkedList')}
  0x000000010c339680: jne    0x000000010c339698  ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)

  0x000000010c339682: mov    eax,DWORD PTR [rsi+0x10]  ;*getfield size
                                                ; - java.util.LinkedList::size@1 (line 326)
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)

  0x000000010c339685: jmp    0x000000010c33968a  ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)

  0x000000010c339687: mov    eax,DWORD PTR [rsi+0x10]  ;*synchronization entry
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@-1 (line 28)

  0x000000010c33968a: inc    eax                ;*iadd
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@7 (line 28)

  0x000000010c33968c: add    rsp,0x10
  0x000000010c339690: pop    rbp
  0x000000010c339691: test   DWORD PTR [rip+0xfffffffffe841969],eax        # 0x000000010ab7b000
                                                ;   {poll_return}
  0x000000010c339697: ret    
  0x000000010c339698: mov    rbp,rsi
  0x000000010c33969b: mov    esi,0xffffffc6
  0x000000010c3396a0: data32 xchg ax,ax
  0x000000010c3396a3: call   0x000000010c306120  ; OopMap{rbp=Oop off=72}
                                                ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)
                                                ;   {runtime_call}
  0x000000010c3396a8: call   0x000000010b840e44  ;   {runtime_call}
  0x000000010c3396ad: mov    esi,0xfffffff6
  0x000000010c3396b2: nop
  0x000000010c3396b3: call   0x000000010c306120  ; OopMap{off=88}
                                                ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)
                                                ;   {runtime_call}
  0x000000010c3396b8: call   0x000000010b840e44  ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)
                                                ;   {runtime_call}
  0x000000010c3396bd: hlt    
  0x000000010c3396be: hlt    
  0x000000010c3396bf: hlt    
[Exception Handler]
[Stub Code]
  0x000000010c3396c0: jmp    0x000000010c32af60  ;   {no_reloc}
[Deopt Handler Code]
  0x000000010c3396c5: call   0x000000010c3396ca
  0x000000010c3396ca: sub    QWORD PTR [rsp],0x5
  0x000000010c3396cf: jmp    0x000000010c305d00  ;   {runtime_call}
  0x000000010c3396d4: hlt    
  0x000000010c3396d5: hlt    
  0x000000010c3396d6: hlt    
  0x000000010c3396d7: hlt    
OopMapSet contains 2 OopMaps

#0 
OopMap{rbp=Oop off=72}
#1 
OopMap{off=88}
Again a lot of output. Let's trim it down a bit:
  0x000000010c33966c: mov    r10d,DWORD PTR [rsi+0x8]
     ;; Loads the address of the class of 'c' into register r10d
  0x000000010c339670: cmp    r10d,0xf8003231   
     ;; Compare that address with the address of the ArrayList class
  0x000000010c339677: je     0x000000010c339687
      ;; if it is an ArrayList, jump to 0x000000010c339687
  0x000000010c339679: cmp    r10d,0xf8008c83    
      ;; it was not an ArrayList, lets compare with address of LinkedList class
  0x000000010c339680: jne    0x000000010c339698  
      ;; if it is not a LinkedList either, jump to the uncommon trap
  0x000000010c339682: mov    eax,DWORD PTR [rsi+0x10] 
      ;; copies the size field of the LinkedList instance into eax
  0x000000010c339685: jmp    0x000000010c33968a  
  0x000000010c339687: mov    eax,DWORD PTR [rsi+0x10]  ;*synchronization entry
      ;; copies the size field of the ArrayList instance into eax
  0x000000010c33968a: inc    eax                
      ;; Add one to eax.
If we convert this back to pseudo Java, we get something like this:
int result;
if(c.class == ArrayList.class){
    result = c.size;
} else if(c.class == LinkedList.class){
    result = c.size;
} else {
    uncommonTrap();
}
result++;
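As compilable Java, the bimorphic guard chain looks like this (helper name and exception-as-trap are my own illustration; the order matches the Assembly, with ArrayList checked first):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedList;

class BiMorphicSketch {
    // Sketch of the bimorphic guard chain: ArrayList is checked first,
    // LinkedList second, and anything else hits the uncommon trap
    // (modeled here as an exception; the JIT really deoptimizes).
    static int sizePlusOneDevirtualized(Collection<?> c) {
        if (c.getClass() == ArrayList.class) {
            return ((ArrayList<?>) c).size() + 1;
        } else if (c.getClass() == LinkedList.class) {
            return ((LinkedList<?>) c).size() + 1;
        } else {
            throw new IllegalStateException("uncommon trap");
        }
    }
}
```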

Conclusion

In the case of a bimorphic callsite, we don't need to pay the price of a dynamic dispatch and inlining isn't prevented. However, one of the types (ArrayList) needs a single type-guard and the other (LinkedList) needs two.

Megamorphic callsite

Let's add a third type of Collection and see what happens:
public class Devirtualization_MegaMorphic {

    public static void main(String[] args) {
        ArrayList arrayList = new ArrayList();
        LinkedList linkedList = new LinkedList();
        HashSet hashSet = new HashSet();

        int result = 0;

        for (int k = 0; k < 100_000; k++) {
            result += sizePlusOne(arrayList);
        }

        for (int k = 0; k < 100_000; k++) {
            result += sizePlusOne(linkedList);
        }

        for (int k = 0; k < 100_000; k++) {
            result += sizePlusOne(hashSet);
        }

        System.out.println("result:" + result);
    }

    public static int sizePlusOne(Collection c) {
        return c.size() + 1;
    }
}
If we run with the following settings:
-XX:+UnlockDiagnosticVMOptions
-XX:PrintAssemblyOptions=intel
-XX:-TieredCompilation
-XX:-Inline
-XX:-BackgroundCompilation
-XX:CompileCommand=print,*Devirtualization_MegaMorphic.sizePlusOne
Then the following Assembly is eventually emitted:
Compiled method (c2)     268   14             com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne (9 bytes)
 total in heap  [0x000000010ea7d550,0x000000010ea7d788] = 568
 relocation     [0x000000010ea7d670,0x000000010ea7d680] = 16
 main code      [0x000000010ea7d680,0x000000010ea7d6c0] = 64
 stub code      [0x000000010ea7d6c0,0x000000010ea7d6d8] = 24
 oops           [0x000000010ea7d6d8,0x000000010ea7d6e0] = 8
 metadata       [0x000000010ea7d6e0,0x000000010ea7d6e8] = 8
 scopes data    [0x000000010ea7d6e8,0x000000010ea7d708] = 32
 scopes pcs     [0x000000010ea7d708,0x000000010ea7d768] = 96
 dependencies   [0x000000010ea7d768,0x000000010ea7d770] = 8
 handler table  [0x000000010ea7d770,0x000000010ea7d788] = 24
Decoding compiled method 0x000000010ea7d550:
Code:
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x0000000116d7a678} 'sizePlusOne' '(Ljava/util/Collection;)I' in 'com/devirtualization/Devirtualization_MegaMorphic'
  # parm0:    rsi:rsi   = 'java/util/Collection'
  #           [sp+0x20]  (sp of caller)
  0x000000010ea7d680: mov    DWORD PTR [rsp-0x14000],eax
  0x000000010ea7d687: push   rbp
  0x000000010ea7d688: sub    rsp,0x10           ;*synchronization entry
                                                ; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@-1 (line 33)

  0x000000010ea7d68c: nop
  0x000000010ea7d68d: movabs rax,0xffffffffffffffff
  0x000000010ea7d697: call   0x000000010ea45f60  ; OopMap{off=28}
                                                ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@1 (line 33)
                                                ;   {virtual_call}
  0x000000010ea7d69c: inc    eax                ;*iadd
                                                ; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@7 (line 33)

  0x000000010ea7d69e: add    rsp,0x10
  0x000000010ea7d6a2: pop    rbp
  0x000000010ea7d6a3: test   DWORD PTR [rip+0xffffffffff742957],eax        # 0x000000010e1c0000
                                                ;   {poll_return}
  0x000000010ea7d6a9: ret                       ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@1 (line 33)

  0x000000010ea7d6aa: mov    rsi,rax
  0x000000010ea7d6ad: add    rsp,0x10
  0x000000010ea7d6b1: pop    rbp
  0x000000010ea7d6b2: jmp    0x000000010ea6ece0  ;   {runtime_call}
  0x000000010ea7d6b7: hlt    
  0x000000010ea7d6b8: hlt    
  0x000000010ea7d6b9: hlt    
  0x000000010ea7d6ba: hlt    
  0x000000010ea7d6bb: hlt    
  0x000000010ea7d6bc: hlt    
  0x000000010ea7d6bd: hlt    
  0x000000010ea7d6be: hlt    
  0x000000010ea7d6bf: hlt    
[Exception Handler]
[Stub Code]
  0x000000010ea7d6c0: jmp    0x000000010ea6bf60  ;   {no_reloc}
[Deopt Handler Code]
  0x000000010ea7d6c5: call   0x000000010ea7d6ca
  0x000000010ea7d6ca: sub    QWORD PTR [rsp],0x5
  0x000000010ea7d6cf: jmp    0x000000010ea46d00  ;   {runtime_call}
  0x000000010ea7d6d4: hlt    
  0x000000010ea7d6d5: hlt    
  0x000000010ea7d6d6: hlt    
  0x000000010ea7d6d7: hlt    
Again a lot of output. If we extract only the relevant part, we get:
  0x000000010ea7d68d: movabs rax,0xffffffffffffffff
  0x000000010ea7d697: call   0x000000010ea45f60  ; OopMap{off=28}
                                                ;*invokeinterface size
                                                ; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@1 (line 33)
                                                ;   {virtual_call}
  0x000000010ea7d69c: inc    eax                ;*iadd
                                                ; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@7 (line 33)
The code above contains the expensive virtual call to Collection.size.

Conclusion

The JIT has given up on devirtualizing the callsite after it has encountered 3 different implementations.
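A common workaround (not shown in this post, but a standard response to this result) is to peel the hottest receiver type manually, so that at least that path stays monomorphic. The helper below is a hypothetical illustration:

```java
import java.util.ArrayList;
import java.util.Collection;

class MegaMorphicWorkaroundSketch {
    // Hypothetical manual peeling of the hottest receiver type: the
    // instanceof branch gives the JIT a statically-known type it can
    // inline, while the fall-through keeps the (now rarer) virtual call.
    static int sizePlusOne(Collection<?> c) {
        if (c instanceof ArrayList) {
            return ((ArrayList<?>) c).size() + 1; // static dispatch, inlinable
        }
        return c.size() + 1; // still a virtual call for all other types
    }
}
```

Whether this pays off depends on the actual type distribution at the callsite, so it should be verified with a benchmark rather than applied blindly.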

Recommended reads

I would start out with this excellent post by Aleksey Shipilëv; he goes into much more detail.

There is also a very good post by Richard Warburton.

And a very informative post by Martin Thompson.