Simple thoughts & complex stuff (blog by pveentjer, http://www.blogger.com/profile/17847641595368096163)

Will temporary assignment to local variable impact performance? (2021-11-22)

This post was written as a consequence of the following Stack Overflow <a href="https://stackoverflow.com/questions/70026850/is-it-better-to-store-values-from-instance-fields-as-variables-prior-to-using-th">question</a>. The question is whether temporarily assigning the results of some expressions to local variables has any performance impact.
<br/><br/>
So is there any performance difference between the following 2 snippets:
<pre>
int a = triple.getA();
int b = triple.getB();
int c = triple.getC();
return a + b + c;
</pre>
and:
<pre>
return triple.getA() + triple.getB() + triple.getC();
</pre>
The JIT is pretty clever. If there were any performance gain from removing temporary assignments to local variables, it would happily get rid of them. However, there is one caveat: the inlining threshold for frequently executed methods depends on the bytecode size of the method, that is, the original size of the bytecode. So if you make the bytecode of a method larger than needed, it can prevent inlining. And inlining is one of the most crucial optimizations because it enables further optimizations, as we'll see.
<br/>
<h1>Benchmark</h1>
The following JMH benchmark will determine if there is any performance impact.
<pre>
package org.sample;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.CompilerControl;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OperationsPerInvocation;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import java.util.concurrent.TimeUnit;
import static org.openjdk.jmh.annotations.CompilerControl.Mode.DONT_INLINE;
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 1)
@Measurement(iterations = 1)
@Fork(warmups = 1, value = 1)
@OperationsPerInvocation(InlineBenchmark.OPERATIONS)
public class InlineBenchmark {

    public static final int OPERATIONS = 1000 * 1000;

    @State(Scope.Benchmark)
    public static class Triple {
        int a;
        int b;
        int c;

        int getA() {
            return a;
        }

        int getB() {
            return b;
        }

        int getC() {
            return c;
        }
    }

    static int large(Triple triple) {
        int a = triple.getA();
        int b = triple.getB();
        int c = triple.getC();
        return a + b + c;
    }

    static int small(Triple triple) {
        return triple.getA() + triple.getB() + triple.getC();
    }

    @CompilerControl(DONT_INLINE)
    @Benchmark
    @Fork(jvmArgs = "-XX:FreqInlineSize=20")
    public int benchmark_small(Triple triple) {
        int v = 0;
        for (int k = 0; k < OPERATIONS; k++) {
            v = small(triple);
        }
        return v;
    }

    @CompilerControl(DONT_INLINE)
    @Benchmark
    @Fork(jvmArgs = "-XX:FreqInlineSize=20")
    public long benchmark_large_with_low_inline_size(Triple triple) {
        int v = 0;
        for (int k = 0; k < OPERATIONS; k++) {
            v = large(triple);
        }
        return v;
    }

    @CompilerControl(DONT_INLINE)
    @Benchmark
    @Fork(jvmArgs = "-XX:FreqInlineSize=21")
    public long benchmark_large_with_high_inline_size(Triple triple) {
        int v = 0;
        for (int k = 0; k < OPERATIONS; k++) {
            v = large(triple);
        }
        return v;
    }
}
</pre>
There are 2 important methods. The 'small' method that does not use temporary local variables:
<pre>
static int small(Triple triple) {
    return triple.getA() + triple.getB() + triple.getC();
}
</pre>
And the 'large' method that does use temporary local variables:
<pre>
static int large(Triple triple) {
    int a = triple.getA();
    int b = triple.getB();
    int c = triple.getC();
    return a + b + c;
}
</pre>
<br/>
There are 3 benchmark methods:
<ol>
<li>benchmark_small: calls the 'small' method in a loop. It is configured with the same FreqInlineSize (20) as benchmark_large_with_low_inline_size.</li>
<li>benchmark_large_with_low_inline_size: calls the 'large' method in a loop. FreqInlineSize is set just below the bytecode size of the 'large' method, which prevents it from being inlined.</li>
<li>benchmark_large_with_high_inline_size: calls the 'large' method in a loop. FreqInlineSize is set equal to the bytecode size of the 'large' method, which allows it to be inlined.</li>
</ol>
If we look at the bytecode of the 'large' and 'small' methods, we can see that the bytecode of the 'large' method is 21 bytes and that of the 'small' method is 15 bytes.
<pre>
static int large(org.sample.InlineBenchmark$Triple);
Code:
0: aload_0
1: invokevirtual #2 // Method org/sample/InlineBenchmark$Triple.getA:()I
4: istore_1
5: aload_0
6: invokevirtual #3 // Method org/sample/InlineBenchmark$Triple.getB:()I
9: istore_2
10: aload_0
11: invokevirtual #4 // Method org/sample/InlineBenchmark$Triple.getC:()I
14: istore_3
15: iload_1
16: iload_2
17: iadd
18: iload_3
19: iadd
20: ireturn
static int small(org.sample.InlineBenchmark$Triple);
Code:
0: aload_0
1: invokevirtual #2 // Method org/sample/InlineBenchmark$Triple.getA:()I
4: aload_0
5: invokevirtual #3 // Method org/sample/InlineBenchmark$Triple.getB:()I
8: iadd
9: aload_0
10: invokevirtual #4 // Method org/sample/InlineBenchmark$Triple.getC:()I
13: iadd
14: ireturn
</pre>
So there is definitely a size difference, because javac does not optimize the code (that is the task of the JIT).
<br/><br/>
In the above benchmark, benchmark_large_with_high_inline_size has been configured with FreqInlineSize=21, which allows the 'large' method to be inlined, and benchmark_large_with_low_inline_size has been configured with FreqInlineSize=20, which prevents the 'large' method from being inlined.
<br/><br/>
Results:
<pre>
Benchmark Mode Cnt Score Error Units
InlineBenchmark.benchmark_large_with_high_inline_size avgt ≈ 10⁻⁶ ns/op
InlineBenchmark.benchmark_large_with_low_inline_size avgt 1.306 ns/op
InlineBenchmark.benchmark_small avgt ≈ 10⁻⁶ ns/op
</pre>
As can be seen, benchmark_large_with_high_inline_size is a lot faster than benchmark_large_with_low_inline_size. We can also see that the 'small' method could still be inlined with FreqInlineSize=20, while the 'large' method could not.
<br/><br/>
The reason why benchmark_large_with_high_inline_size and benchmark_small are so fast is inlining. The JIT could see what is happening inside the method called in the loop, which provided the opportunity for further optimization. In this case, there is no reason to execute the logic of the inlined method 1M times, since the result will be the same for every iteration; executing it once is sufficient.
<h1>Conclusions</h1>
Temporarily assigning values to local variables can reduce performance when it pushes the bytecode size of a method over the inlining threshold. If the threshold isn't exceeded, there doesn't seem to be any performance impact.
<br/><br/>
This is not an excuse to remove all temporary assignments to local variables, because local variables can increase clarity. Optimize only when you have determined that a method is a bottleneck; otherwise, clean code is preferable to premature optimization.
C++ atomic doesn't imply store atomicity (2021-11-09)

One thing that recently confused me is that a C++ atomic, with its atomicity guarantee, does not guarantee multi-copy atomicity. In this post I want to shed some light on the situation. I'm not a C++ expert, but I'm interested in hardware-level and software-level memory models.
<h2 style="text-align: left;">Multi-copy atomicity</h2><p>An access is multi-copy atomic (also called store/write atomicity) when there is a single logical moment at which a store issued by some CPU becomes visible to all other CPUs. The issuing CPU is allowed to see the store early, e.g. due to store-to-load forwarding, which is why the name multi-copy atomic is a bit misleading. </p><p>The consequence of multi-copy atomicity is that stores issued by different CPUs to different addresses will be seen by all other CPUs in the same order. Therefore multi-copy atomicity guarantees that some total order over the stores exists. Examples of multi-copy atomic systems are X86, with its TSO (Total Store Order) memory model, and ARMv8.
</p>When a single logical moment does not exist, the access is called non-multi-copy atomic. Some causes of an access being non-multi-copy atomic:
<ol>
<li>the store buffers are shared by a strict subset of the CPUs</li>
<li>a CPU that commits a store to the cache doesn't wait for the cache line to be invalidated on all other CPUs</li>
</ol>
An example of a non-multi-copy atomic system is the PowerPC. The ARMv7 specification allows non-multi-copy atomicity as well, but since all actual ARMv7 implementations are multi-copy atomic, the specification was updated in ARMv8 to require multi-copy atomicity, simplifying the memory model.
<h1>IRIW</h1>
The litmus test for multi-copy atomicity is the IRIW litmus test: independent reads of independent writes. The test is shown below:
<pre>CPU1:
A=1
CPU2:
B=1
CPU3:
r1=A
[LoadLoad]
r2=B
CPU4:
r3=B
[LoadLoad]
r4=A
</pre>
Could it be that r1=1, r2=0, r3=1, r4=0? In other words, could the stores to A and B be seen in different orders? If accesses are multi-copy atomic, this can't happen, because stores to different addresses issued by different CPUs need to be seen in the same order. But when accesses are non-multi-copy atomic, it can happen, because stores issued by different CPUs to different addresses can be seen in different orders.
<h1>C++ atomic</h1>
With C++11 a new memory model was introduced, including atomics. For this post I'll only look at the load/store operations on an atomic and ignore RMW operations like compare_exchange_weak/strong. A load/store provides the following guarantees:
<ol>
<li>atomicity: the operations are indivisible and you don't end up with a value that is a combination of multiple reads/writes</li>
<li>synchronization: depending on the memory_order, it will order surrounding loads/stores</li>
</ol><p>
By default all loads/stores use memory_order_seq_cst.
<br /><br />
The problem in my understanding was that I assumed that multi-copy atomicity was part of the atomicity guarantee of the atomic, independent of the memory_order. This is incorrect. </p><p>Only in the case of memory_order_seq_cst is multi-copy atomicity implied, because sequential consistency requires that some total order over all loads/stores exists, and hence a total order over the stores exists. Therefore, if the IRIW example were implemented using memory_order_seq_cst, the given observation wouldn't be possible.
<br /><br />
But when a memory_order weaker than memory_order_seq_cst is used, multi-copy atomicity is not implied. As a consequence, if the IRIW example were implemented using memory_order_release stores and memory_order_acquire loads, the given observation would be possible.
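Although this post is about C++, a rough Java sketch may help readers of the other posts on this blog: Java's access modes map approximately onto the C++ memory orders. A plain volatile access is sequentially consistent (like memory_order_seq_cst), so a volatile-based IRIW cannot show the observation, while VarHandle's setRelease/getAcquire correspond roughly to memory_order_release/memory_order_acquire. The class and field names below are my own invention, not from the post:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class OrderingModes {
    // Plain volatile access: sequentially consistent, ~ memory_order_seq_cst.
    volatile int seqCst;

    // Field accessed below with weaker acquire/release modes.
    int relAcq;

    static final VarHandle RELACQ;
    static {
        try {
            RELACQ = MethodHandles.lookup()
                    .findVarHandle(OrderingModes.class, "relAcq", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // ~ store with memory_order_release
    void releaseStore(int v) {
        RELACQ.setRelease(this, v);
    }

    // ~ load with memory_order_acquire
    int acquireLoad() {
        return (int) RELACQ.getAcquire(this);
    }
}
```

With these weaker modes, a Java IRIW test is allowed to observe the stores in different orders on sufficiently weak hardware, just like the C++ release/acquire version.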
</p><h1>Conclusion</h1>
Multi-copy atomicity and atomicity are different concepts, and atomicity doesn't imply multi-copy atomicity. They are overloaded terms and it is very easy to get confused.
<h1>Update</h1><p>
It seems there are conflicting definitions of single-copy atomicity, as can be seen in this Twitter <a href="https://twitter.com/PeterVeentjer/status/1458319135780347905">discussion</a>. In some academic <a href="https://www.cs.tufts.edu/~nr/cs257/archive/arvind/wmm.pdf">papers</a>, single-copy atomicity is a stricter version of multi-copy atomicity whereby even the issuing CPU needs to see the store in the same order. But in the ARM and PowerPC reference manuals, it effectively means that when you load/store some value, you don't end up with a value that is a mix of multiple loads/stores; it says nothing about the order of stores issued by different CPUs to different addresses. </p><p><br />
The original version of this post was based on the stricter definition, but it seems that the ARM/PowerPC definition is more widely used, and therefore I updated the post. I would like to thank Aleksey Shipilëv and Travis Downs for pointing out the conflict.</p>

TSO and IBM System/370 (2021-07-20)
[Document is currently being updated due to new insights]
The <a href="https://en.wikipedia.org/wiki/IBM_System/370" target="_blank">IBM System/370 (IBM 370)</a> is a mainframe from the 1970s.
<div class="separator" style="clear: both;"><a href="https://i.pcmag.com/imagery/encyclopedia-terms/system370-_ibm370.fit_lim.size_800x.gif" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="299" data-original-width="400" src="https://i.pcmag.com/imagery/encyclopedia-terms/system370-_ibm370.fit_lim.size_800x.gif"/></a></div>
<br><br>
You may wonder why I would discuss a machine few people have heard of and even fewer will ever need to work with. To understand memory models like those of Java and C++, and to understand their performance implications, I find it useful to have some insight into hardware memory models, how they are implemented, and especially how they relate to each other. The reason why the IBM 370 is so interesting is that it shows a very nice transition from sequential consistency (SC) to total store order (TSO).
<br><br>
TSO is the memory model used on the X86, but there are other ISAs, like SPARC v8/v9, that offer TSO as well. A performance advantage of TSO over sequential consistency (SC) is the following: under SC, when a CPU does a store and there is a write miss, the CPU needs to wait for the cache line to be invalidated on all other CPUs that use this cache line (or wait for main memory) before it can write to the cache line. As a consequence, the performance of the CPU is reduced, because it can't execute other instructions, since that could lead to loads/stores being performed out of order in the memory order. Effectively, it isn't able to hide the memory latency.
<br><br>
Instead of blocking, and since non-speculative stores are going to be written to the cache anyway, the stores are written to a store buffer that sits between the CPU and the cache. The stores in the store buffer are kept in program order and are committed to the cache in program order.
<h2>Store/Load reordering</h2>
Because stores are buffered, a store can be reordered, in the memory order, with a later load to a different address. This is demonstrated with the store buffering litmus test.
<pre>
Initial:
X=0, Y=0
CPU1:
X=1
r1=Y
CPU2:
Y=1
r2=X
</pre>
Can it be that r1=0, r2=0?
<br><br>
With SC this can't happen, because no reordering is allowed (PO needs to be preserved in the memory order) and therefore an earlier store can't be reordered with a later load to a different address. But with TSO this can happen due to store buffers: even though the store retires before the load retires, the store is placed in the memory order after the load.
<br><br>
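The mechanism can be made concrete with a toy simulation in Java (my own sketch, not real hardware and not from the post): each CPU parks its stores in a FIFO buffer while loads read the shared memory directly, which produces exactly the r1=0, r2=0 outcome.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class StoreBufferingLitmus {
    // Shared "cache/memory"; X and Y start at 0.
    static final Map<String, Integer> MEMORY = new HashMap<>();
    static {
        MEMORY.put("X", 0);
        MEMORY.put("Y", 0);
    }

    // Each CPU has a FIFO store buffer: stores drain to memory in program
    // order, but younger loads may be performed while older stores still
    // sit in the buffer.
    static class Cpu {
        final Deque<Map.Entry<String, Integer>> buffer = new ArrayDeque<>();

        void store(String addr, int value) {
            buffer.addLast(Map.entry(addr, value));
        }

        int load(String addr) {
            // In this litmus test a CPU never loads an address it stored to,
            // so the load simply reads the shared memory.
            return MEMORY.get(addr);
        }

        void drainAll() {
            while (!buffer.isEmpty()) {
                Map.Entry<String, Integer> s = buffer.pollFirst();
                MEMORY.put(s.getKey(), s.getValue());
            }
        }
    }

    public static void main(String[] args) {
        Cpu cpu1 = new Cpu(), cpu2 = new Cpu();
        cpu1.store("X", 1);        // X=1 parked in cpu1's store buffer
        cpu2.store("Y", 1);        // Y=1 parked in cpu2's store buffer
        int r1 = cpu1.load("Y");   // reads shared memory: 0
        int r2 = cpu2.load("X");   // reads shared memory: 0
        cpu1.drainAll();
        cpu2.drainAll();
        System.out.println("r1=" + r1 + ", r2=" + r2); // r1=0, r2=0
    }
}
```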
So with SC, the memory order will respect the PO of each CPU; in other words, it will be consistent with all 4 fences: [LoadLoad][LoadStore][StoreLoad][StoreStore]. But since with TSO an older store can be reordered with a newer load to a different address, the memory order will only be consistent with the preserved program order of each CPU, that is [LoadLoad][LoadStore][StoreStore], since the [StoreLoad] fence is dropped.
<br><br>
<h2>Load/Store reordering</h2>
Is it allowed for an earlier load to be reordered with a later store to a different address? This can be expressed using the load buffering litmus test:
<pre>
Initial:
X=0, Y=0
CPU1:
r1=X
Y=1
CPU2:
r2=Y
X=1
</pre>
Is the outcome r1=1 and r2=1 possible?
<br><br>
With TSO this isn't allowed: an older load can't be reordered with a newer store to a different address. Think about it: the load is globally performed (read from the cache) before it retires. The store retires after the load, and the store is only globally performed (written to the cache) after it retires. So an earlier load will be performed in the memory order before a later store to a different address.
<h2>Load/Load and Store/Store reordering</h2>
Is it allowed for an earlier load to be reordered with a later load to a different address? And the same goes for 2 stores? This is expressed using the message passing litmus test:
<pre>
Initial:
X=0, Y=0
CPU1:
X=1
Y=1
CPU2:
r1=Y
r2=X
</pre>
Can it be that r1=1 and r2=0?
<br><br>
With TSO this isn't allowed because it will preserve [LoadLoad] and [StoreStore]. The Intel X86 does make use of speculative out-of-order execution of loads, but if it detects a potential reordering, it will flush the pipeline and try again. This situation is called a memory order violation.
<h2>Store to Load forwarding (STLF)</h2>
One difference between SC and TSO is that TSO drops the [StoreLoad] fence. But there is another very important difference: with SC, every load/store is atomic (such a store is called single-copy atomic) and you get the guarantee that at least one total order over all loads/stores exists.
<br><br>
TSO doesn't provide single-copy atomic stores, due to the store buffer. Imagine the following program:
<pre>
CPU1:
A=1
r1=A
</pre>
If the load would not look in the store buffer for the store, the CPU would not see its own store. So with TSO, the CPU can look into the store buffer before the store is committed to the cache. This is called store-to-load forwarding (STLF). The problem is that the CPU can now see its own store before other CPUs can. Such a store is called multi-copy atomic, since there are 2 copies: one in the store buffer and one in the cache.
<br><br>
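STLF can also be sketched with a toy simulation (my own sketch, not real hardware and not from the post): a load first searches the CPU's own store buffer, newest entry first, so the issuing CPU sees its store before it is committed to the shared memory.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class StoreBufferDemo {
    // Shared "cache/memory"; unwritten addresses read as 0.
    static final Map<String, Integer> MEMORY = new HashMap<>();

    static class Cpu {
        // FIFO store buffer: drains to shared memory in program order.
        final Deque<Map.Entry<String, Integer>> buffer = new ArrayDeque<>();

        void store(String addr, int value) {
            buffer.addLast(Map.entry(addr, value));
        }

        int load(String addr) {
            // STLF: forward the newest buffered store to this address, if any.
            Iterator<Map.Entry<String, Integer>> it = buffer.descendingIterator();
            while (it.hasNext()) {
                Map.Entry<String, Integer> s = it.next();
                if (s.getKey().equals(addr)) {
                    return s.getValue();
                }
            }
            return MEMORY.getOrDefault(addr, 0);
        }

        void drainOne() {
            Map.Entry<String, Integer> s = buffer.pollFirst();
            if (s != null) {
                MEMORY.put(s.getKey(), s.getValue());
            }
        }
    }

    public static void main(String[] args) {
        Cpu cpu1 = new Cpu();
        Cpu cpu2 = new Cpu();
        cpu1.store("A", 1);                   // A=1 sits in cpu1's store buffer
        System.out.println(cpu1.load("A"));   // 1: cpu1 sees its own store early
        System.out.println(cpu2.load("A"));   // 0: not yet globally visible
        cpu1.drainOne();                      // commit the store to shared memory
        System.out.println(cpu2.load("A"));   // 1: now visible to everyone
    }
}
```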
This can lead to some weird behavior as is demonstrated with the following test:
<pre>
Initial:
X=0, Y=0
CPU1:
X=1
r1=X
r2=Y
CPU2:
Y=1
r3=Y
r4=X
</pre>
Can it be that r1=1, r3=1, r2=0 and r4=0? In other words, could r2=Y jump ahead of r1=X, and could r4=X jump ahead of r3=Y? With TSO this is allowed, because X=1 can be globally performed after r2=Y due to the store buffer. r1=X can only be globally performed after X=1 is globally performed, so r1=X is globally performed after r2=Y is globally performed. So effectively, the store of X in the store buffer carries the load of X in the global memory order past the load of Y.
<br><br>
Keep in mind that globally performed isn't the same as executed because the r1=X could very well be executed before r2=Y.
<br><br>
The consequence is that in the memory order, X=1 happened before r1=X. But the only way to get this outcome is if r1=X is performed before X=1. This means the store happened before the load and the load happened before the store; we have a cycle on the same address. The consequence is that the memory order can't always order loads/stores to the same address, and the memory order doesn't provide a total order over all loads/stores. This is a direct consequence of multi-copy store atomicity.
<h2>Total order over stores</h2>
Luckily, multi-copy store atomicity still defines a total order over the stores: there is still a single moment in time at which a store becomes globally visible (when it commits to the cache). That is how TSO gets its name: it guarantees that at least one total order exists over the stores issued by different CPUs to different addresses. This can be demonstrated with the independent reads of independent writes (IRIW) litmus test:
<pre>
Initial:
X=0, Y=0
CPU1:
X=1
CPU2:
Y=1
CPU3:
r1=X
r2=Y
CPU4:
r3=Y
r4=X
</pre>
Can it be that r1=1, r2=0, r3=1, r4=0? In other words, could the CPUs see the changes to different addresses issued by different CPUs in different orders? With TSO this isn't allowed, because of the total order over the stores.
<h2>IBM 370</h2>
So how does the IBM 370 fit into the picture? The IBM 370 sits exactly between SC and TSO. As with SC, the IBM 370 requires a total order over all loads/stores. And as with TSO, older stores can be reordered with newer loads to different addresses; so the memory order will preserve [LoadLoad][LoadStore][StoreStore].
<br><br>
If there is a store followed by a load to the same address, the IBM 370 requires that the store becomes globally visible before the load can be performed. What is the point, if the load returns the same value as it would with TSO? The point is that it prevents the reordering of an earlier load that could be satisfied using STLF with a later load that needs to come from the cache, and this ensures that a store is single-copy atomic instead of multi-copy atomic.
<br><br>
I'll explain it using the previous example.
<pre>
Initial:
X=0, Y=0
CPU1:
X=1
r1=X
r2=Y
CPU2:
Y=1
r3=Y
r4=X
</pre>
With TSO, r1=X and r2=Y can be reordered due to STLF. But with the IBM 370, the load r1=X can only be globally performed after X=1 is globally performed, and the load r2=Y can only be globally performed after r1=X is globally performed, so the loads are forced to be performed in order. As a consequence, we can't end up with r1=1, r2=0, r3=1, r4=0.
<br><br>
For more information check <a href="https://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf" target="_blank">Shared Memory Consistency Models: A Tutorial</a>.
Percentiles and mean (2017-08-04)

For my work as a performance engineer at Hazelcast, I need to deal with quite a lot of statistics.
Unfortunately, statistics has never been my strongest topic; I really wish I had paid more attention in high school. Some
basic understanding is critical if you want to do a proper analysis of test data, and in this post I'll zoom in on the
problem of combining percentiles with means.
<p/>
For this post I'll be focusing on a latency test: executing requests at a fixed rate and recording the response times.
For latency tests, mean latency and standard deviation are not very useful; you want to look at the percentiles,
e.g. p99 or p99.9. In this post we'll focus on the p99.
<h1>The test data</h1>
For this test we assume the following dataset: 200 requests (2 per second) over 100 seconds, with the following latency distribution:
<ul>
<li>190 measurements are 1ms</li>
<li>10 measurements are 400ms: these are equally spread over the 200 requests</li>
</ul>
<h1>p99</h1>
The 99th percentile (p99) is 400ms, since 99% of all measured values are equal to or smaller than 400ms. This is easy to see:
just sort all items from small to large and jump to the 198th measurement, which is 400ms, since measurements 190 to 200 are
400ms.
<p/>
Click on the following <a href="http://www.emathhelp.net/calculators/probability-statistics/percentile-calculator/?i=1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C1%2C+400%2C400%2C400%2C400%2C400%2C400%2C400%2C400%2C400%2C400&p=99&steps=on
">link</a> if you don't trust me.
<h1>p99 per 1 second interval</h1>
Having a p99 is very useful, but it won't help you if there are fluctuations (e.g. GC) during the test and you want to understand when they happened. For that you need to know how the percentiles behave over time. In this case it is useful to determine the latency distribution per 1-second interval, instead of the whole distribution over 100 seconds.
<p/>
To calculate p99 per interval of 1 second for our test we'll get:
<ul>
<li>10 intervals with 2 measurements; 1ms + 400ms.</li>
<li>90 intervals with 2 measurements; 1ms + 1ms.</li>
</ul>
<p/>
If we calculate the p99 for each of these intervals, then we get:
<ul>
<li>10 intervals with a <a href="http://www.emathhelp.net/calculators/probability-statistics/percentile-calculator/?i=1%2C400&p=99&steps=on">p99 of 400ms</a></li>
<li>90 intervals with a <a href="http://www.emathhelp.net/calculators/probability-statistics/percentile-calculator/?i=1%2C1&p=99&steps=on">p99 of 1ms</a></li>
</ul>
Each of these intervals is coupled to a certain time, so if the first bad interval happens 24 seconds after the test starts, you can determine the absolute time and check, for example, the GC log.
<h1>Where it goes wrong: combining mean with percentiles</h1>
When you look at an interval p99 plot, it feels kind of natural to determine the p99 based on the mean of these intervals.
In this case the mean p99 would be:
<pre>
(90*1+10*400)/100=40.9ms
</pre>
This is an order of magnitude less than the actual p99 of 400ms!
<p/>
The correct way to determine the p99 based on these intervals is to merge the latency distributions of all intervals into a single latency distribution, and then determine the p99 on the merged distribution.
<p/>
In the above example we combined the mean with interval p99s, but the same mistake is easily made when, for example, multiple machines generate load and you want to combine the data of these machines. If you determine the p99 using the mean of the p99 of each machine, you make the same calculation error.
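The wrong and the right calculation can be made concrete with a small sketch (not from the original post; it uses the nearest-rank percentile definition, and the class and method names are my own):

```java
import java.util.Arrays;

public class PercentileDemo {

    // Nearest-rank percentile: the smallest value such that at least p% of
    // the values are less than or equal to it. Expects a sorted array.
    static double percentile(double[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // The test data: 190 requests of 1ms and 10 requests of 400ms.
        double[] latencies = new double[200];
        Arrays.fill(latencies, 0, 190, 1.0);
        Arrays.fill(latencies, 190, 200, 400.0);
        Arrays.sort(latencies);

        // Right: p99 over the merged (full) distribution.
        System.out.println(percentile(latencies, 99));   // 400.0

        // Wrong: mean of the 100 per-second interval p99s
        // (90 intervals with p99=1ms, 10 intervals with p99=400ms).
        double meanOfP99s = (90 * 1.0 + 10 * 400.0) / 100;
        System.out.println(meanOfP99s);                  // 40.9
    }
}
```

The mean of the interval p99s lands at 40.9ms, an order of magnitude below the true p99 of 400ms, which is exactly the error described above.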
<p>
I hope that after reading this post, you start to worry as soon as you see percentiles being combined with means.
<h1>Gil Tene</h1>
Gil Tene has an excellent <a href="https://www.youtube.com/watch?v=lJ8ydIuPFeU">presentation</a> that any self-respecting performance engineer should watch.
In it he zooms in on typical problems, including coordinated omission and all kinds of statistical errors, such as combining means with percentiles.
Gil has also written the excellent <a href="https://github.com/HdrHistogram/HdrHistogram">HdrHistogram</a> library to track, for example, latency.
JIT and instanceof (2017-04-23)

One of the things I have always wondered about is how expensive an instanceof check is. In most cases this
is never a concern, but out of professional curiosity I wondered how well the JIT is able to optimize it.
Let's start with the simplest case: an instanceof check using a final class as the second argument. For this we use the
following example:
<pre>
public class InstanceOfCheck {

    public static void main(String[] args) {
        for (int k = 0; k < 100000; k++) {
            doSomething("");
        }
        System.out.println(doSomething(""));
    }

    public static String doSomething(Object o) {
        if (o instanceof String) {
            return "string";
        } else {
            return "nostring";
        }
    }
}
</pre>
For this post we use the following Java version:
<pre>
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
</pre>
I'm too lazy to update the JDK for my MacBook which I hardly use.
And the following JVM parameters:
<pre>
-XX:+UnlockDiagnosticVMOptions
-XX:PrintAssemblyOptions=intel
-XX:-TieredCompilation
-XX:-Inline
-XX:-BackgroundCompilation
-XX:CompileCommand=print,*InstanceOfCheck.doSomething
-XX:+PrintCompilation
</pre>
This will give us the Assembly of the 'doSomething' method.
When I run the above example, I get the following output:
<pre>
125 1 b java.lang.String::indexOf (70 bytes)
128 2 % b java.lang.String::indexOf @ 37 (70 bytes)
137 3 b java.lang.String::hashCode (55 bytes)
192 4 b sun.nio.cs.UTF_8$Encoder::encode (359 bytes)
213 5 b java.lang.String::charAt (29 bytes)
216 6 b java.io.UnixFileSystem::normalize (75 bytes)
220 7 b java.lang.String::lastIndexOf (52 bytes)
243 8 b com.InstanceOfCheck::doSomething (13 bytes)
Compiled method (c2) 244 8 com.InstanceOfCheck::doSomething (13 bytes)
total in heap [0x0000000106479e10,0x000000010647a090] = 640
relocation [0x0000000106479f30,0x0000000106479f48] = 24
main code [0x0000000106479f60,0x0000000106479fc0] = 96
stub code [0x0000000106479fc0,0x0000000106479fd8] = 24
oops [0x0000000106479fd8,0x0000000106479fe0] = 8
metadata [0x0000000106479fe0,0x0000000106479ff0] = 16
scopes data [0x0000000106479ff0,0x000000010647a018] = 40
scopes pcs [0x000000010647a018,0x000000010647a078] = 96
dependencies [0x000000010647a078,0x000000010647a080] = 8
nul chk table [0x000000010647a080,0x000000010647a090] = 16
Loaded disassembler from /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/hsdis-amd64.dylib
Decoding compiled method 0x0000000106479e10:
Code:
[Disassembling for mach='i386:x86-64']
[Entry Point]
[Verified Entry Point]
[Constants]
# {method} {0x000000010e77a480} 'doSomething' '(Ljava/lang/Object;)Ljava/lang/String;' in 'com/InstanceOfCheck'
# parm0: rsi:rsi = 'java/lang/Object'
# [sp+0x20] (sp of caller)
0x0000000106479f60: mov DWORD PTR [rsp-0x14000],eax
0x0000000106479f67: push rbp
0x0000000106479f68: sub rsp,0x10 ;*synchronization entry
; - com.InstanceOfCheck::doSomething@-1 (line 14)
0x0000000106479f6c: mov r11d,DWORD PTR [rsi+0x8] ; implicit exception: dispatches to 0x0000000106479fa1
0x0000000106479f70: cmp r11d,0xf80002da ; {metadata('java/lang/String')}
0x0000000106479f77: jne 0x0000000106479f8f
0x0000000106479f79: movabs rax,0x7955dfce0 ; {oop("string")}
0x0000000106479f83: add rsp,0x10
0x0000000106479f87: pop rbp
0x0000000106479f88: test DWORD PTR [rip+0xffffffffff73f072],eax # 0x0000000105bb9000
; {poll_return}
0x0000000106479f8e: ret
0x0000000106479f8f: mov rbp,rsi
0x0000000106479f92: mov esi,0xffffffde
0x0000000106479f97: call 0x0000000106447120 ; OopMap{rbp=Oop off=60}
;*instanceof
; - com.InstanceOfCheck::doSomething@1 (line 14)
; {runtime_call}
0x0000000106479f9c: call 0x0000000104fdfe44 ; {runtime_call}
0x0000000106479fa1: mov esi,0xfffffff4
0x0000000106479fa6: nop
0x0000000106479fa7: call 0x0000000106447120 ; OopMap{off=76}
;*instanceof
; - com.InstanceOfCheck::doSomething@1 (line 14)
; {runtime_call}
0x0000000106479fac: call 0x0000000104fdfe44 ;*instanceof
; - com.InstanceOfCheck::doSomething@1 (line 14)
; {runtime_call}
0x0000000106479fb1: hlt
0x0000000106479fb2: hlt
0x0000000106479fb3: hlt
0x0000000106479fb4: hlt
0x0000000106479fb5: hlt
0x0000000106479fb6: hlt
0x0000000106479fb7: hlt
0x0000000106479fb8: hlt
0x0000000106479fb9: hlt
0x0000000106479fba: hlt
0x0000000106479fbb: hlt
0x0000000106479fbc: hlt
0x0000000106479fbd: hlt
0x0000000106479fbe: hlt
0x0000000106479fbf: hlt
[Exception Handler]
[Stub Code]
0x0000000106479fc0: jmp 0x000000010646bf60 ; {no_reloc}
[Deopt Handler Code]
0x0000000106479fc5: call 0x0000000106479fca
0x0000000106479fca: sub QWORD PTR [rsp],0x5
0x0000000106479fcf: jmp 0x0000000106446d00 ; {runtime_call}
0x0000000106479fd4: hlt
0x0000000106479fd5: hlt
0x0000000106479fd6: hlt
0x0000000106479fd7: hlt
OopMapSet contains 2 OopMaps
#0
OopMap{rbp=Oop off=60}
#1
OopMap{off=76}
247 9 % b com.InstanceOfCheck::main @ 2 (32 bytes)
248 9 % com.InstanceOfCheck::main @ -2 (32 bytes) made not entrant
string
</pre>
After filtering out the non-relevant parts, the following remains:
<pre>
0x0000000106479f6c: mov r11d,DWORD PTR [rsi+0x8] ; implicit exception: dispatches to 0x0000000106479fa1
0x0000000106479f70: cmp r11d,0xf80002da ; {metadata('java/lang/String')}
0x0000000106479f77: jne 0x0000000106479f8f
0x0000000106479f79: movabs rax,0x7955dfce0 ; {oop("string")}
0x0000000106479f83: add rsp,0x10
0x0000000106479f87: pop rbp
0x0000000106479f88: test DWORD PTR [rip+0xffffffffff73f072],eax # 0x0000000105bb9000
; {poll_return}
0x0000000106479f8e: ret
</pre>
The first instruction loads the class address of the 'o' argument into register r11d. The second instruction compares it
with the address of the String class. If the addresses are not equal, we jump to the uncommon trap, since we have only
passed String arguments to the 'doSomething' method.
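For reference, the instanceof-based method producing this Assembly can be reconstructed from the line references in the listing; the following is a sketch following the structure of the later examples (the full source appears earlier in the post):

```java
public class InstanceOfCheck {
    public static void main(String[] args) {
        String s = "hello";
        // Warm up so the C2 compiler compiles doSomething.
        for (int k = 0; k < 100000; k++) {
            doSomething(s);
        }
        System.out.println(doSomething(s));
    }

    public static String doSomething(Object o) {
        if (o instanceof String) {
            return "string";
        } else {
            return "nostring";
        }
    }
}
```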
So it seems that doing an instanceof check against a final class is very cheap. And we get exactly the same Assembly from a direct
class comparison in the Java code:
<pre>
public static String doSomething(Object o) {
if (o.getClass() == String.class) {
return "string";
} else {
return "nostring";
}
}
</pre>
Which gives the following relevant Assembly:
<pre>
0x000000010e0b296c: mov r11d,DWORD PTR [rsi+0x8] ;*invokevirtual getClass
; - com.InstanceOfCheck::doSomething@1 (line 14)
; implicit exception: dispatches to 0x000000010e0b29b1
0x000000010e0b2970: cmp r11d,0xf80002da ; {metadata('java/lang/String')}
0x000000010e0b2977: jne 0x000000010e0b298f ;*if_acmpne
; - com.InstanceOfCheck::doSomething@6 (line 14)
0x000000010e0b2979: movabs rax,0x7955db818 ; {oop("string")}
0x000000010e0b2983: add rsp,0x10
0x000000010e0b2987: pop rbp
0x000000010e0b2988: test DWORD PTR [rip+0xffffffffff72a672],eax # 0x000000010d7dd000
; {poll_return}
0x000000010e0b298e: ret
</pre>
<h2>Non final but concrete class</h2>
In the previous example we saw that the JIT perfectly optimized an instanceof check against a final class. But what about a non-final concrete class such as LinkedList? Will the JIT be able to optimize that too? To find out, we modify the example:
<pre>
public class InstanceOfCheck {
public static void main(String[] args) {
List l = new LinkedList();
for (int k = 0; k < 100000; k++) {
doSomething(l);
}
System.out.println(doSomething(l));
}
public static String doSomething(Object o) {
if (o.getClass() == LinkedList.class) {
return "linkedlist";
} else {
return "nolinkedlist";
}
}
}
</pre>
If we extract only the relevant Assembly:
<pre>
0x000000010307ccec: mov r11d,DWORD PTR [rsi+0x8] ;*invokevirtual getClass
; - com.InstanceOfCheck::doSomething@1 (line 15)
; implicit exception: dispatches to 0x000000010307cd31
0x000000010307ccf0: cmp r11d,0xf8008c83 ; {metadata('java/util/LinkedList')}
0x000000010307ccf7: je 0x000000010307cd0f ;*if_acmpne
; - com.InstanceOfCheck::doSomething@6 (line 15)
0x000000010307ccf9: movabs rax,0x7955dd8d0 ; {oop("nostring")}
0x000000010307cd03: add rsp,0x10
0x000000010307cd07: pop rbp
0x000000010307cd08: test DWORD PTR [rip+0xffffffffff6642f2],eax # 0x00000001026e1000
; {poll_return}
0x000000010307cd0e: ret
</pre>
This Assembly looks very similar to the String version. The JIT sees that we have only passed arguments of type LinkedList, so it optimizes the check into a cheap class-address comparison and adds an uncommon trap in case an argument of a different type is passed.
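As a hypothetical variation (my own sketch, not from the original post), we can make the uncommon trap fire by passing a different type after warm-up; with -XX:+PrintCompilation the compiled method then shows up as "made not entrant" and gets recompiled:

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Warm up with LinkedList, then pass an ArrayList. The speculative class
// check fails, the uncommon trap fires, and the JIT deoptimizes and
// recompiles doSomething.
public class InstanceOfDeopt {
    public static void main(String[] args) {
        List l = new LinkedList();
        for (int k = 0; k < 100000; k++) {
            doSomething(l);               // JIT speculates: always LinkedList
        }
        System.out.println(doSomething(new ArrayList())); // triggers the trap
    }

    public static String doSomething(Object o) {
        if (o.getClass() == LinkedList.class) {
            return "linkedlist";
        } else {
            return "nolinkedlist";
        }
    }
}
```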
<h2>Non concrete class check?</h2>
What would happen if we checked instanceof against a non-concrete type like List? In this case it isn't possible to do a cheap class comparison in general:
<pre>
public class InstanceOfCheck {
public static void main(String[] args) {
List l = new LinkedList();
for (int k = 0; k < 100000; k++) {
doSomething(l);
}
System.out.println(doSomething(l));
}
public static String doSomething(Object o) {
if (o instanceof List) {
return "list";
} else {
return "nolist";
}
}
}
</pre>
When we extract the relevant Assembly we get the following:
<pre>
0x000000010807e66c: mov r11d,DWORD PTR [rsi+0x8] ; implicit exception: dispatches to 0x000000010807e6a1
0x000000010807e670: cmp r11d,0xf8008c83 ; {metadata('java/util/LinkedList')}
0x000000010807e677: jne 0x000000010807e68f
0x000000010807e679: movabs rax,0x7955ddf58 ; {oop("list")}
0x000000010807e683: add rsp,0x10
0x000000010807e687: pop rbp
0x000000010807e688: test DWORD PTR [rip+0xffffffffff61f972],eax # 0x000000010769e000
; {poll_return}
0x000000010807e68e: ret
</pre>
This is a nice surprise! Because we have only passed LinkedLists as arguments, the instanceof check against List
is implemented as a cheap pointer comparison with the LinkedList class address. If anything other than a LinkedList is
passed, an uncommon trap is executed. So even in this case the JIT is able to optimize!
pveentjerhttp://www.blogger.com/profile/17847641595368096163noreply@blogger.com0tag:blogger.com,1999:blog-725906748590083291.post-85202974822908996862017-01-30T06:15:00.000-08:002017-01-30T20:30:44.409-08:00JIT: Field of interface type and single implementation.
In my previous <a href="http://pveentjer.blogspot.bg/2017/01/devirtualization.html">post</a>
the topic was devirtualization: an optimization applied by the JIT to reduce the overhead of
virtual method calls.
<p/>
In this post I'll analyze the implications of virtual method calls on a field with an interface-based
type when only a single implementation of that interface is loaded. This situation is very common in
applications and is considered good practice, e.g. to simplify testing by stubbing/mocking the dependency,
while in production only a single implementation of the interface is loaded.
</p>
For this post I'll use the following interface. It has a single method 'size' that returns an
arbitrary value.
<pre>
interface Service {
int size();
}
</pre>
There is also a very basic implementation with a 'size' method that can easily be inlined:
<pre>
class ServiceImpl implements Service {
private int size = 1;
@Override
public int size() {
return size;
}
}
</pre>
And we have the following program:
<pre>
public class SingleImplementation {
private final Service service;
public SingleImplementation(Service service) {
this.service = service;
}
public int sizePlusOne() {
return service.size() + 1;
}
public static void main(String[] args) {
SingleImplementation f = new SingleImplementation(new ServiceImpl());
int result = 0;
for (int k = 0; k < 100_000; k++) {
result += f.sizePlusOne();
}
System.out.println("result:" + result);
}
}
</pre>
The focus will be on the 'sizePlusOne' method. The logic isn't very exciting, but it
produces easy-to-understand Assembly.
<p/>
For this post we have the following assumptions:
<ul>
<li>we only care about the output of the C2 compiler</li>
<li>we are using Java hotspot 1.8.0_91</li>
</ul>
<h1>Assembly</h1>
The question is whether there is any price to pay for using an interface as the field type.
<p/>
To determine this we'll be outputting the Assembly using the following JVM settings:
<pre>
-XX:+UnlockDiagnosticVMOptions
-XX:PrintAssemblyOptions=intel
-XX:-TieredCompilation
-XX:-Inline
-XX:CompileCommand=print,*SingleImplementation.sizePlusOne
</pre>
Tiered compilation is disabled since we only care about the C2 Assembly. Inlining is disabled
so we can focus on the 'sizePlusOne' method; otherwise it would be inlined into the 'main' method
and the JIT could eliminate the whole loop and precompute the end value.
<p/>
After running the program the following Assembly is emitted:
<pre>
Compiled method (c2) 169 8 com.liskov.SingleImplementation::sizePlusOne (12 bytes)
total in heap [0x000000010304d0d0,0x000000010304d3a8] = 728
relocation [0x000000010304d1f0,0x000000010304d208] = 24
main code [0x000000010304d220,0x000000010304d2a0] = 128
stub code [0x000000010304d2a0,0x000000010304d2b8] = 24
oops [0x000000010304d2b8,0x000000010304d2c0] = 8
metadata [0x000000010304d2c0,0x000000010304d2d0] = 16
scopes data [0x000000010304d2d0,0x000000010304d300] = 48
scopes pcs [0x000000010304d300,0x000000010304d390] = 144
dependencies [0x000000010304d390,0x000000010304d398] = 8
nul chk table [0x000000010304d398,0x000000010304d3a8] = 16
Loaded disassembler from /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/hsdis-amd64.dylib
Decoding compiled method 0x000000010304d0d0:
Code:
[Disassembling for mach='i386:x86-64']
[Entry Point]
[Constants]
# {method} {0x00000001cb5b7478} 'sizePlusOne' '()I' in 'com/liskov/SingleImplementation'
# [sp+0x20] (sp of caller)
0x000000010304d220: mov r10d,DWORD PTR [rsi+0x8]
0x000000010304d224: shl r10,0x3
0x000000010304d228: cmp rax,r10
0x000000010304d22b: jne 0x0000000103019b60 ; {runtime_call}
0x000000010304d231: data32 xchg ax,ax
0x000000010304d234: nop DWORD PTR [rax+rax*1+0x0]
0x000000010304d23c: data32 data32 xchg ax,ax
[Verified Entry Point]
0x000000010304d240: mov DWORD PTR [rsp-0x14000],eax
0x000000010304d247: push rbp
0x000000010304d248: sub rsp,0x10 ;*synchronization entry
; - com.liskov.SingleImplementation::sizePlusOne@-1 (line 12)
0x000000010304d24c: mov r11d,DWORD PTR [rsi+0xc] ;*getfield service
; - com.liskov.SingleImplementation::sizePlusOne@1 (line 12)
0x000000010304d250: mov r10d,DWORD PTR [r12+r11*8+0x8]
; implicit exception: dispatches to 0x000000010304d289
0x000000010304d255: cmp r10d,0x31642b42 ; {metadata('com/liskov/ServiceImpl')}
0x000000010304d25c: jne 0x000000010304d274
0x000000010304d25e: lea r10,[r12+r11*8] ;*invokeinterface size
; - com.liskov.SingleImplementation::sizePlusOne@4 (line 12)
0x000000010304d262: mov eax,DWORD PTR [r10+0xc]
0x000000010304d266: inc eax ;*iadd
; - com.liskov.SingleImplementation::sizePlusOne@10 (line 12)
0x000000010304d268: add rsp,0x10
0x000000010304d26c: pop rbp
0x000000010304d26d: test DWORD PTR [rip+0xfffffffffe7b1d8d],eax # 0x00000001017ff000
; {poll_return}
0x000000010304d273: ret
0x000000010304d274: mov esi,0xffffffde
0x000000010304d279: mov ebp,r11d
0x000000010304d27c: data32 xchg ax,ax
0x000000010304d27f: call 0x000000010301b120 ; OopMap{rbp=NarrowOop off=100}
;*invokeinterface size
; - com.liskov.SingleImplementation::sizePlusOne@4 (line 12)
; {runtime_call}
0x000000010304d284: call 0x0000000102440e44 ; {runtime_call}
0x000000010304d289: mov esi,0xfffffff6
0x000000010304d28e: nop
0x000000010304d28f: call 0x000000010301b120 ; OopMap{off=116}
;*invokeinterface size
; - com.liskov.SingleImplementation::sizePlusOne@4 (line 12)
; {runtime_call}
0x000000010304d294: call 0x0000000102440e44 ;*invokeinterface size
; - com.liskov.SingleImplementation::sizePlusOne@4 (line 12)
; {runtime_call}
0x000000010304d299: hlt
0x000000010304d29a: hlt
0x000000010304d29b: hlt
0x000000010304d29c: hlt
0x000000010304d29d: hlt
0x000000010304d29e: hlt
0x000000010304d29f: hlt
[Exception Handler]
[Stub Code]
0x000000010304d2a0: jmp 0x000000010303ff60 ; {no_reloc}
[Deopt Handler Code]
0x000000010304d2a5: call 0x000000010304d2aa
0x000000010304d2aa: sub QWORD PTR [rsp],0x5
0x000000010304d2af: jmp 0x000000010301ad00 ; {runtime_call}
0x000000010304d2b4: hlt
0x000000010304d2b5: hlt
0x000000010304d2b6: hlt
0x000000010304d2b7: hlt
OopMapSet contains 2 OopMaps
#0
OopMap{rbp=NarrowOop off=100}
#1
OopMap{off=116}
</pre>
<p/>
The following code is the relevant Assembly:
<pre>
0x000000010304d24c: mov r11d,DWORD PTR [rsi+0xc]
; loads the 'service' field in r11d.
0x000000010304d250: mov r10d,DWORD PTR [r12+r11*8+0x8]
; loads the address of the class of the service object into r10d
0x000000010304d255: cmp r10d,0x31642b42
; compare the address of the class with the address of the ServiceImpl class
0x000000010304d25c: jne 0x000000010304d274
; if it isn't of type ServiceImpl, jump to the uncommon trap
0x000000010304d25e: lea r10,[r12+r11*8] ;*invokeinterface size
; - com.liskov.SingleImplementation::sizePlusOne@4 (line 12)
0x000000010304d262: mov eax,DWORD PTR [r10+0xc]
; copy the size field into eax
0x000000010304d266: inc eax
; add 1 to eax
</pre>
We can conclude that the JIT has removed the virtual call and has even inlined the ServiceImpl.size method.
However, we can also see a typeguard. This is a bit surprising because only
a single Service implementation is loaded. If a conflicting class were loaded in the future, class
hierarchy analysis should spot this and trigger a deoptimization of the code. Therefore there should be
no need for a guard and an uncommon trap.
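To see why the guard matters at all, here is a hypothetical extension (my own sketch, not from the original post; the class names OtherServiceImpl and TwoImplementations are invented): if a second implementation reaches the call site after warm-up, the guard catches the new type and the method is deoptimized (visible as "made not entrant" with -XX:+PrintCompilation) instead of returning a wrong result:

```java
interface Service {
    int size();
}

class ServiceImpl implements Service {
    private int size = 1;
    @Override public int size() { return size; }
}

// A second implementation, only used after warm-up.
class OtherServiceImpl implements Service {
    @Override public int size() { return 2; }
}

public class TwoImplementations {
    private final Service service;

    public TwoImplementations(Service service) { this.service = service; }

    public int sizePlusOne() { return service.size() + 1; }

    public static void main(String[] args) {
        TwoImplementations f = new TwoImplementations(new ServiceImpl());
        int result = 0;
        for (int k = 0; k < 100000; k++) {
            result += f.sizePlusOne();   // compiled speculating on ServiceImpl
        }
        // The typeguard fails for the new type; the compiled code is
        // deoptimized and recompiled.
        result += new TwoImplementations(new OtherServiceImpl()).sizePlusOne();
        System.out.println("result:" + result);
    }
}
```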
<h1>Abstract class to the rescue</h1>
I have posted the question why this typeguard was added on the
<a href="http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2017-January/025485.html">hotspot compiler dev mailinglist</a>.
And luckily <a href="https://shipilev.net/">Aleksey Shipilev</a> pointed out the cause of the problem: class hierarchy analysis
on Hotspot doesn't deal with interfaces correctly.
<p/>
Therefore I switched to an abstract class to see if the JIT is able to get rid of this typeguard.
<pre>
abstract class Service {
abstract int size();
}
class ServiceImpl extends Service {
private int size = 1;
@Override
public int size() {
return size;
}
}
</pre>
If the program is rerun, the following Assembly is emitted:
<pre>
Compiled method (c2) 186 8 com.liskov.SingleImplementation::sizePlusOne (10 bytes)
total in heap [0x000000010563c250,0x000000010563c4d0] = 640
relocation [0x000000010563c370,0x000000010563c380] = 16
main code [0x000000010563c380,0x000000010563c3e0] = 96
stub code [0x000000010563c3e0,0x000000010563c3f8] = 24
oops [0x000000010563c3f8,0x000000010563c400] = 8
metadata [0x000000010563c400,0x000000010563c420] = 32
scopes data [0x000000010563c420,0x000000010563c448] = 40
scopes pcs [0x000000010563c448,0x000000010563c4b8] = 112
dependencies [0x000000010563c4b8,0x000000010563c4c0] = 8
nul chk table [0x000000010563c4c0,0x000000010563c4d0] = 16
Loaded disassembler from /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/hsdis-amd64.dylib
Decoding compiled method 0x000000010563c250:
Code:
[Disassembling for mach='i386:x86-64']
[Entry Point]
[Constants]
# {method} {0x000000010d964478} 'sizePlusOne' '()I' in 'com/liskov/SingleImplementation'
# [sp+0x20] (sp of caller)
0x000000010563c380: mov r10d,DWORD PTR [rsi+0x8]
0x000000010563c384: shl r10,0x3
0x000000010563c388: cmp rax,r10
0x000000010563c38b: jne 0x0000000105607b60 ; {runtime_call}
0x000000010563c391: data32 xchg ax,ax
0x000000010563c394: nop DWORD PTR [rax+rax*1+0x0]
0x000000010563c39c: data32 data32 xchg ax,ax
[Verified Entry Point]
0x000000010563c3a0: mov DWORD PTR [rsp-0x14000],eax
0x000000010563c3a7: push rbp
0x000000010563c3a8: sub rsp,0x10 ;*synchronization entry
; - com.liskov.SingleImplementation::sizePlusOne@-1 (line 12)
0x000000010563c3ac: mov r11d,DWORD PTR [rsi+0xc] ;*getfield service
; - com.liskov.SingleImplementation::sizePlusOne@1 (line 12)
0x000000010563c3b0: mov eax,DWORD PTR [r12+r11*8+0xc]
; implicit exception: dispatches to 0x000000010563c3c3
0x000000010563c3b5: inc eax ;*iadd
; - com.liskov.SingleImplementation::sizePlusOne@8 (line 12)
0x000000010563c3b7: add rsp,0x10
0x000000010563c3bb: pop rbp
0x000000010563c3bc: test DWORD PTR [rip+0xfffffffffe790c3e],eax # 0x0000000103dcd000
; {poll_return}
0x000000010563c3c2: ret
0x000000010563c3c3: mov esi,0xfffffff6
0x000000010563c3c8: data32 xchg ax,ax
0x000000010563c3cb: call 0x0000000105609120 ; OopMap{off=80}
;*invokevirtual size
; - com.liskov.SingleImplementation::sizePlusOne@4 (line 12)
; {runtime_call}
0x000000010563c3d0: call 0x0000000104a40e44 ;*invokevirtual size
; - com.liskov.SingleImplementation::sizePlusOne@4 (line 12)
; {runtime_call}
0x000000010563c3d5: hlt
0x000000010563c3d6: hlt
0x000000010563c3d7: hlt
0x000000010563c3d8: hlt
0x000000010563c3d9: hlt
0x000000010563c3da: hlt
0x000000010563c3db: hlt
0x000000010563c3dc: hlt
0x000000010563c3dd: hlt
0x000000010563c3de: hlt
0x000000010563c3df: hlt
[Exception Handler]
[Stub Code]
0x000000010563c3e0: jmp 0x000000010562df60 ; {no_reloc}
[Deopt Handler Code]
0x000000010563c3e5: call 0x000000010563c3ea
0x000000010563c3ea: sub QWORD PTR [rsp],0x5
0x000000010563c3ef: jmp 0x0000000105608d00 ; {runtime_call}
0x000000010563c3f4: hlt
0x000000010563c3f5: hlt
0x000000010563c3f6: hlt
0x000000010563c3f7: hlt
OopMapSet contains 1 OopMaps
#0
OopMap{off=80}
</pre>
After removing all noise, the following Assembly remains:
<pre>
0x000000010563c3ac: mov r11d,DWORD PTR [rsi+0xc] ;*getfield service
;; load the service field into register r11d
0x000000010563c3b0: mov eax,DWORD PTR [r12+r11*8+0xc]
;; load the size field of the service into register eax
0x000000010563c3b5: inc eax
;; add one to eax
</pre>
We can immediately see that the typeguard is gone.
<h1>Conclusions</h1>
If only a single implementation is loaded and the field uses an:
<ol>
<li>interface based type, the method gets inlined but a typeguard is added.</li>
<li>abstract class based type, the method gets inlined and there is no typeguard.</li>
</ol>
So in the case of an abstract-class-based field, there is no penalty to pay for applying the Liskov Substitution Principle. In the case of an interface-based field, there is an extra, unwanted typeguard.
<h1>Note</h1>
Please don't go and replace all your interfaces with abstract classes. In most cases code will be slow for reasons other than a simple typeguard. If the guard gets executed frequently, the branch predictor will make the right prediction. And hopefully in the near future the problems in the class hierarchy analysis will be resolved.pveentjerhttp://www.blogger.com/profile/17847641595368096163noreply@blogger.com2tag:blogger.com,1999:blog-725906748590083291.post-1357583063328858732017-01-25T09:24:00.000-08:002017-01-26T07:06:58.953-08:00final static boolean & JIT
For this post we are going to look at the cost of having a final static boolean in the code. Such booleans can be very useful to enable or disable certain behavior, e.g. tracing or logging. The question is what kind of performance implications they have.
<p/>
The reason for making this post is that I didn't know the implications myself and asked the <a href="https://groups.google.com/forum/#!topic/mechanical-sympathy/bLgyv1NynvY">question</a> on the Mechanical Sympathy mailing list. So I would like to thank the people on this mailing list for answering my question.
<p/>
For this post we have the following assumptions:
<ul>
<li>we only care about the output of the C2 compiler</li>
<li>we are using Java hotspot 1.8.0_91</li>
</ul>
<h1>Constant expression</h1>
Let's start with the most basic case where the final static field is initialized using a constant expression:
<pre>
public class StaticFinal_ConstantExpression {
public static void main(String[] args) {
int result = 0;
for (int k = 0; k < 100_000; k++) {
result += doMath(k);
}
System.out.println(result);
}
final static boolean ENABLED = true;
public static int doMath(int a) {
if (ENABLED) {
return a + 1;
} else {
return a - 1;
}
}
}
</pre>
The actual logic in 'doMath' isn't terribly exciting; its main purpose is to provide easy-to-understand bytecode and Assembly.
<p/>
When we check the bytecode for the 'doMath' method using 'javap -c StaticFinal_ConstantExpression.class' we get the following:
<pre>
public static int doMath(int);
Code:
0: iload_0
1: iconst_1
2: iadd
3: ireturn
</pre>
If we would convert this back to Java we would get:
<pre>
public static int doMath(int a) {
return a + 1;
}
</pre>
The Javac has propagated the ENABLED constant and completely removed the dead code. We don't even need to look at the Assembly.
<p/>
Be careful with final statics and constant expressions; if the value is changed and one or more classes that read this value are not recompiled, they will not see the new value.
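This pitfall can be sketched as follows (my own illustration; the class names Flags and Consumer are hypothetical):

```java
// Flags.ENABLED is a compile-time constant expression, so javac copies its
// value into every class that reads it.
class Flags {
    static final boolean ENABLED = true;
}

public class Consumer {
    public static String mode() {
        // 'true' is baked into Consumer's bytecode here; recompiling only
        // Flags with ENABLED = false would leave this method returning "on"
        // until Consumer is recompiled as well.
        return Flags.ENABLED ? "on" : "off";
    }

    public static void main(String[] args) {
        System.out.println(mode());
    }
}
```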
<h1>Non constant expression</h1>
In the previous example there was a hard-coded constant value for ENABLED. In practice you often want something more flexible, e.g. using some kind of system property. So let's change the ENABLED initialization so it gets its value from the system property 'enabled'.
<pre>
public class StaticFinal_NonConstantExpression {
public static void main(String[] args) {
int result = 0;
for (int k = 0; k < 100_000; k++) {
result += doMath(k);
}
System.out.println(result);
}
final static boolean ENABLED = Boolean.getBoolean("enabled");
public static int doMath(int a) {
if (ENABLED) {
return a + 1;
} else {
return a - 1;
}
}
}
</pre>
And if we display the relevant bytecode using 'javap -c StaticFinal_NonConstantExpression.class', we get the following.
<pre>
static final boolean ENABLED;
public static int doMath(int);
Code:
0: getstatic #6 // Field ENABLED:Z
3: ifeq 10
6: iload_0
7: iconst_1
8: iadd
9: ireturn
10: iload_0
11: iconst_1
12: isub
13: ireturn
static {};
Code:
0: ldc #7 // String enabled
2: invokestatic #8 // Method java/lang/Boolean.getBoolean:(Ljava/lang/String;)Z
5: putstatic #6 // Field ENABLED:Z
8: return
</pre>
We can see that 'doMath' still contains the check and the logic for both branches. The Javac has not made any optimizations since it doesn't know what value ENABLED will have at runtime.
<p/>
Let's go a level deeper and see what kind of Assembly we are going to get. To display the Assembly, we'll use the following parameters:
<pre>
-XX:+UnlockDiagnosticVMOptions
-XX:PrintAssemblyOptions=intel
-XX:-TieredCompilation
-XX:-Inline
-XX:CompileCommand=print,*.doMath
-Denabled=true
</pre>
Tiered compilation is disabled since we are only interested in the C2 output. Inlining is disabled to prevent the 'doMath' method from getting inlined into the main loop. Also, we set the 'enabled' system property to true.
<p/>
When we run the program, we get the following Assembly:
<pre>
Compiled method (c2) 248 8 com.constant_folding.StaticFinal_NonConstantExpression::doMath (14 bytes)
total in heap [0x00000001083a7a90,0x00000001083a7c60] = 464
relocation [0x00000001083a7bb0,0x00000001083a7bb8] = 8
main code [0x00000001083a7bc0,0x00000001083a7be0] = 32
stub code [0x00000001083a7be0,0x00000001083a7bf8] = 24
oops [0x00000001083a7bf8,0x00000001083a7c00] = 8
metadata [0x00000001083a7c00,0x00000001083a7c08] = 8
scopes data [0x00000001083a7c08,0x00000001083a7c18] = 16
scopes pcs [0x00000001083a7c18,0x00000001083a7c58] = 64
dependencies [0x00000001083a7c58,0x00000001083a7c60] = 8
Loaded disassembler from /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/hsdis-amd64.dylib
Decoding compiled method 0x00000001083a7a90:
Code:
[Disassembling for mach='i386:x86-64']
[Entry Point]
[Verified Entry Point]
[Constants]
# {method} {0x00000001d08e24b8} 'doMath' '(I)I' in 'com/constant_folding/StaticFinal_NonConstantExpression'
# parm0: rsi = int
# [sp+0x20] (sp of caller)
0x00000001083a7bc0: sub rsp,0x18
0x00000001083a7bc7: mov QWORD PTR [rsp+0x10],rbp ;*synchronization entry
; - com.constant_folding.StaticFinal_NonConstantExpression::doMath@-1 (line 16)
0x00000001083a7bcc: mov eax,esi
0x00000001083a7bce: inc eax ;*iadd
; - com.constant_folding.StaticFinal_NonConstantExpression::doMath@8 (line 17)
0x00000001083a7bd0: add rsp,0x10
0x00000001083a7bd4: pop rbp
0x00000001083a7bd5: test DWORD PTR [rip+0xfffffffffff74425],eax # 0x000000010831c000
; {poll_return}
0x00000001083a7bdb: ret
0x00000001083a7bdc: hlt
0x00000001083a7bdd: hlt
0x00000001083a7bde: hlt
0x00000001083a7bdf: hlt
[Exception Handler]
[Stub Code]
0x00000001083a7be0: jmp 0x000000010839af60 ; {no_reloc}
[Deopt Handler Code]
0x00000001083a7be5: call 0x00000001083a7bea
0x00000001083a7bea: sub QWORD PTR [rsp],0x5
0x00000001083a7bef: jmp 0x0000000108375d00 ; {runtime_call}
0x00000001083a7bf4: hlt
0x00000001083a7bf5: hlt
0x00000001083a7bf6: hlt
0x00000001083a7bf7: hlt
OopMapSet contains 0 OopMaps
</pre>
That's a lot of output. Let's remove everything that isn't relevant:
<pre>
0x00000001083a7bcc: mov eax,esi
;; copy the content of 'a' into eax
0x00000001083a7bce: inc eax
;; increase eax by one
</pre>
The JIT has propagated the ENABLED constant and removed the dead code.
<p/>
If we run with '-Denabled=false', we'll get similar Assembly:
<pre>
0x000000010b4eb7cc: mov eax,esi
0x000000010b4eb7ce: dec eax ;*isub
; - com.constant_folding.StaticFinal_NonConstantExpression::doMath@12 (line 19)
</pre>
So also in this case the JIT has propagated the constant and removed the dead code.
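By contrast (my own sketch, not from the post), if the 'final' keyword is dropped, the JIT should no longer be able to treat the field as a constant: the value may change at any time, so both branches stay alive and the field is loaded on every call.

```java
// Without 'final', ENABLED may change at any time, so the JIT has to load it
// on every call instead of constant-folding it and eliminating a branch.
public class StaticNonFinal {
    static boolean ENABLED = Boolean.getBoolean("enabled");

    public static int doMath(int a) {
        if (ENABLED) {
            return a + 1;
        } else {
            return a - 1;
        }
    }

    public static void main(String[] args) {
        int result = 0;
        for (int k = 0; k < 100000; k++) {
            result += doMath(k);
        }
        System.out.println(result);
    }
}
```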
<h1>Original size of bytecode matters</h1>
So it seems that we can freely use a static final with a non-constant expression to enable or disable certain behavior. Unfortunately this isn't entirely true: inlining can still be prevented, because the decision to inline is based on the original bytecode size. To demonstrate this we'll use the following code:
<pre>
public class StaticFinal_OriginalSizeMatters {
public static void main(String[] args) {
int result = 0;
for (int k = 0; k < 1_000_000; k++) {
result += doMath(k);
}
System.out.println(result);
}
final static boolean ENABLED = Boolean.getBoolean("enabled");
public static int doMath(int a) {
if (ENABLED) {
System.out.print("n");
System.out.print("e");
System.out.print("v");
System.out.print("e");
System.out.print("r");
return a + 1;
} else {
return a - 1;
}
}
}
</pre>
When we run using:
<pre>
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintInlining
-XX:FreqInlineSize=50
-Denabled=false
</pre>
We'll see the following output:
<pre>
@ 12 com.constant_folding.StaticFinal_OriginalSizeMatters::doMath (54 bytes) callee is too large
@ 27 java/io/PrintStream::println (not loaded) not inlineable
@ 12 com.constant_folding.StaticFinal_OriginalSizeMatters::doMath (54 bytes) callee is too large
@ 27 java/io/PrintStream::println (not loaded) not inlineable
@ 12 com.constant_folding.StaticFinal_OriginalSizeMatters::doMath (54 bytes) hot method too big
</pre>
So even though ENABLED is false, the method is still too large to get inlined, because the original bytecode size is used.
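A common workaround (my own sketch, not from the original post; the class and method names are invented) is to move the bulky, rarely-taken branch into its own method, so the hot method's original bytecode stays small enough to be inlined:

```java
public class StaticFinal_SmallHotMethod {
    final static boolean ENABLED = Boolean.getBoolean("enabled");

    // The hot method is now only a few bytecodes, so its original size no
    // longer prevents inlining.
    public static int doMath(int a) {
        if (ENABLED) {
            return doMathTraced(a);
        }
        return a - 1;
    }

    // The rarely-taken tracing branch lives in a separate method.
    private static int doMathTraced(int a) {
        System.out.print("n");
        System.out.print("e");
        System.out.print("v");
        System.out.print("e");
        System.out.print("r");
        return a + 1;
    }

    public static void main(String[] args) {
        int result = 0;
        for (int k = 0; k < 1000000; k++) {
            result += doMath(k);
        }
        System.out.println(result);
    }
}
```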
<h1>Conclusion</h1>
A final static boolean with a constant expression is completely free. The Javac will do the constant propagation and dead code elimination and there is no price to pay.
<p/>
A final static boolean with a non-constant expression will be fully optimized by the JIT. However, inlining can still be prevented, because the original size of the bytecode determines whether something gets inlined, not what the JIT made out of it.pveentjerhttp://www.blogger.com/profile/17847641595368096163noreply@blogger.com2tag:blogger.com,1999:blog-725906748590083291.post-45049837315957431502017-01-23T11:54:00.001-08:002017-01-23T11:55:01.399-08:00Devirtualization
When a non-static method is called on an interface or class, an `invokeinterface` or
`invokevirtual` bytecode is emitted. This makes polymorphic calls possible: the method
implementation to execute is determined at runtime.
<p/>
The following useless method is used in this post.
<pre>
public static int sizePlusOne(Collection c) {
return c.size() + 1;
}
</pre>
The sizePlusOne function calls the size method on the passed collection and increments
the result by one.
<p/>
If we view the bytecode using 'javap -c SomeClass' then we get the following output:
<pre>
public static int sizePlusOne(java.util.Collection);
Code:
0: aload_0
1: invokeinterface #18, 1 // InterfaceMethod java/util/Collection.size:()I
6: iconst_1
7: iadd
8: ireturn
</pre>
The problem is that an `invokeinterface` or `invokevirtual` is a dynamic dispatch instead
of a static dispatch. A dynamic dispatch involves more complexity because a vtable lookup
is required, and it prevents inlining since the concrete method implementation isn't known.
<p/>
What the JIT does is speculate that there will be at most 1 or 2 types, and instead of
doing a dynamic dispatch it emits some cheap guards followed by static dispatches. To test this
we'll be using Hotspot 1.8.0_91; keep in mind that in the future these optimizations could be
implemented differently.
<p/>
For this post we assume that:
<ul>
<li>the sizePlusOne method is called equally often with the different types; there is no dominant type.</li>
<li>there are many loaded implementations of the interface. Having only a single loaded implementation
deserves a post of its own.</li>
<li>only the C2 code is relevant</li>
</ul>
<h1>Monomorphic callsite</h1>
The simplest version of the optimization is a monomorphic call site: only a single type
is used at the call site.
</p>
To demonstrate the monomorphic callsite, we use the following example:
<pre>
public class Devirtualization_MonoMorphic {
public static void main(String[] args) {
ArrayList arrayList = new ArrayList();
int result = 0;
for (int k = 0; k < 100_000; k++) {
result += sizePlusOne(arrayList);
}
System.out.println("result:" + result);
}
public static int sizePlusOne(Collection c) {
return c.size() + 1;
}
}
</pre>
So we loop 100k times and execute the 'sizePlusOne' method with an ArrayList.
<p/>
And we run it with the following settings to get the Assembly:
<pre>
-XX:+UnlockDiagnosticVMOptions
-XX:PrintAssemblyOptions=intel
-XX:-TieredCompilation
-XX:-Inline
-XX:-BackgroundCompilation
-XX:CompileCommand=print,*Devirtualization_MonoMorphic.sizePlusOne
</pre>
Inlining is disabled so we can focus on the content of the sizePlusOne method. This does not prevent inlining
of simple accessors like ArrayList/LinkedList.size; for more information see '-XX:-InlineAccessors'
(credits go to Andrew Haley for pointing this out). Background compilation is disabled so that compilation
isn't done asynchronously, and tiered compilation is disabled since for this blog post only the C2-emitted
Assembly will be discussed.
<p/>
When the program is run, the following Assembly is printed.
<pre>
CompilerOracle: print *Devirtualization_MonoMorphic.sizePlusOne
Compiled method (c2) 247 9 com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne (9 bytes)
total in heap [0x000000010be7e3d0,0x000000010be7e678] = 680
relocation [0x000000010be7e4f0,0x000000010be7e508] = 24
main code [0x000000010be7e520,0x000000010be7e580] = 96
stub code [0x000000010be7e580,0x000000010be7e598] = 24
oops [0x000000010be7e598,0x000000010be7e5a0] = 8
metadata [0x000000010be7e5a0,0x000000010be7e5b0] = 16
scopes data [0x000000010be7e5b0,0x000000010be7e5e0] = 48
scopes pcs [0x000000010be7e5e0,0x000000010be7e660] = 128
dependencies [0x000000010be7e660,0x000000010be7e668] = 8
nul chk table [0x000000010be7e668,0x000000010be7e678] = 16
Loaded disassembler from /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/hsdis-amd64.dylib
Decoding compiled method 0x000000010be7e3d0:
Code:
[Disassembling for mach='i386:x86-64']
[Entry Point]
[Verified Entry Point]
[Constants]
# {method} {0x000000011417a560} 'sizePlusOne' '(Ljava/util/Collection;)I' in 'com/devirtualization/Devirtualization_MonoMorphic'
# parm0: rsi:rsi = 'java/util/Collection'
# [sp+0x20] (sp of caller)
0x000000010be7e520: mov DWORD PTR [rsp-0x14000],eax
0x000000010be7e527: push rbp
0x000000010be7e528: sub rsp,0x10 ;*synchronization entry
; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@-1 (line 21)
0x000000010be7e52c: mov r11d,DWORD PTR [rsi+0x8] ; implicit exception: dispatches to 0x000000010be7e55d
0x000000010be7e530: cmp r11d,0xf8003231 ; {metadata('java/util/ArrayList')}
0x000000010be7e537: jne 0x000000010be7e54a ;*invokeinterface size
; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@1 (line 21)
0x000000010be7e539: mov eax,DWORD PTR [rsi+0x10]
0x000000010be7e53c: inc eax ;*iadd
; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@7 (line 21)
0x000000010be7e53e: add rsp,0x10
0x000000010be7e542: pop rbp
0x000000010be7e543: test DWORD PTR [rip+0xffffffffff6fdab7],eax # 0x000000010b57c000
; {poll_return}
0x000000010be7e549: ret
0x000000010be7e54a: mov rbp,rsi
0x000000010be7e54d: mov esi,0xffffffde
0x000000010be7e552: nop
0x000000010be7e553: call 0x000000010be47120 ; OopMap{rbp=Oop off=56}
;*invokeinterface size
; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@1 (line 21)
; {runtime_call}
0x000000010be7e558: call 0x000000010a9a2e44 ; {runtime_call}
0x000000010be7e55d: mov esi,0xfffffff6
0x000000010be7e562: nop
0x000000010be7e563: call 0x000000010be47120 ; OopMap{off=72}
;*invokeinterface size
; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@1 (line 21)
; {runtime_call}
0x000000010be7e568: call 0x000000010a9a2e44 ;*invokeinterface size
; - com.devirtualization.Devirtualization_MonoMorphic::sizePlusOne@1 (line 21)
; {runtime_call}
0x000000010be7e56d: hlt
0x000000010be7e56e: hlt
0x000000010be7e56f: hlt
0x000000010be7e570: hlt
0x000000010be7e571: hlt
0x000000010be7e572: hlt
0x000000010be7e573: hlt
0x000000010be7e574: hlt
0x000000010be7e575: hlt
0x000000010be7e576: hlt
0x000000010be7e577: hlt
0x000000010be7e578: hlt
0x000000010be7e579: hlt
0x000000010be7e57a: hlt
0x000000010be7e57b: hlt
0x000000010be7e57c: hlt
0x000000010be7e57d: hlt
0x000000010be7e57e: hlt
0x000000010be7e57f: hlt
[Exception Handler]
[Stub Code]
0x000000010be7e580: jmp 0x000000010be6bf60 ; {no_reloc}
[Deopt Handler Code]
0x000000010be7e585: call 0x000000010be7e58a
0x000000010be7e58a: sub QWORD PTR [rsp],0x5
0x000000010be7e58f: jmp 0x000000010be46d00 ; {runtime_call}
0x000000010be7e594: hlt
0x000000010be7e595: hlt
0x000000010be7e596: hlt
0x000000010be7e597: hlt
OopMapSet contains 2 OopMaps
#0
OopMap{rbp=Oop off=56}
#1
OopMap{off=72}
</pre>
That is a lot of output. Let's remove everything that isn't relevant and add some comments:
<pre>
0x000000010be7e52c: mov r11d,DWORD PTR [rsi+0x8]
;; Loads the address of the class of 'c' into register r11d
0x000000010be7e530: cmp r11d,0xf8003231
;; Compare that address with the address of the ArrayList class
0x000000010be7e537: jne 0x000000010be7e54a
;; if the classes are different, jump to an uncommon trap
0x000000010be7e539: mov eax,DWORD PTR [rsi+0x10]
;; copies the size field of the ArrayList object into eax
0x000000010be7e53c: inc eax ;*iadd
;; Add one to eax.
</pre>
If we convert this to pseudo-Java code, we get something like this:
<pre>
if(c.class != ArrayList.class){
    uncommonTrap();
}
int result = c.size;
result++;
</pre>
<h2>Conclusion</h2>
With only a single concrete type at the call site, the Assembly contains just a simple type-guard followed by
a static dispatch. So the dynamic dispatch is replaced by a static dispatch, and inlining of
the method becomes possible.
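For reference, the driver that produced this monomorphic call site looks roughly like this. This is a reconstruction based on the class and method names in the assembly above and on the bimorphic example below; only a single collection type ever reaches the call site:
<pre>
import java.util.ArrayList;
import java.util.Collection;

public class Devirtualization_MonoMorphic {
    public static void main(String[] args) {
        ArrayList arrayList = new ArrayList();
        int result = 0;
        for (int k = 0; k < 100_000; k++) {
            // Only ArrayList is ever seen here, so the type profile
            // of the call site stays monomorphic.
            result += sizePlusOne(arrayList);
        }
        System.out.println("result:" + result);
    }

    public static int sizePlusOne(Collection c) {
        return c.size() + 1;
    }
}
</pre>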
<h1>Bimorphic callsite</h1>
The previous example has only one concrete type at the callsite. But what happens when a second
type is added?
<pre>
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedList;

public class Devirtualization_BiMorphic {
public static void main(String[] args) {
ArrayList arrayList = new ArrayList();
LinkedList linkedList = new LinkedList();
int result = 0;
for (int k = 0; k < 100_000; k++) {
result += sizePlusOne(arrayList);
}
for (int k = 0; k < 100_000; k++) {
result += sizePlusOne(linkedList);
}
System.out.println("result:" + result);
}
public static int sizePlusOne(Collection c) {
return c.size() + 1;
}
}
</pre>
And we run with the following settings:
<pre>
-XX:+UnlockDiagnosticVMOptions
-XX:PrintAssemblyOptions=intel
-XX:-TieredCompilation
-XX:-Inline
-XX:-BackgroundCompilation
-XX:CompileCommand=print,*Devirtualization_BiMorphic.sizePlusOne
</pre>
First the Assembly of the monomorphic version is emitted; as soon as the LinkedList is encountered,
the uncommon trap is executed and eventually new Assembly is emitted:
<pre>
Compiled method (c2) 291 11 com.devirtualization.Devirtualization_BiMorphic::sizePlusOne (9 bytes)
total in heap [0x000000010c339510,0x000000010c3397f8] = 744
relocation [0x000000010c339630,0x000000010c339648] = 24
main code [0x000000010c339660,0x000000010c3396c0] = 96
stub code [0x000000010c3396c0,0x000000010c3396d8] = 24
oops [0x000000010c3396d8,0x000000010c3396e0] = 8
metadata [0x000000010c3396e0,0x000000010c339700] = 32
scopes data [0x000000010c339700,0x000000010c339730] = 48
scopes pcs [0x000000010c339730,0x000000010c3397e0] = 176
dependencies [0x000000010c3397e0,0x000000010c3397e8] = 8
nul chk table [0x000000010c3397e8,0x000000010c3397f8] = 16
Decoding compiled method 0x000000010c339510:
Code:
[Entry Point]
[Verified Entry Point]
[Constants]
# {method} {0x00000001146395f0} 'sizePlusOne' '(Ljava/util/Collection;)I' in 'com/devirtualization/Devirtualization_BiMorphic'
# parm0: rsi:rsi = 'java/util/Collection'
# [sp+0x20] (sp of caller)
0x000000010c339660: mov DWORD PTR [rsp-0x14000],eax
0x000000010c339667: push rbp
0x000000010c339668: sub rsp,0x10 ;*synchronization entry
; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@-1 (line 28)
0x000000010c33966c: mov r10d,DWORD PTR [rsi+0x8] ; implicit exception: dispatches to 0x000000010c3396ad
0x000000010c339670: cmp r10d,0xf8003231 ; {metadata('java/util/ArrayList')}
0x000000010c339677: je 0x000000010c339687
0x000000010c339679: cmp r10d,0xf8008c83 ; {metadata('java/util/LinkedList')}
0x000000010c339680: jne 0x000000010c339698 ;*invokeinterface size
; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)
0x000000010c339682: mov eax,DWORD PTR [rsi+0x10] ;*getfield size
; - java.util.LinkedList::size@1 (line 326)
; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)
0x000000010c339685: jmp 0x000000010c33968a ;*invokeinterface size
; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)
0x000000010c339687: mov eax,DWORD PTR [rsi+0x10] ;*synchronization entry
; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@-1 (line 28)
0x000000010c33968a: inc eax ;*iadd
; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@7 (line 28)
0x000000010c33968c: add rsp,0x10
0x000000010c339690: pop rbp
0x000000010c339691: test DWORD PTR [rip+0xfffffffffe841969],eax # 0x000000010ab7b000
; {poll_return}
0x000000010c339697: ret
0x000000010c339698: mov rbp,rsi
0x000000010c33969b: mov esi,0xffffffc6
0x000000010c3396a0: data32 xchg ax,ax
0x000000010c3396a3: call 0x000000010c306120 ; OopMap{rbp=Oop off=72}
;*invokeinterface size
; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)
; {runtime_call}
0x000000010c3396a8: call 0x000000010b840e44 ; {runtime_call}
0x000000010c3396ad: mov esi,0xfffffff6
0x000000010c3396b2: nop
0x000000010c3396b3: call 0x000000010c306120 ; OopMap{off=88}
;*invokeinterface size
; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)
; {runtime_call}
0x000000010c3396b8: call 0x000000010b840e44 ;*invokeinterface size
; - com.devirtualization.Devirtualization_BiMorphic::sizePlusOne@1 (line 28)
; {runtime_call}
0x000000010c3396bd: hlt
0x000000010c3396be: hlt
0x000000010c3396bf: hlt
[Exception Handler]
[Stub Code]
0x000000010c3396c0: jmp 0x000000010c32af60 ; {no_reloc}
[Deopt Handler Code]
0x000000010c3396c5: call 0x000000010c3396ca
0x000000010c3396ca: sub QWORD PTR [rsp],0x5
0x000000010c3396cf: jmp 0x000000010c305d00 ; {runtime_call}
0x000000010c3396d4: hlt
0x000000010c3396d5: hlt
0x000000010c3396d6: hlt
0x000000010c3396d7: hlt
OopMapSet contains 2 OopMaps
#0
OopMap{rbp=Oop off=72}
#1
OopMap{off=88}
</pre>
Again a lot of output. Let's trim it down a bit:
<pre>
0x000000010c33966c: mov r10d,DWORD PTR [rsi+0x8]
;; Loads the address of the class of 'c' into register r10d
0x000000010c339670: cmp r10d,0xf8003231
;; Compare that address with the address of the ArrayList class
0x000000010c339677: je 0x000000010c339687
;; if it is an ArrayList, jump to 0x000000010c339687
0x000000010c339679: cmp r10d,0xf8008c83
;; it was not an ArrayList, so let's compare with the address of the LinkedList class
0x000000010c339680: jne 0x000000010c339698
;; if not, jump to the uncommon trap
0x000000010c339682: mov eax,DWORD PTR [rsi+0x10]
;; copies the size field of the LinkedList instance into eax
0x000000010c339685: jmp 0x000000010c33968a
0x000000010c339687: mov eax,DWORD PTR [rsi+0x10] ;*synchronization entry
;; copies the size field of the ArrayList instance into eax
0x000000010c33968a: inc eax
;; Add one to eax.
</pre>
If we convert this back to pseudo-Java, we get something like this:
<pre>
int result;
if(c.class == ArrayList.class){
result = c.size;
} else if(c.class == LinkedList.class){
result = c.size;
} else {
uncommonTrap();
}
result++;
</pre>
<h2>Conclusion</h2>
In the case of a bimorphic callsite, we don't pay the price of a dynamic dispatch
and inlining isn't prevented. However, we do get one type-guard for one of the types and two
type-guards for the other.
<h1>Megamorphic callsite</h1>
Let's add a third type of Collection and see what happens:
<pre>
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedList;

public class Devirtualization_MegaMorphic {
public static void main(String[] args) {
ArrayList arrayList = new ArrayList();
LinkedList linkedList = new LinkedList();
HashSet hashSet = new HashSet();
int result = 0;
for (int k = 0; k < 100_000; k++) {
result += sizePlusOne(arrayList);
}
for (int k = 0; k < 100_000; k++) {
result += sizePlusOne(linkedList);
}
for (int k = 0; k < 100_000; k++) {
result += sizePlusOne(hashSet);
}
System.out.println("result:" + result);
}
public static int sizePlusOne(Collection c) {
return c.size() + 1;
}
}
</pre>
If we run with the following settings:
<pre>
-XX:+UnlockDiagnosticVMOptions
-XX:PrintAssemblyOptions=intel
-XX:-TieredCompilation
-XX:-Inline
-XX:-BackgroundCompilation
-XX:CompileCommand=print,*Devirtualization_MegaMorphic.sizePlusOne
</pre>
Then the following Assembly is eventually emitted:
<pre>
Compiled method (c2) 268 14 com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne (9 bytes)
total in heap [0x000000010ea7d550,0x000000010ea7d788] = 568
relocation [0x000000010ea7d670,0x000000010ea7d680] = 16
main code [0x000000010ea7d680,0x000000010ea7d6c0] = 64
stub code [0x000000010ea7d6c0,0x000000010ea7d6d8] = 24
oops [0x000000010ea7d6d8,0x000000010ea7d6e0] = 8
metadata [0x000000010ea7d6e0,0x000000010ea7d6e8] = 8
scopes data [0x000000010ea7d6e8,0x000000010ea7d708] = 32
scopes pcs [0x000000010ea7d708,0x000000010ea7d768] = 96
dependencies [0x000000010ea7d768,0x000000010ea7d770] = 8
handler table [0x000000010ea7d770,0x000000010ea7d788] = 24
Decoding compiled method 0x000000010ea7d550:
Code:
[Entry Point]
[Verified Entry Point]
[Constants]
# {method} {0x0000000116d7a678} 'sizePlusOne' '(Ljava/util/Collection;)I' in 'com/devirtualization/Devirtualization_MegaMorphic'
# parm0: rsi:rsi = 'java/util/Collection'
# [sp+0x20] (sp of caller)
0x000000010ea7d680: mov DWORD PTR [rsp-0x14000],eax
0x000000010ea7d687: push rbp
0x000000010ea7d688: sub rsp,0x10 ;*synchronization entry
; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@-1 (line 33)
0x000000010ea7d68c: nop
0x000000010ea7d68d: movabs rax,0xffffffffffffffff
0x000000010ea7d697: call 0x000000010ea45f60 ; OopMap{off=28}
;*invokeinterface size
; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@1 (line 33)
; {virtual_call}
0x000000010ea7d69c: inc eax ;*iadd
; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@7 (line 33)
0x000000010ea7d69e: add rsp,0x10
0x000000010ea7d6a2: pop rbp
0x000000010ea7d6a3: test DWORD PTR [rip+0xffffffffff742957],eax # 0x000000010e1c0000
; {poll_return}
0x000000010ea7d6a9: ret ;*invokeinterface size
; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@1 (line 33)
0x000000010ea7d6aa: mov rsi,rax
0x000000010ea7d6ad: add rsp,0x10
0x000000010ea7d6b1: pop rbp
0x000000010ea7d6b2: jmp 0x000000010ea6ece0 ; {runtime_call}
0x000000010ea7d6b7: hlt
0x000000010ea7d6b8: hlt
0x000000010ea7d6b9: hlt
0x000000010ea7d6ba: hlt
0x000000010ea7d6bb: hlt
0x000000010ea7d6bc: hlt
0x000000010ea7d6bd: hlt
0x000000010ea7d6be: hlt
0x000000010ea7d6bf: hlt
[Exception Handler]
[Stub Code]
0x000000010ea7d6c0: jmp 0x000000010ea6bf60 ; {no_reloc}
[Deopt Handler Code]
0x000000010ea7d6c5: call 0x000000010ea7d6ca
0x000000010ea7d6ca: sub QWORD PTR [rsp],0x5
0x000000010ea7d6cf: jmp 0x000000010ea46d00 ; {runtime_call}
0x000000010ea7d6d4: hlt
0x000000010ea7d6d5: hlt
0x000000010ea7d6d6: hlt
0x000000010ea7d6d7: hlt
</pre>
Again a lot of output. If we extract only the relevant part, we get:
<pre>
0x000000010ea7d68d: movabs rax,0xffffffffffffffff
0x000000010ea7d697: call 0x000000010ea45f60 ; OopMap{off=28}
;*invokeinterface size
; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@1 (line 33)
; {virtual_call}
0x000000010ea7d69c: inc eax ;*iadd
; - com.devirtualization.Devirtualization_MegaMorphic::sizePlusOne@7 (line 33)
</pre>
The code above contains the expensive virtual call to Collection.size.
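Translated back to pseudo-Java, in line with the earlier snippets, there is nothing left to guard; what remains is a plain interface call:
<pre>
int result = c.size();
result++;
</pre>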
<p/>
<h2>Conclusion</h2>
The JIT has given up on devirtualizing the callsite after it has encountered 3 different implementations.
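A call site with many implementations can still be kept cheap for the hot type by peeling that type off manually. This is not from the original post, just a sketch of a common workaround:
<pre>
import java.util.ArrayList;
import java.util.Collection;

public class ManualPeeling {
    public static int sizePlusOne(Collection c) {
        // Manual guard: the ArrayList branch contains a monomorphic
        // call site that the JIT can devirtualize and inline on its own.
        if (c instanceof ArrayList) {
            return ((ArrayList) c).size() + 1;
        }
        // Everything else still pays for the megamorphic dispatch.
        return c.size() + 1;
    }
}
</pre>
Whether this pays off depends on how dominant the peeled type actually is, so it should be guided by profiling.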
<h1>Recommended reads</h1>
I would start out with <a href="https://shipilev.net/blog/2015/black-magic-method-dispatch/">this</a> excellent post by Aleksey Shipilëv. He goes into much more detail.
<p/>
A very good <a href="https://dzone.com/articles/too-fast-too-megamorphic-what">post</a> by Richard Warburton.
<p/>
And a very informative <a href="https://mechanical-sympathy.blogspot.bg/2012/04/invoke-interface-optimisations.html">post</a> by Martin Thompson.
pveentjerhttp://www.blogger.com/profile/17847641595368096163noreply@blogger.com1tag:blogger.com,1999:blog-725906748590083291.post-50322614051028747042017-01-21T11:22:00.002-08:002017-01-21T12:27:28.935-08:00Assert & Bytecode & AssemblyThis is the first of a series of posts showing how certain language features and compiler optimisations are implemented on the JVM, both at the bytecode level and in the Assembly emitted by the JIT. There are many blogposts/articles/presentations stating that these optimisations are done, but they don't explain what is happening under the hood. So if you don't like to get your hands dirty, please move along.
<p/>
I have been inspired to start blogging again by people like Gil Tene, Martin Thompson, Nitsan Wakart,
Peter Lawrey, Cliff Click and many others. They provide so many insights into low-level performance issues to the community that I want to contribute as well.
<h1>Java assert</h1>
Java asserts were added in Java 1.4 and are an easy way to check preconditions, postconditions
and invariants. A nice feature of asserts is that they can easily be enabled with `-ea` or disabled
with `-da`, either globally or at the package or class level. For more information see the following
<a href="https://docs.oracle.com/cd/E19683-01/806-7930/assert-4/index.html">page</a>.
<p/>
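The granularity looks like this (the package and class names below are just illustrations):
<pre>
-ea                        enable asserts for all user classes
-ea:com.asserts...         enable for a package and its subpackages
-da:com.asserts.Assert     disable for a single class
-esa                       enable asserts in JDK system classes
</pre>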
Let's have a look at the following example:
<pre>
public class Assert {
public static void main(String[] args) {
long l = 0;
for (int k = 1; k < 100_000; k++) { // start at 1 so the assert never fires when enabled
l += twice(k);
}
System.out.println(l);
}
public static int twice(int a) {
assert a != 0 : "a can't be 0";
return a * 2;
}
}
</pre>
We have a loop that does a simple calculation. We store the result and print it, so that the JIT doesn't apply dead code elimination. In the 'twice' method there is a simple assert that checks that 'a' is not 0. The assert condition isn't very important; its primary function is to produce easy-to-understand Assembly.
<p/>
If we compile the code and dump the content of the class using 'javap -c Assert.class', we get the following bytecode:
<pre>
public class com.asserts.Assert {
static final boolean $assertionsDisabled;
public com.asserts.Assert();
Code:
0: aload_0
1: invokespecial #1 // Method java/lang/Object."<init>":()V
4: return
public static void main(java.lang.String[]);
Code:
0: lconst_0
1: lstore_1
2: iconst_1
3: istore_3
4: iload_3
5: ldc #2 // int 100000
7: if_icmpge 24
10: lload_1
11: iload_3
12: invokestatic #3 // Method twice:(I)I
15: i2l
16: ladd
17: lstore_1
18: iinc 3, 1
21: goto 4
24: getstatic #4 // Field java/lang/System.out:Ljava/io/PrintStream;
27: lload_1
28: invokevirtual #5 // Method java/io/PrintStream.println:(J)V
31: return
public static int twice(int);
Code:
0: getstatic #6 // Field $assertionsDisabled:Z
3: ifne 20
6: iload_0
7: ifeq 20
10: new #7 // class java/lang/AssertionError
13: dup
14: ldc #8 // String a can't be 0
16: invokespecial #9 // Method java/lang/AssertionError."<init>":(Ljava/lang/Object;)V
19: athrow
20: iload_0
21: iconst_2
22: imul
23: ireturn
static {};
Code:
0: ldc #10 // class com/asserts/Assert
2: invokevirtual #11 // Method java/lang/Class.desiredAssertionStatus:()Z
5: ifne 12
8: iconst_1
9: goto 13
12: iconst_0
13: putstatic #6 // Field $assertionsDisabled:Z
16: return
}
</pre>
It is interesting to see how the assert has been implemented. A synthetic static final field `$assertionsDisabled` has been added to the class, which gets initialized when the class is loaded; see the 'static {}' block for the details.
<p/>
The first instruction of the `twice` method loads this static field, and then there is a check on it. If `$assertionsDisabled` is false, execution continues with the assert condition at #6. If `$assertionsDisabled` is true (so asserts are disabled), there is a jump to the actual logic at #20.
<p/>
If we convert the 'twice' bytecode back to Java, we would get:
<pre>
public class Assert {
static final boolean $assertionsDisabled;
public static int twice(int a) {
if(!$assertionsDisabled){
if(a == 0) {
throw new AssertionError("a can't be 0");
}
}
return a * 2;
}
}
</pre>
So the assert doesn't require any special new bytecode instructions. It is translated to a few simple checks.
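The same mechanism is visible from plain Java: `Class.desiredAssertionStatus()` is exactly the call the synthetic static initializer makes. A small sketch (the class name is just an illustration):
<pre>
public class AssertStatus {
    public static void main(String[] args) {
        // Reports whether asserts were requested for this class at JVM
        // launch; this is what gets stored (negated) in $assertionsDisabled.
        System.out.println(AssertStatus.class.desiredAssertionStatus());
    }
}
</pre>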
<h1>Assert disabled</h1>
As a performance engineer, I'm always worried about paying the price for something that isn't used. In this case, on the bytecode level, there is one extra branch to deal with when assertions are disabled and two branches when they are enabled. So far I'm not terribly excited, since branching and high-performance code are not a good mixture.
<p/>
So let's see if the JIT can optimize it away when assertions are disabled. For this I'm using the following JVM options:
<pre>
-XX:+UnlockDiagnosticVMOptions
-XX:PrintAssemblyOptions=intel
-XX:-TieredCompilation
-XX:-Inline
-XX:CompileCommand=print,*Assert.twice
</pre>
The TieredCompilation is disabled so we only get the C2 Assembly output. Inlining is disabled so that the 'twice' method doesn't get inlined in the main loop.
<p/>
This provides us the following output:
<pre>
Compiled method (c2) 190 8 com.asserts.Assert::twice (24 bytes)
total in heap [0x0000000101d7f310,0x0000000101d7f4e0] = 464
relocation [0x0000000101d7f430,0x0000000101d7f438] = 8
main code [0x0000000101d7f440,0x0000000101d7f460] = 32
stub code [0x0000000101d7f460,0x0000000101d7f478] = 24
oops [0x0000000101d7f478,0x0000000101d7f480] = 8
metadata [0x0000000101d7f480,0x0000000101d7f488] = 8
scopes data [0x0000000101d7f488,0x0000000101d7f498] = 16
scopes pcs [0x0000000101d7f498,0x0000000101d7f4d8] = 64
dependencies [0x0000000101d7f4d8,0x0000000101d7f4e0] = 8
Loaded disassembler from /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/hsdis-amd64.dylib
Decoding compiled method 0x0000000101d7f310:
Code:
[Disassembling for mach='i386:x86-64']
[Entry Point]
[Verified Entry Point]
[Constants]
# {method} {0x00000001ca2ba4f8} 'twice' '(I)I' in 'com/asserts/Assert'
# parm0: rsi = int
# [sp+0x20] (sp of caller)
0x0000000101d7f440: sub rsp,0x18
0x0000000101d7f447: mov QWORD PTR [rsp+0x10],rbp ;*synchronization entry
; - com.asserts.Assert::twice@-1 (line 14)
0x0000000101d7f44c: mov eax,esi
0x0000000101d7f44e: shl eax,1 ;*imul
; - com.asserts.Assert::twice@22 (line 15)
0x0000000101d7f450: add rsp,0x10
0x0000000101d7f454: pop rbp
0x0000000101d7f455: test DWORD PTR [rip+0xfffffffffe87fba5],eax # 0x00000001005ff000
; {poll_return}
0x0000000101d7f45b: ret
0x0000000101d7f45c: hlt
0x0000000101d7f45d: hlt
0x0000000101d7f45e: hlt
0x0000000101d7f45f: hlt
[Exception Handler]
[Stub Code]
0x0000000101d7f460: jmp 0x0000000101d70f60 ; {no_reloc}
[Deopt Handler Code]
0x0000000101d7f465: call 0x0000000101d7f46a
0x0000000101d7f46a: sub QWORD PTR [rsp],0x5
0x0000000101d7f46f: jmp 0x0000000101d4bd00 ; {runtime_call}
0x0000000101d7f474: hlt
0x0000000101d7f475: hlt
0x0000000101d7f476: hlt
0x0000000101d7f477: hlt
OopMapSet contains 0 OopMaps
</pre>
There is quite a lot of output. If we only focus on the logic of the method and strip all ceremony,
the following instructions remain:
<pre>
0x0000000101d7f44c: mov eax,esi
0x0000000101d7f44e: shl eax,1 ;*imul
; - com.asserts.Assert::twice@22 (line 15)
</pre>
The 'a' argument (stored in register esi) is copied into the eax register. Then eax is multiplied by two by shifting its bits one position to the left; the eax register holds the return value.
Replacing the multiply by a shift left is a peephole optimization, since bit shifting is cheaper than a multiply.
The conclusion we can make is that if assert is disabled, the JIT completely removes the '$assertionsDisabled' check. That is good news; I hate to pay the price for something not used.
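The shift transformation is safe for every int, including negative values and overflow, because both forms wrap identically in two's complement; a quick check:
<pre>
public class ShiftVsMultiply {
    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 21, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int a : samples) {
            // a * 2 and a << 1 produce the same bit pattern for every int
            if ((a * 2) != (a << 1)) {
                throw new AssertionError("mismatch for " + a);
            }
        }
        System.out.println("a * 2 == a << 1 for all samples");
    }
}
</pre>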
<h1>Assert enabled</h1>
What if assertions are enabled? For this we run with the following settings
<pre>
-XX:+UnlockDiagnosticVMOptions
-XX:PrintAssemblyOptions=intel
-XX:-TieredCompilation
-XX:-Inline
-XX:-BackgroundCompilation
-XX:CompileCommand=print,*Assert.twice
-ea
</pre>
<p/>
And we get the following Assembly
<pre>
Compiled method (c2) 168 8 com.asserts.Assert::twice (24 bytes)
total in heap [0x0000000109e7c310,0x0000000109e7c548] = 568
relocation [0x0000000109e7c430,0x0000000109e7c440] = 16
main code [0x0000000109e7c440,0x0000000109e7c480] = 64
stub code [0x0000000109e7c480,0x0000000109e7c498] = 24
oops [0x0000000109e7c498,0x0000000109e7c4a0] = 8
metadata [0x0000000109e7c4a0,0x0000000109e7c4a8] = 8
scopes data [0x0000000109e7c4a8,0x0000000109e7c4d0] = 40
scopes pcs [0x0000000109e7c4d0,0x0000000109e7c540] = 112
dependencies [0x0000000109e7c540,0x0000000109e7c548] = 8
Loaded disassembler from /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/hsdis-amd64.dylib
Decoding compiled method 0x0000000109e7c310:
Code:
[Disassembling for mach='i386:x86-64']
[Entry Point]
[Verified Entry Point]
[Constants]
# {method} {0x000000011217a4f8} 'twice' '(I)I' in 'com/asserts/Assert'
# parm0: rsi = int
# [sp+0x20] (sp of caller)
0x0000000109e7c440: mov DWORD PTR [rsp-0x14000],eax
0x0000000109e7c447: push rbp
0x0000000109e7c448: sub rsp,0x10 ;*synchronization entry
; - com.asserts.Assert::twice@-1 (line 14)
0x0000000109e7c44c: test esi,esi
0x0000000109e7c44e: je 0x0000000109e7c460 ;*ifne
; - com.asserts.Assert::twice@7 (line 14)
0x0000000109e7c450: mov eax,esi
0x0000000109e7c452: shl eax,1 ;*imul
; - com.asserts.Assert::twice@22 (line 15)
0x0000000109e7c454: add rsp,0x10
0x0000000109e7c458: pop rbp
0x0000000109e7c459: test DWORD PTR [rip+0xffffffffff6ecba1],eax # 0x0000000109569000
; {poll_return}
0x0000000109e7c45f: ret
0x0000000109e7c460: mov esi,0x7
0x0000000109e7c465: xchg ax,ax
0x0000000109e7c467: call 0x0000000109e47120 ; OopMap{off=44}
;*new ; - com.asserts.Assert::twice@10 (line 14)
; {runtime_call}
0x0000000109e7c46c: call 0x000000010898fe44 ;*new
; - com.asserts.Assert::twice@10 (line 14)
; {runtime_call}
0x0000000109e7c471: hlt
0x0000000109e7c472: hlt
0x0000000109e7c473: hlt
0x0000000109e7c474: hlt
0x0000000109e7c475: hlt
0x0000000109e7c476: hlt
0x0000000109e7c477: hlt
0x0000000109e7c478: hlt
0x0000000109e7c479: hlt
0x0000000109e7c47a: hlt
0x0000000109e7c47b: hlt
0x0000000109e7c47c: hlt
0x0000000109e7c47d: hlt
0x0000000109e7c47e: hlt
0x0000000109e7c47f: hlt
[Exception Handler]
[Stub Code]
0x0000000109e7c480: jmp 0x0000000109e6bf60 ; {no_reloc}
[Deopt Handler Code]
0x0000000109e7c485: call 0x0000000109e7c48a
0x0000000109e7c48a: sub QWORD PTR [rsp],0x5
0x0000000109e7c48f: jmp 0x0000000109e46d00 ; {runtime_call}
0x0000000109e7c494: hlt
0x0000000109e7c495: hlt
0x0000000109e7c496: hlt
0x0000000109e7c497: hlt
OopMapSet contains 1 OopMaps
#0
OopMap{off=44}
</pre>
Lets remove a lot of the clutter:
<pre>
0x0000000109e7c44c: test esi,esi
0x0000000109e7c44e: je 0x0000000109e7c460 ;*ifne
; - com.asserts.Assert::twice@7 (line 14)
0x0000000109e7c450: mov eax,esi
0x0000000109e7c452: shl eax,1 ;*imul
; - com.asserts.Assert::twice@22 (line 15)
0x0000000109e7c454: add rsp,0x10
0x0000000109e7c458: pop rbp
0x0000000109e7c459: test DWORD PTR [rip+0xffffffffff6ecba1],eax # 0x0000000109569000
; {poll_return}
0x0000000109e7c45f: ret
0x0000000109e7c460: mov esi,0x7
0x0000000109e7c465: xchg ax,ax
0x0000000109e7c467: call 0x0000000109e47120 ; OopMap{off=44}
;*new ; - com.asserts.Assert::twice@10 (line 14)
; {runtime_call}
0x0000000109e7c46c: call 0x000000010898fe44 ;*new
; - com.asserts.Assert::twice@10 (line 14)
; {runtime_call}
</pre>
This one is a bit more complicated. Let's start at `0x0000000109e7c44c`, where a 'test esi,esi' is done. 'test esi,esi' checks if 'a' is 0 and updates ZF (the Zero Flag) in the flags register. If ZF is 0 (so a != 0), we continue with the regular logic: 'a' is moved into the 'eax' register, multiplied by 2, and then the procedure exit ceremony is executed. This is no different than when assert is disabled.
<p/>
If ZF is 1 (so a == 0), then we jump to `0x0000000109e7c460`. We totally bypass the regular logic and even fail to execute the exit ceremony of the procedure. Everything we have done so far is frozen and then we shoot into space. The mechanism we are seeing here is called the uncommon trap, and it is required for speculative optimizations. In this case we have never called `twice` with 0, so for the JIT there was no reason to emit the 'throw new AssertionError' code. As soon as I have figured out the handling of this uncommon trap, I'll create a blogpost about it.
<p/>
If we translate this back to Java, we get something like:
<pre>
public class Assert {
public static int twice(int a) {
if(a == 0) {
uncommonTrap();
}
return a * 2;
}
}
</pre>
If 'twice' were called with 0, the uncommon trap would be executed, the code deoptimized, and the interpreter would continue execution with the 'throw new AssertionError' in place.
<p/>
The conclusion we can make is: even when assert is enabled, the '$assertionsDisabled' check is removed; only the check of the assert condition itself remains.
<h1>Size matters</h1>
So far we have seen that the JIT completely removes all assert logic when assert is disabled. So it looks like assert is completely free if disabled. Unfortunately, that isn't the case: the extra instructions inserted in the bytecode can prevent inlining, because the inlining limits are determined by the size of the bytecode.
<p/>
Let's prove this by inflating the assert logic with some bogus additional conditions:
<pre>
public class AssertFat {
public static void main(String[] args) {
long l = 0;
for (int k = 1; k < 100_000; k++) {
l += twice(k);
}
System.out.println(l);
}
public static int twice(int a) {
assert a != 0 && a != -1 && a != -2 && a != -3 : "bad a";
return a * 2;
}
}
</pre>
If we set FreqInlineSize to 50, the 'twice' method will not get inlined, even when asserts are disabled:
<pre>
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintInlining
-XX:FreqInlineSize=50
-da
</pre>
We'll see the following in the inlining output:
<pre>
@ 12 com.asserts.AssertFat::twice (53 bytes) callee is too large
</pre>
This means that the `twice` method, at 53 bytes, was too fat to get inlined, since the maximum size for a frequently called method was set to 50 bytes.
<p/>
To prove that the 'twice' method would have been inlined without the assert, let's remove the assert and run again:
<pre>
@ 12 com.asserts.AssertFat::twice (4 bytes)
</pre>
Now the method does get inlined.
<p/>
This means that even though the JIT is able to completely remove the assert code when assert is disabled, the assert can still prevent inlining. Therefore the assert feature isn't completely free. That doesn't mean you should remove asserts from your code; but it is good to keep in the back of your mind.
pveentjerhttp://www.blogger.com/profile/17847641595368096163noreply@blogger.com1tag:blogger.com,1999:blog-725906748590083291.post-73245296898334140022015-01-04T23:56:00.003-08:002015-01-05T00:00:27.742-08:00JMH and printing AssemblyIn preparation for a blogpost I'm working on, I need access to the assembly generated from a JMH benchmark.<br />
<br />
A few resources are available how to print the assembly for a normal application, e.g.<br />
<ul>
<li><a href="https://wikis.oracle.com/display/HotSpotInternals/PrintAssembly">Oracle PrintAssembly Page</a> </li>
<li><a href="http://mechanical-sympathy.blogspot.com/2013/06/printing-generated-assembly-code-from.html">Printing Generated Assembly Code From The Hotspot JIT Compiler</a></li>
</ul>
But in the case of JMH it is a bit more tricky, since JMH forks a new JVM and the arguments need to be applied to that forked JVM, not to the JVM of JMH itself.<br />
<br />
Below is a copy/paste-ready example of how to configure JMH to print the assembly of the methods you are interested in:<br />
<pre>
package org.sample;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import java.util.Random;
@State(Scope.Thread)
public class SomeBenchmark {
public static final int ARRAY_SIZE = 1000 * 1000 * 100;
private int[] array;
@Setup
public void setup() {
array = new int[ARRAY_SIZE];
Random random = new Random();
for (int k = 0; k < array.length; k++) {
array[k] = random.nextInt();
}
}
@Benchmark
public int benchmarkFoo() {
int[] array = this.array;
int sum = 0;
int length = array.length;
for (int k = 0; k < length; k++) {
sum += array[k];
}
return sum;
}
public static void main(String... args) throws Exception {
Options opts = new OptionsBuilder()
.include(".*")
.warmupIterations(2)
.operationsPerInvocation(ARRAY_SIZE)
.measurementIterations(5)
.jvmArgs("-server",
"-XX:+UnlockDiagnosticVMOptions",
"-XX:PrintAssemblyOptions=intel",
"-XX:CompileCommand=print,*SomeBenchmark.benchmark*")
.forks(1)
.build();
new Runner(opts).run();
}
}
</pre>
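As an alternative to the OptionsBuilder, the forked-JVM arguments can also be attached with JMH's @Fork annotation via its jvmArgsAppend element. The class below is a trimmed sketch, not a complete benchmark project:
<pre>
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;

public class SomeBenchmark {
    // jvmArgsAppend is applied to the forked JVM, which is the JVM
    // that actually runs and JIT-compiles the benchmark method.
    @Fork(value = 1, jvmArgsAppend = {
            "-XX:+UnlockDiagnosticVMOptions",
            "-XX:PrintAssemblyOptions=intel",
            "-XX:CompileCommand=print,*SomeBenchmark.benchmark*"})
    @Benchmark
    public int benchmarkFoo() {
        int sum = 0;
        for (int k = 0; k < 1_000; k++) {
            sum += k;
        }
        return sum;
    }
}
</pre>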
I'm only interested in the benchmarkFoo method, so only the assembly generated for that particular method is printed. I'm using a benchmark* wildcard, so I can easily add other benchmark methods and have their assembly printed.pveentjerhttp://www.blogger.com/profile/17847641595368096163noreply@blogger.com0