
Introduction
Java’s bytecode is deliberately abstract, making it portable and independent of any specific hardware. Yet in production, it performs with the precision of native code. How? The secret lies in the JIT compiler, a near-magical blend of static analysis and real-time insights. It reinterprets, restructures and often reinvents the code based on runtime behaviour. This article traces that silent transformation using the Semeru (OpenJ9) JIT, following a method from source to bytecode to intermediate representation and ultimately to highly optimised machine code. You will be amazed to see how your logic is optimised beyond recognition. What you write is not always what runs, and what runs is often a much better version of what you imagined.
The JIT plot
The JIT compiler speeds up Java programs by translating bytecode into native machine code at runtime. Methods are not compiled right away. They start in the interpreter, and as the JVM tracks their usage, it classifies them as cold, warm, hot or scorching. This ensures heavy optimization is applied only to methods that are performance critical. Once a method is chosen, its bytecode is transformed into an internal tree-like representation, closer to machine code. The JIT compiler then optimizes these trees, for example by simplifying expressions, removing redundant work, improving loops and analyzing value usage, before finally emitting highly optimized native code for execution.
Pre-requisite learning
Before diving into the topic of JIT compiler optimizations, let us quickly learn a few foundational concepts. These include how the JVM runs code via the interpreter, what bytecode is and how it is specified, the role of the Java compiler in producing that bytecode, how JIT compilation accelerates execution and the idea of basic blocks that form the building blocks for optimization.
JVM Interpreter: executes Java bytecode instructions one by one (like a virtual CPU) when JIT is not used. Reference: https://docs.oracle.com/javase/specs/jvms/se21/html/jvms-2.html
Bytecode Specification: defines the exact format and behavior of JVM bytecode instructions so they are portable across all JVMs. Reference: https://docs.oracle.com/javase/specs/jvms/se21/html/jvms-6.html
Java Compiler (javac): translates Java source code into JVM bytecode that can be executed by the JVM. Reference: https://docs.oracle.com/javase/specs/jls/se21/html/index.html
JIT Compilation: Just-In-Time compiler translates hot (frequently used) bytecode into native machine code for faster execution. Reference: https://www.ibm.com/docs/en/sdk-java-technology/8?topic=reference-jit-compiler
Basic Blocks: a straight-line code sequence with no jumps except into the block at the beginning and out at the end; the fundamental unit for optimizations. Reference: https://en.wikipedia.org/wiki/Basic_block
The JIT in action
In this article, we use a simple Java method and highlight ten of the many optimizations it undergoes during compilation, demonstrating how the bytecode changes in the process. Covering all optimizations would require far more space, so we focus on selected examples. For each optimization, we show the bytecode before and after, along with a brief explanation. In addition, for a few important optimizations we provide only the explanation. The bytecode sequences are presented primarily for illustrative purposes and may not precisely match the actual generated code.

Above is a simple Java method compute, which iterates over an array, checks for even numbers and accumulates their squares into a running sum. While the logic is straightforward, it is rich enough to trigger several compiler optimizations, making it an ideal example to illustrate how bytecode evolves during JIT compilation. We could have chosen a more complex method to showcase additional optimizations, but that would compromise readability. Instead, we aimed for a balance, keeping the code simple enough to follow while still complex enough to provide the compiler with sufficient opportunities for optimization.
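The original listing is not reproduced here, but based on the description above, the method might look roughly like this. The class name and the square helper are illustrative assumptions, not the article's actual code:

```java
// Hypothetical reconstruction of the compute method described above:
// it walks an int array, keeps only even elements, and sums their squares.
public class Compute {
    static long compute(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] % 2 == 0) {
                sum += square(data[i]);   // helper call, a candidate for inlining
            }
        }
        return sum;
    }

    static long square(int v) {
        return (long) v * v;
    }

    public static void main(String[] args) {
        System.out.println(compute(new int[]{1, 2, 3, 4})); // 2*2 + 4*4 = 20
    }
}
```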
Original bytecode (output of javac):

Method Inlining
Inlining is an optimization that replaces a method call with the actual code of that method at the callsite. This removes the extra steps of copying the arguments of the target method, preparing for the call, jumping to the method and returning, making the program run faster and more efficiently. It involves a space-time trade-off: more aggressive inlining can yield faster code, but it increases the time spent compiling and grows the code size considerably.
Before opt:

After opt:

As you can see, the call to the static method square at pc 22 has been replaced with the method body of the target itself and seamlessly integrated into the caller. This is not mere copy-pasting. The callee’s local variables must be reconciled with the caller’s arguments, memory slots redefined, and the pc indices adjusted and expanded accordingly.
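At the source level, the transformation can be pictured as follows. This is only a sketch; the JIT performs inlining on its internal trees, not on Java source:

```java
public class InlineSketch {
    static long square(int v) { return (long) v * v; }

    // What the programmer writes: a call to square() inside the loop.
    static long withCall(int[] a) {
        long sum = 0;
        for (int x : a) sum += square(x);
        return sum;
    }

    // Roughly what results after inlining: the callee's body is substituted
    // at the call site, removing call/return overhead.
    static long afterInlining(int[] a) {
        long sum = 0;
        for (int x : a) sum += (long) x * x;   // body of square(), inlined
        return sum;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        System.out.println(withCall(a) == afterInlining(a)); // true
    }
}
```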
Value Propagation
Value propagation is the process of tracking how variables are initialized, assigned and used. When the compiler detects values that are fixed or predictable, it substitutes those directly into the code, eliminating unnecessary variable lookups that need memory loads. It has two variants:
Local Value Propagation
Local value propagation works within a single basic block.
Before opt:

After opt:

In the compute method, the loop pre-header initializes the loop by setting i = 0 and handling any range-check setup for the array. The lmul, lsub, and i2l operations are internal operations used by the JIT for index calculations and efficient bounds-checking. After optimization, the compiler replaces the entire sequence of operations (lconst -16, lconst 4, iconst 0, i2l, lmul, lsub) that always computes 16 with a single instruction 5: lconst 16.
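A source-level analogy of what local value propagation achieves (the constants here are chosen to fold to 16, mirroring the example above; they are not the actual JIT operands):

```java
public class ValuePropagation {
    // Before: intermediate variables hold values that are fixed at compile time.
    static int before() {
        int base = 12;
        int scale = 4;
        int i = 0;
        return base + scale - i;   // every operand is a known constant
    }

    // After: value propagation substitutes the known values and folds the
    // whole expression into a single constant.
    static int after() {
        return 16;
    }

    public static void main(String[] args) {
        System.out.println(before() == after()); // true
    }
}
```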
Global Value Propagation
Global value propagation works across the entire compilation unit, not just one basic block. It tracks how values flow through multiple blocks and uses that information to simplify operations, remove redundant checks, and optimize the code more thoroughly. NULLCHECKs are the most common checks removed by this optimization.
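The null-check case can be pictured at the source level like this: once an explicit guard proves a reference non-null, the implicit null checks the JIT would otherwise emit for later accesses can be dropped (sketch only; the JIT reasons about its IL, not Java source):

```java
public class NullCheckSketch {
    // Global value propagation tracks that 'a' is proven non-null after the
    // explicit guard, so the implicit null checks for a.length and a[0]
    // further down the flow can be removed.
    static int sumFirstAndLength(int[] a) {
        if (a == null) {
            return 0;                 // from here on, 'a' is known non-null
        }
        return a.length + a[0];       // no further null checks needed
    }

    public static void main(String[] args) {
        System.out.println(sumFirstAndLength(new int[]{5})); // 1 + 5 = 6
        System.out.println(sumFirstAndLength(null));         // 0
    }
}
```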
Partial Redundancy Elimination
Partial Redundancy Elimination is a technique that reduces or removes repeated computations by calculating a value once, storing it, and reusing that stored result whenever needed.
Before opt:

After opt:

In the original code, the array length was calculated multiple times, but after optimization, it is computed once, stored in a local variable, and reused wherever needed.
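Expressed at the source level, the effect resembles the following sketch (the JIT applies this on its trees, and arrays in Java cache their length anyway, so this only illustrates the shape of the transformation):

```java
public class PartialRedundancy {
    // Before: a.length is re-read on every loop iteration and again afterwards.
    static int before(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i];
        return sum + a.length;
    }

    // After: the length is computed once, stored, and reused everywhere.
    static int after(int[] a) {
        int len = a.length;           // single computation, reused below
        int sum = 0;
        for (int i = 0; i < len; i++) sum += a[i];
        return sum + len;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        System.out.println(before(a) == after(a)); // true
    }
}
```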
Common Subexpression Elimination
Common Subexpression Elimination eliminates repeated calculations. It detects expressions or operations that produce the same result, computes the value once, stores it in a temporary variable, and reuses that value wherever needed. This, much like value propagation optimisation, also has a local and global variant.
Before opt:

After opt:

The optimization reuses the value already stored in variable 10 instead of loading it again from slot 11. This removes redundant loads.
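A minimal source-level sketch of the same idea, with a shared subexpression computed once into a temporary:

```java
public class CommonSubexpr {
    // Before: (x * y) is computed twice.
    static int before(int x, int y) {
        int p = x * y + 1;
        int q = x * y - 1;
        return p + q;
    }

    // After: the common subexpression is computed once and reused.
    static int after(int x, int y) {
        int t = x * y;                // shared value computed a single time
        return (t + 1) + (t - 1);
    }

    public static void main(String[] args) {
        System.out.println(before(3, 4) == after(3, 4)); // true
    }
}
```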
Compact Null Checks
Compact Null Checks optimization removes repeated null checks by reusing the result of an earlier check. Once an object is confirmed to be non-null, later accesses to that object do not perform additional null-check instructions.
Before opt:

After opt:

Before Compact Null Checks optimization, the bytecode performs two null-checks on a: once before calling a.length() and again before adding 10 to sum. After the optimization, it skips the second check and uses the result of the first check (the X!=0 flag) to safely execute both operations under a single null-check.
OSR Guard Insertion
On-Stack Replacement (OSR) guards help the JVM switch safely from fast compiled code back to slower interpreted code when something changes at runtime. The JIT compiler makes assumptions (for example, that methods won’t be redefined), and if those assumptions break, the compiled code would no longer be valid. So the JIT adds checkpoints (OSR guards) in the code to watch for these changes. If something changes, the OSR guard triggers a jump back to safe code, and the program continues without restarting. This lets the JIT take calculated risks and optimise the code under the protection of a guard.
Basic Block Extension
A compiler optimization that helps the JIT compiler see and optimize more code together by treating two or more basic blocks as a single extended block. Normally, each basic block is optimized separately, but sometimes a value stored in one block is used immediately in the next. If the compiler keeps these blocks separate, it can miss chances to optimize. By extending a basic block to include the next one, the compiler can treat variables as if they’re in the same block, allowing it to eliminate unnecessary loads or computations. This enables optimizations like localCSE to work more effectively, producing faster and cleaner code.
General Loop Unroller
The generalLoopUnroller is a compiler optimization that makes loops run faster by reducing the overhead of repeatedly checking the loop condition. It does this in two ways: unrolling and peeling. Unrolling means repeating the loop body multiple times in one go, so fewer checks are needed. Peeling means handling the first few iterations separately, which helps clean up edge cases and makes other optimizations, like reusing repeated calculations, more effective.
Before opt:

After opt:

Initially, the loop increments the counter by 1 and runs 10 iterations. After optimization, the loop increments by 2 and performs two iterations’ worth of work in a single pass, cutting the number of loop checks in half. This reduces conditional-check overhead and makes the loop run faster while keeping the logic the same.
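At the source level, unrolling by a factor of 2 looks roughly like this sketch; the residual iteration handles odd trip counts, much as peeling handles edge cases:

```java
public class UnrollSketch {
    // Before: one element per iteration, one condition check each time.
    static int before(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i];
        return sum;
    }

    // After: two elements per iteration, half the condition checks.
    static int after(int[] a) {
        int sum = 0;
        int i = 0;
        for (; i + 1 < a.length; i += 2) {
            sum += a[i];
            sum += a[i + 1];
        }
        if (i < a.length) sum += a[i];   // residual iteration for odd lengths
        return sum;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5};
        System.out.println(before(a) == after(a)); // true
    }
}
```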
Idiom Recognition
Idiom recognition detects common, frequently used code patterns (idioms) during compilation and replaces them with more efficient, low-level operations. This improves performance by avoiding repetitive high-level execution and leveraging optimized platform-specific implementations.
Before opt:

After opt:

In this example, the optimisation detects a common pattern where a loop copies elements one by one from src[j] to dst[j]. Instead of executing the loop repeatedly, it replaces the loop with a single, efficient arraycopy operation, which uses highly optimized native code.
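The same substitution can be written out at the source level, where the element-by-element loop and the bulk System.arraycopy call are behaviourally equivalent:

```java
public class IdiomSketch {
    // The element-by-element copy loop that idiom recognition detects.
    static void loopCopy(int[] src, int[] dst) {
        for (int j = 0; j < src.length; j++) dst[j] = src[j];
    }

    // The equivalent bulk operation the JIT substitutes, backed by
    // highly optimized native code.
    static void bulkCopy(int[] src, int[] dst) {
        System.arraycopy(src, 0, dst, 0, src.length);
    }

    public static void main(String[] args) {
        int[] src = {7, 8, 9};
        int[] d1 = new int[3], d2 = new int[3];
        loopCopy(src, d1);
        bulkCopy(src, d2);
        System.out.println(java.util.Arrays.equals(d1, d2)); // true
    }
}
```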
Loop Strider
LoopStrider improves loop performance by minimizing repeated memory address calculations. Instead of recomputing the full address of array elements in every iteration, it introduces a helper variable (the strider) that holds the current memory position and increments or decrements it by a fixed stride each time. The compiler can simplify complex address calculations, keep the strider in a CPU register, and insert initialization and increment code around the loop, reducing overhead and enabling faster execution.
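Java source cannot express raw address arithmetic, but the strength-reduction idea behind the strider can be sketched with explicit offsets: the per-iteration multiply is replaced by an incrementing variable (all names here are illustrative):

```java
public class StriderSketch {
    // Before: each iteration recomputes the offset as base + i * stride.
    static long before(long base, int n, int stride) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            long addr = base + (long) i * stride;  // full calculation each time
            sum += addr;
        }
        return sum;
    }

    // After: a strider variable holds the current offset and is bumped by
    // the stride each iteration, replacing the multiply with an add.
    static long after(long base, int n, int stride) {
        long sum = 0;
        long addr = base;                          // strider, ideally in a register
        for (int i = 0; i < n; i++) {
            sum += addr;
            addr += stride;                        // strength-reduced increment
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(before(1000, 5, 4) == after(1000, 5, 4)); // true
    }
}
```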
Global Dead Store Elimination
Global Dead Store Elimination detects stores to variables or memory locations whose values are never read later in the program and safely removes them.
Before opt:

After opt:

Before the optimization, the code multiplies two values and stores the result in a temporary variable, then immediately loads it to perform an addition. This unnecessary store to the temporary variable is removed, as its value is never read again afterwards.
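A source-level analogy: the store into a temporary slot disappears and the value flows straight into its only consumer (at the bytecode level this means the value stays on the operand stack instead of round-tripping through a local slot):

```java
public class DeadStore {
    // Before: the product is parked in a temporary local that exists only
    // to feed the addition on the next line.
    static int before(int x, int y) {
        int tmp = x * y;   // store to a local slot that serves no other purpose
        return tmp + 1;
    }

    // After: the store is eliminated; the value feeds the add directly.
    static int after(int x, int y) {
        return x * y + 1;
    }

    public static void main(String[] args) {
        System.out.println(before(3, 4) == after(3, 4)); // true
    }
}
```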
General Store Sinking
General store sinking delays writing a variable’s value to memory until it is actually needed. Instead of storing the value immediately after it’s computed, the store is moved to a later point in the code, improving memory access performance in some platforms.
Before opt:

After opt:

In this example, the array length is calculated early and stored in a variable, even though it’s not immediately needed. After optimization the storage of the array length is delayed and moved closer to where it’s actually used. This reduces unnecessary operations and improves efficiency by keeping the store operation only where it is required.
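As a rough source-level sketch (the JIT sinks stores in its IL; the field here is an illustrative stand-in for a memory location):

```java
public class StoreSink {
    static int result;   // stand-in for a memory location the store targets

    // Before: 'result' is written early, then unrelated work runs before
    // anything reads it.
    static int before(int[] a, int b) {
        result = a.length;        // store happens long before its use
        int other = b * b;        // unrelated work in between
        return result + other;
    }

    // After: the store is sunk to just before its consumer, keeping the
    // value in a register during the unrelated work.
    static int after(int[] a, int b) {
        int other = b * b;
        result = a.length;        // store moved next to its use
        return result + other;
    }

    public static void main(String[] args) {
        int[] a = {1, 2};
        System.out.println(before(a, 3) == after(a, 3)); // true
    }
}
```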
Tactical Global Register Allocation
Tactical Global Register Allocator (GRA) assigns frequently used variables to CPU registers across multiple basic blocks of a method. By keeping values in registers instead of accessing memory repeatedly, it reduces load/store operations, improves execution speed, and enables more efficient use of the processor.
Live Range Splitter
LiveRangeSplitter breaks a long-lived local variable into multiple short-lived variables confined to the regions where they are used, decreasing register pressure.
Before opt:

After opt:

Originally, the value of x is stored and kept alive in a register from its first computation until it is later used for z, which keeps the register occupied for a long time. After optimization, x is recomputed when needed for z and stored in a new variable. This splits the lifetime of the value into two shorter ranges, freeing the original register for other variables and reducing register pressure.
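A source-level sketch of the split (heavyWork and the variable names are illustrative assumptions):

```java
public class LiveRangeSplit {
    // Before: x is computed early and must stay live (holding a register)
    // across the unrelated work until it is finally used for z.
    static int before(int a) {
        int x = a * a;
        int unrelated = heavyWork(a);   // x occupies a register all along
        int z = x + 1;
        return z + unrelated;
    }

    // After: the long live range is split; the value is recomputed near its
    // use, freeing the register during heavyWork.
    static int after(int a) {
        int unrelated = heavyWork(a);
        int x2 = a * a;                 // recomputed close to its use
        int z = x2 + 1;
        return z + unrelated;
    }

    static int heavyWork(int a) { return a + 42; }  // placeholder workload

    public static void main(String[] args) {
        System.out.println(before(3) == after(3)); // true
    }
}
```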
With all these optimizations in place, let us look at how the generated binary appears:

At first glance, it may seem huge! Much larger than the original bytecode produced by javac. But remember, bytecode instructions are not native instructions; each one is a command to the JVM. Executing one requires interpretation, which itself expands into many machine-level instructions. Without optimization, a direct translation of bytecode into native instructions would be much larger than what we see here.
Proof:
Time taken to execute this function without JIT: 575 ms
Time taken to execute this function with JIT: 34 ms
Conclusion
JIT compiler optimizations make Java programs faster and more efficient by reducing unnecessary work and focusing CPU effort on the hottest code paths. Unlike static compilation, JIT optimizations adapt as the program runs. This dynamic process transforms portable Java bytecode into highly optimized machine code that performs like native while preserving safety and portability. Ultimately, JIT compilation ensures Java programs run leaner and faster, showing that what you write is not always what runs, and what runs is often smarter than you imagined!
So what is the message to Java developers? Understanding the optimisations that happen under the hood serves two purposes:
- Some transformations are expensive for the JIT to infer at runtime. Knowing them allows developers to shape code in ways that save both time and space upfront.
- Others are purely runtime driven, triggered by dynamic behaviour invisible at the source level. Here, the developer’s role is not to intervene and rewrite their source, but to appreciate how much smarter the JVM becomes as the program runs.

This article is part of the magazine issue ‘Java 25 – Part 1’.