Some recent studies show that IT now emits more CO2 than civil aviation! As software developers, there is something we can do: optimise our code. Making our code run more efficiently should reduce the carbon footprint of IT: it will consume less CPU, RAM, disk, and network. Since hardware is powered by electricity, these optimisations should reduce our electricity bills and lower the emissions due to generating that electricity. Optimised software also extends hardware lifespan by reducing the need for upgrades, diminishing the embodied carbon from manufacturing and avoiding raw material extraction (e.g., rare earth mining).
“Yes, it is possible to continue to innovate while keeping existing hardware and without buying new ones. To do this, we can count on software optimisation.”
Tristan Nitot, 2024
Let’s do that then!
The problem: the mountain of code worldwide! How can we optimise all the code in the world? Even if we target only key projects (popular open source libraries, frameworks, etc.), there are a lot of them!
But now we have Generative AI. It’s not perfect: many developers practising “Vibe Coding” (AI-assisted coding) complain it takes even more time than regular coding. But if we guide an AI to focus on a small bit of code with a known problem, it can work. If a project has good test coverage and benchmarks, it’s even better because we can verify our AI didn’t break anything and actually improved performance.
That’s what I’ll demonstrate here: can we use GenAI to lower energy use by software? And is it worth it? Will it save more energy than our coding AI will use to optimise it?
⚠️ Disclaimer: For this article, and for the sake of simplicity, we will focus on the impacts of CPU usage of a Java program, and ignore RAM, Disk, Network, etc.
💻 Source code: You can find the full source code of this demonstration in the repo: https://github.com/obierlaire/java-optim-example
Table of Contents
- Optimization Workflow
- Our Case Study
- Challenge 1: Finding a Good Scenario
- Challenge 2: Deterministic Environment
- Challenge 3: Output a Machine-Readable List of Hotspots
- Challenge 4: Measuring Energy Used
- Challenge 5: Creating a Coding Agent with Development Workflow
- Challenge 6: Translating Energy to Carbon Emissions
- Challenge 7: Estimating Energy Used by AI-Powered Optimisation
- Was It Worth It?
- How Can We Improve the Break-Even Point?
- Conclusion
Optimization Workflow
We cannot simply prompt an AI with “Hey ChatGPT, read this project and optimise it!”. Maybe that could work for a one-file script with an obvious complexity issue, but there are several questions:
- What do you call “optimise it”?
- Where are the inefficiencies?
- What is the relevant scenario?
- Will the fix work? Pass the tests?
- Will the fix actually reduce energy used?
Human developers follow a slow but effective process: they run benchmarks, analyse them manually (e.g., with a flame graph), pick one issue, try to fix it, and enter a loop of fix/build/test/fix… then check whether the benchmark shows improved results. Finally, they move on to the next issue.
Can we automate this workflow? Probably yes, now that we have coding AI agents, even the “fix the issue” step can be automated. Everything else seems realistically easy to automate with existing technologies: build, benchmark, workflow with decisions, git versioning.

Let’s refine this workflow. To automate it, we first need to produce a list of issues/hotspots to optimise, then measure the energy consumed before and after each fix to check that it actually decreased. Once an issue is optimised, loop and move to the next one.
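The accept/reject logic of this loop can be sketched in a few lines. This is an illustrative Python sketch, not the actual tooling from the repo: the callbacks (agent prompt, build, revert, energy tracker) are hypothetical stubs, and the key rule is that an optimisation is kept only when the measured saving exceeds the measurement noise.

```python
def improvement_is_significant(before_j, after_j, rel_std):
    """Keep an optimisation only if the measured saving exceeds
    the measurement noise (relative standard deviation across runs)."""
    return (before_j - after_j) / before_j > rel_std

def optimisation_loop(hotspots, measure, optimise, build_and_test, revert, rel_std):
    """One pass over ranked hotspots: optimise, verify, measure, keep or revert.
    All callbacks are hypothetical stubs for this sketch."""
    baseline = measure()
    for hotspot in hotspots:
        optimise(hotspot)                 # e.g. prompt the coding agent
        if not build_and_test():          # broken build/tests: discard the change
            revert()
            continue
        after = measure()
        if improvement_is_significant(baseline, after, rel_std):
            baseline = after              # accept; measure the next fix against this
        else:
            revert()                      # inconclusive saving: discard
```

With a baseline of 488 J, a result of 273 J, and 8% noise, `improvement_is_significant` accepts the change; a 470 J result would be rejected as inconclusive.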

Of course, we should not let the machine run unsupervised, even with the best test suite. This process should happen in a separate branch: humans should always own the code and decide whether to accept and merge those AI-generated optimisations.
There are limitations to this simplistic workflow, for example:
- It won’t find and fix architecture issues and bad technical design
- It won’t challenge useless functionalities (let’s leave that assessment to humans!)
Our Case Study
To illustrate this and run a little experiment, we’ll pick a “random” Java project. Not so random actually; it should be:
- simple enough for fast development iteration
- able to run a scenario that will put some load on the program
- mature enough to have tests, benchmarks, documentation
- not so mature that there is no room left for improvement
- a real, open source project (writing deliberately bad code would feel like cheating)
With these constraints in mind, let’s pick https://github.com/rjeschke/txtmark. It seems to check all the requirements for this experiment. The idea here is not finger-pointing: all projects have their flaws, and this choice really is random (I didn’t even know of this project before).
📌 To ensure reproducibility, I’ve created a fork. In my fork, you will find the “optimisation” branch that contains the result of that optimisation: https://github.com/obierlaire/txtmark
Challenge 1: Finding a Good Scenario
Static code analysis is great for finding technical debt, quality flaws, and sometimes algorithmic inefficiencies. But these don’t always translate into measurable energy waste. For example, a function with O(n²) complexity may look bad on paper, but if it only ever runs on small inputs, optimising it won’t save energy. In that case, the energy spent on the optimisation, especially with big multi-purpose AI tools, may not have a good return on investment. To really cut energy waste, we need to run and profile the program with a realistic workload to identify “hotspots”.
For profiling, we need a scenario that is:
- Realistic enough
- Intensive enough for highlighting hotspots (otherwise inefficiencies might be hidden by regular program overhead)
- Narrow enough: target only one nominal case
- Deterministic: probably the most important. If the scenario contains randomness, the results could vary significantly!
For our case study, txtmark, we are going to focus on the nominal case: a regular text file. To make hotspots shine enough, we’ll run it on the anthology of Shakespeare, a 5MB text file. After trials, that 5MB size was the sweet spot, revealing hotspots clearly while being processed fast enough.
Challenge 2: Deterministic Environment
Now that we have our case study (txtmark) and our scenario (“shakespeare.txt” – 5MB), let’s run it:
$ time java -cp target/classes com.github.rjeschke.txtmark.cmd.Run /workspace/test/shakespeare.txt
...
By whom our heavy haps had their beginning :
Then , afterwards , to order well the state ,
That like events may ne'er it ruinate .</p>
real 0m0.555s
user 0m0.684s
sys 0m0.297s
As expected, it’s processed well, without error, in a reasonable time.
To minimise variability, we need to minimise environmental noise:
- Isolated environment. Ideally a dedicated physical machine, but here, running within a Docker container was good enough
- Java runtime flags, to limit unpredictable garbage collection:
  - Massive heap (-Xms and -Xmx)
  - SerialGC for deterministic behaviour (single-threaded, predictable): -XX:+UseSerialGC
- System settings: CPU pinning (taskset -c 0 java ...)
Challenge 3: Output a Machine-Readable List of Hotspots
Once we have a stable environment, let’s profile it. There are many profilers available, but to feed an AI we want one that generates a report showing:
- Location: file path and line number
- CPU percent
- Memory, Disk, Network (we will ignore those for this article)
JFR (Java Flight Recorder) is available by default in OpenJDK. We just need to instruct Java to run with JFR and write the report to a file (profile.jfr):
$ java -cp target/classes \
-Xint -Xms6144m -Xmx8192m -XX:+UseSerialGC \
-XX:+UnlockCommercialFeatures -XX:+FlightRecorder \
-XX:StartFlightRecording=duration=30s,filename=/workspace/results/profile.jfr,settings=profile,disk=true \
com.github.rjeschke.txtmark.cmd.Run \
/workspace/test/test.txt
But the .jfr file produced won’t be directly usable by our AI; we need to extract hotspots from it. To do so, let’s write a custom JFR parser that produces a JSON report listing hotspots (see our custom code for details):
- Running jfr print --events jdk.ExecutionSample
- Parsing stack traces for the target package
- Counting samples per source line
- Converting to CPU percentages
- Outputting structured JSON with hotspots ranked by CPU usage
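These parsing steps can be sketched as follows. This is a simplified Python illustration, not the actual parser from the repo: the frame pattern is an assumption (the exact `jfr print` output format varies between JDK versions), and percentages are computed relative to the samples that hit the target package.

```python
import re
from collections import Counter

# Hypothetical, simplified frame pattern like:
#   com.example.Foo.bar(Foo.java:42)
# The real `jfr print` stack-trace layout differs between JDK versions.
FRAME = re.compile(r"(?P<cls>[\w.$]+)\.(?P<method>\w+)\((?P<file>[\w$]+\.java):(?P<line>\d+)\)")

def extract_hotspots(jfr_text, package):
    """Count ExecutionSample frames per source line for the target package
    and rank them by CPU share (percentage of matching samples)."""
    counts = Counter()
    total = 0
    for match in FRAME.finditer(jfr_text):
        if not match["cls"].startswith(package):
            continue  # ignore JDK/third-party frames
        counts[(match["file"], int(match["line"]))] += 1
        total += 1
    return [
        {"file": f, "line": l, "cpu_percent": round(100 * n / total, 1)}
        for (f, l), n in counts.most_common()
    ]
```

Serialising this list with `json.dumps` gives the machine-readable hotspot report the AI will consume.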
The resulting file looks like:

Challenge 4: Measuring Energy Used
With similar constraints to profiling, we want to measure the energy used by the program so we can compare before/after optimisation.
Ideally, you would run your program on an isolated machine, plug in a watt-meter, and actually measure energy consumed. But that solution is difficult to scale. Fortunately, there are many tools we can use to estimate the energy used by our program:
- CodeCarbon – lib, estimate energy & CO₂
- pyJoules – lib, measure via RAPL/NVML
- pyRAPL – lib, Intel RAPL wrapper
- AMD uProf – CLI/GUI, AMD power profiler
- Kepler – exporter, K8s energy metrics
- etc.
The main challenge with these tools is that they need privileged access to know CPU model, read energy endpoints, etc. But do we need this accuracy? Even with a rough estimation of energy used, if after optimisation it’s halved, that’s enough to demonstrate improvement.
CodeCarbon meets our needs: it tries to determine the CPU model and take accurate samples, but when running in a Docker container under macOS on Apple Silicon, it falls back to estimates. It also estimates carbon emissions. CodeCarbon is a Python library, so we need a minimal script to run it and format the result:

To control variance and get an average, our custom script runs it multiple times:

This shows that txtmark used an average of 488 J (0.135 Wh) of electricity on my machine to parse our 5MB shakespeare.txt file. That’s our baseline. The measurements have a relative standard deviation of 8%, which means that if an optimisation yields less than an 8% improvement, the results will be inconclusive.
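The baseline and its 8% noise figure boil down to a mean and a relative standard deviation over repeated runs. A minimal Python sketch of that aggregation (illustrative, not the actual script from the repo):

```python
import statistics

def summarise(joules):
    """Mean energy over repeated runs, plus the relative standard
    deviation used as the significance threshold for optimisations."""
    mean = statistics.mean(joules)
    rel_std = statistics.stdev(joules) / mean
    return mean, rel_std
```

Any optimisation whose relative saving is below `rel_std` is indistinguishable from run-to-run noise.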
Challenge 5: Creating a Coding Agent with Development Workflow
At this stage, we have:
- A benchmark scenario (our 5MB shakespeare.txt file) used for profiling and energy measurement
- A profiling report showing a list of hotspots
- An energy report
Now we need a coding AI. There are many LLMs capable of coding. But having an LLM endpoint isn’t enough: we need an agent that can interact with our development machine to read/write code, run builds and tests, and trigger benchmarks.
This can be solved using tool agents (or MCP) with a workflow to orchestrate:
- Reading the profiling report file
- Reading source code
- Writing updated code
- Inserting the updated code in the file
- Building
- Running the tests
- Looping: rinse and repeat in case of failure
For this case study, let’s use “Claude Code”. It’s not open source and not free at all, but it’s easy to install and meets our requirements. With the right permissions, when we write a prompt, it can plan actions, write code, and execute shell commands on our machine.
What should we prompt? Let’s break it down into two parts:
1 – Analysis prompt:
Analyze the Java project in target/ directory only. Write analysis to results/analysis.md.
Cover: project structure, dependencies, code quality, main functionality, tech stack, recommendations.
2 – Optimisation prompt:
# Always read results/analysis.md to understand the project
Look at results/hotspots_profile.json and fix the code in target to make sure those hotspots are optimised. The code should of course run well after that. Optimise only the first 3 hotspots.
Run the tests and fix any error.
Note that I’m limiting it to 3 hotspots. Depending on the hotspots and the project, it might be better to batch hotspots or optimise them one by one.
You can use Claude Code interactively or as a CLI. For this use case, let’s use the CLI in non-interactive mode:
$ cat tools/prompts/analyse.txt | claude -p --output-format text
Analysis complete. I've written a comprehensive analysis of the Java project in the target/ directory to results/analysis.md covering:
- **Project Structure**: Txtmark markdown processor with standard Maven layout
...
The first step generated a thorough analysis:

The second step takes a few minutes. Tip: you can monitor progress with claude --continue in another shell:

Once done, it outputs a summary:

Note that the tests are still passing. We can see the files that have been modified:


Now, let’s run the scenario with our energy tracker based on CodeCarbon:

That’s much better: remember, our baseline was 488 J, and now it uses only 273 J to process the same file:
| Metric | Before | After | Difference (%) |
|---|---|---|---|
| Energy | 488 J (0.135 Wh) | 273 J (0.076 Wh) | −44.1% |
| Time | 10.74 s | 6.02 s | −44.0% |
| CO₂ | 0.052 g | 0.029 g | −44.2% |
Energy consumed has dropped by 44%! Since 44% is significantly more than our standard deviation of 8%, we can confidently call this a successful improvement!
Challenge 6: Translating Energy to Carbon Emissions
We’ve focused so far on the energy used by our program, not carbon emissions. Translating energy use to CO2 is challenging because it depends on various factors:
- Your electricity provider’s sources
- Time of day and weather: solar/wind electricity production varies significantly
- Region/country: energy mix varies greatly between countries
A useful online resource for checking the carbon intensity of the grid in real-time is: https://app.electricitymaps.com/

Fortunately, tools like CodeCarbon can perform these calculations. Based on CPU usage and model, it calculates energy consumed and, using your region’s average carbon intensity, estimates the carbon footprint.
But limitations exist if:
- Your program runs in the cloud: do you know whether the data centre uses “green” electricity?
- Your program runs on customer devices: what devices do they have, how many customers?
Even if it’s hard to estimate this carbon footprint, if our optimisation reduced energy consumption by 44%, that’s still a 44% reduction of carbon emissions, regardless of location and hardware (with some edge case exceptions).
Challenge 7: Estimating Energy Used by AI-Powered Optimisation
To determine if our optimisation was worth it, let’s compare:
- Carbon emissions from AI-powered optimisation
- Carbon savings thanks to our optimisation
A generative AI LLM consumes energy in two phases:
- Training and fine-tuning
- Inference (running) time
Like renting a car, you pay for fuel (usage) but also a portion of the car’s retail price. The energy used for training needs to be prorated across all requests during the LLM’s lifetime. Rule of thumb: to minimise training energy, choose smaller LLMs trained on energy-efficient machines in regions with green electricity.
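The amortisation idea can be written down as a tiny formula. The helper below is a hypothetical illustration (the numbers in the usage note are made up for the example, not real figures for any model): training energy divided by lifetime requests, plus the inference energy of this request.

```python
def energy_per_request(training_kwh, lifetime_requests, inference_kwh):
    """Total energy attributed to one request: amortised training + inference.
    All inputs are illustrative; real training/lifetime figures are rarely public."""
    return training_kwh / lifetime_requests + inference_kwh
```

For instance, a (made-up) 1 GWh training run amortised over 10 million requests adds only 0.1 kWh per request on top of inference, which is why inference usually dominates for heavily used models.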
For inference time, there are two extreme cases (with a lot in between):
- You have your own LLM hosted on your dedicated server with a watt-meter: ideal!
- You’re using a commercial hosted agent like Claude Code or ChatGPT.
Many factors affect how much energy an LLM needs to process your request:
- The “size” of the LLM (number of parameters)
- The number of tokens (roughly “words”) in your request
- How the LLM is hosted (region, grid, hardware, …)
- The “efficiency” of the LLM
A useful tool to estimate carbon intensity of major Gen AI providers is: EcoLogits Calculator.

To determine how many tokens we used with Claude Code: in interactive mode, you can see the cost and token usage with /cost at the end of a session. In non-interactive mode, you can request machine-readable output:
claude -p --verbose --output-format stream-json < prompt.txt | tee usage.json
![Example of token usage metrics from Claude Code showing input/output tokens for individual operations](https://javapro.io/wp-content/uploads/2025/08/14_claude_token_usage_example-1-1024x954.png)
Unfortunately, this format doesn’t directly give total tokens. When we send a prompt to Claude Code, it creates a conversation session, and each interaction, reasoning step, and tool call has its usage traced in its own JSON line. Check the full usage log of our optimisation session. A custom Groovy script aggregates the token usage and produces the following summary:

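For illustration, the aggregation that the Groovy script performs can be sketched in Python. The field names (`message.usage.input_tokens` / `output_tokens`) are assumptions about the stream-json schema, so treat this as a sketch rather than a drop-in tool:

```python
import json

def total_tokens(jsonl_text):
    """Sum token usage across all JSON lines of a Claude Code session log.
    Field names are assumptions; adjust them to the actual stream-json schema."""
    totals = {"input_tokens": 0, "output_tokens": 0}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        # lines without a usage object (system events, etc.) contribute zero
        usage = json.loads(line).get("message", {}).get("usage", {})
        for key in totals:
            totals[key] += usage.get(key, 0)
    return totals
```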
EcoLogits calculator only accounts for output tokens, a known limitation. Both our analysis and optimisation requests generated 9,016+8,890 = 17,906 tokens. In our case, we used Anthropic, Claude Sonnet 3.5:

This gives us 1.57 kWh of energy for our analysis and code optimisation, equivalent to about 100 iPhone 16 charges or 10 km in a small electric vehicle!
To translate this into carbon emissions, we need to know where Claude is hosted. For simplicity, let’s assume Claude’s servers and our experiment both run in Germany.
EcoLogits and CodeCarbon use different sources for carbon intensity data. For consistency, let’s use CodeCarbon’s figure for Germany: 0.3809 kgCO2e/kWh.
1.57 kWh × 0.3809 = 0.60 kgCO2e
Our analysis and optimisation prompts generated 0.6 kg of CO2e, equivalent to 3-4 km by petrol car or 2 cups of coffee.
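The conversion above is a single multiplication; as a tiny helper (hypothetical, for illustration):

```python
def co2e_kg(energy_kwh, grid_intensity_kg_per_kwh):
    """Convert energy to carbon emissions for a given grid carbon intensity."""
    return energy_kwh * grid_intensity_kg_per_kwh
```

With Germany’s 0.3809 kgCO2e/kWh, `co2e_kg(1.57, 0.3809)` reproduces the 0.60 kgCO2e figure above; swapping in another grid intensity gives the emissions for a different region.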
Was It Worth It?
We’ve seen that running txtmark on the Shakespeare anthology initially took 488 J (0.135 Wh), then 273 J (0.076 Wh) after optimisation, each run saving 215 J (0.059 Wh).
Let’s work backwards to find the break-even point: “How many large text files do we need to parse with txtmark to offset our AI-powered optimisation?”:
1570 Wh ÷ 0.059 Wh ≈ 26,610
To have a positive return on investment, we would need to parse at least 26,611 huge text files with txtmark. Without knowing how widely this tool is used, it’s hard to determine if this is realistic.
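The break-even arithmetic can be expressed as a small helper (hypothetical, for illustration); rounding up gives the first run count with a positive return:

```python
import math

def break_even_runs(ai_cost, saving_per_run):
    """Number of runs needed before the optimisation pays back the AI's
    energy (or carbon) cost. Both arguments must use the same unit."""
    return math.ceil(ai_cost / saving_per_run)
```

`break_even_runs(1570, 0.059)` gives the 26,611 runs quoted above; the same helper works for kgCO2e when comparing locations and models later.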
How Can We Improve the Break-Even Point?
Let’s consider CO2 emissions:
- Carbon emissions of our optimisation: 0.60 kgCO2e
- Carbon savings per run: 0.000023 kgCO2e
Option 1: Choose locations wisely. If your AI runs on machines powered by low-carbon electricity, the break-even point drops. France’s carbon intensity is around 44 gCO2/kWh, so our optimisation would emit 1.57 kWh × 0.044 ≈ 0.069 kgCO2e. If we still run our program in Germany, the break-even becomes 0.069 ÷ 0.000023 ≈ 3,000. Much better!
Option 2: Use smaller, more energy-efficient LLMs. According to EcoLogits, our 17,906 output tokens could have used:
| Model | Energy (Wh) | CO₂ in Germany (kgCO₂e) | CO₂ in France (kgCO₂e) |
|---|---|---|---|
| Anthropic / Claude 3.5 – Sonnet | 1570 | 0.604 | 0.069 |
| Anthropic / Claude 3.5 – Haiku | 92 | 0.035 | 0.004 |
| OpenAI / ChatGPT 4o | 1570 | 0.604 | 0.069 |
| Mistral / Codestral 22B | 103 | 0.040 | 0.004 |
| Meta / CodeLlama 70b | 223 | 0.086 | 0.010 |
| Google / CodeGemma | 67 | 0.026 | 0.003 |
The best combination of location/model is CodeGemma in France. With that choice, the break-even would be 0.003 kgCO2e ÷ 0.000023 kgCO2e ≈ 130. The optimisation would have a positive return on investment after about 130 runs: realistically achievable!
Using AI to optimise Java code is feasible. This demonstration has limitations (it won’t optimise architecture or design), but inefficiencies were discovered automatically and optimised entirely without human intervention! But keep in mind it’s still the developer’s responsibility to accept or reject code produced by AI.
Conclusion
By methodically addressing each of those seven challenges, we’ve created a repeatable automated process that can be applied to other codebases in Java, with potentially significant energy savings.
This remains uncharted territory, and the technique described won’t work in every case. For such an exercise to succeed, you need:
- A profiler that can generate hotspot reports
- A good intensive benchmark scenario
- A coding AI tool agent that can run local shell commands
- An energy tracker (like CodeCarbon)
- Knowledge of where your coding AI runs to choose the “greenest” AI solution
- Understanding of your program’s usage
Most importantly, do the maths and estimate whether the optimisation is worthwhile. As with human developers, you don’t want to spend significant resources optimising rarely-used code, if only for financial reasons. If you’re confident your AI-powered optimisation will save more CO2 than it consumes, then go ahead and reduce our carbon footprint, one line of code at a time!

This article is part of the magazine issue “Java 25 – Part 1”.
You can read the complete issue with all contributions here.