Did you try turning it off and on first?
(I promise this is the last 1BRC-related post). Read on to discover how large language models (LLMs) performed in generating Java code, their strengths in context-aware development, and challenges such as hallucinations and nuanced errors. Learn what makes a good AI dev tool and what makes a great IDE integration.
A definition of insanity is repeating an action and expecting a different outcome. As developers, we’re used to doing that 🙂 It’s part of our operating manual.
So, it is no surprise that I tried again to get an AI to generate valuable and workable code.
To be fair, things have changed. It’s not too insane to do this again 🙂
Some LLMs have moved on, and I gained longer access to IntelliJ’s AI Assistant (thanks to Helen at IntelliJ). I also stumbled on Zencoder, a new AI coding tool with great promise.
The results from this time are what you’d expect, but I learnt some new things I want to share. The generated code can be found here if you want to see the gory details.
Part 1 – New LLMs on the block
If you want to try this at home and see whether the code works for you, or if you want to try your own LLMs, it’s pretty straightforward to get started. Let me know, and I’ll explain. You can write the how-to guide afterwards. 🙂 Each provider/LLM has a directory with the generated code and a context.md file where I list any extra LLM-provided documentation, any notes from me as we progressed, and other comments.
There are more providers/LLMs to try this time, so the table is longer. The generation results were again much the same as last time. Some produced data in an incorrect format or order, but the problems were trivial and easily fixable. The ‘failed to compile’ honours went to ChatGPT4o, CodeConvert and Gemini 1.5 Flash, but nothing too shocking.
Minimal hallucinations
ChatGPT4o hallucinated a couple of methods, while CodeConvert showed how LLMs can get the nuances of Java wrong:
“local variables referenced from a lambda expression must be final or effectively final”
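To make that nuance concrete, here is a minimal, hypothetical example (not CodeConvert’s actual output) of the pattern that triggers that compiler message, plus the usual fix:

```java
import java.util.List;

public class EffectivelyFinalDemo {
    public static void main(String[] args) {
        List<Integer> temps = List.of(3, -1, 7);

        int max = Integer.MIN_VALUE;
        // Will not compile if uncommented: 'max' is reassigned inside the lambda,
        // so it is no longer effectively final.
        // temps.forEach(t -> { if (t > max) max = t; });

        // The usual fix: let the stream do the accumulation instead of
        // mutating a captured local variable.
        int maximum = temps.stream().mapToInt(Integer::intValue).max().orElseThrow();
        System.out.println(maximum);
    }
}
```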
Gemini 1.5 Flash similarly got confused over the result of a stream, but the real fun came at the beginning. When given the prompt, I got back:
“I’m just a language model, so I can’t help you with that.”
I’m using Google’s Gemini as a general tool, but as a code generator (at least in this instance), it’s not been happy to play ball. The first try gave the response above. When nudged, it generated code.
Generation Results
I assessed the AIs for accuracy and time and recorded whether they produced any extra docs. I ran them using Java 23 and did multiple passes. As you can see, there were errors in various places, but nothing too dramatic.
| Provider / LLM | Compiled | Doc | Ran Sample Set | Math | Format | Order | Ran 1BRC |
|---|---|---|---|---|---|---|---|
| Amazon Q Developer | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| ChatGPT4 Legacy | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| ChatGPT4o | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| ChatGPT4o Mini | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| ChatGPT o1 | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| ChatGPT o1 Mini | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Claude | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
| CodeConvert | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Gemini 1.5 Flash | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| GitHub Copilot | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| Google Vertex Gemini 1.5 Flash | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| Grimoire | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| JetBrains AI Assistant | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Microsoft Copilot | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| Zencoder | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
Performance
I ran several combinations of tests to stress the code. The usual failures were OOMs or thread pool exhaustion. For the 1BRC run, two failed with thread pool exhaustion problems: ChatGPT4 Legacy and CodeConvert. ChatGPT4o completed 1BRC but had issues when running the smaller test sets multiple times. For the results below, I’ve excluded the failing examples.
‘Base’ is the time it took the 1BRC-provided sample to complete.
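For reference, here is a minimal sketch (not one of the generated solutions) of the kind of streaming, bounded approach that sidesteps both failure modes, assuming the standard ‘station;temperature’ 1BRC input format:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

// Minimal sketch: read the file line by line instead of loading it all into memory
// (avoiding OOMs), and take parallelism from the common pool behind parallel streams
// rather than hand-rolled, unbounded thread pools.
public class StreamingAggregate {

    record Stats(double min, double max, double sum, long count) {
        Stats add(double v) {
            return new Stats(Math.min(min, v), Math.max(max, v), sum + v, count + 1);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes the 1BRC input format: "StationName;temperature" per line.
        Path input = Path.of(args.length > 0 ? args[0] : "measurements.txt");
        Map<String, Stats> results = new ConcurrentHashMap<>();

        try (Stream<String> lines = Files.lines(input)) {
            lines.parallel().forEach(line -> {
                int sep = line.indexOf(';');
                String station = line.substring(0, sep);
                double value = Double.parseDouble(line.substring(sep + 1));
                results.merge(station,
                        new Stats(value, value, value, 1),
                        (existing, ignored) -> existing.add(value));
            });
        }

        results.forEach((station, s) -> System.out.printf(
                "%s=%.1f/%.1f/%.1f%n", station, s.min(), s.sum() / s.count(), s.max()));
    }
}
```

Nothing here holds the whole file in memory, and no threads are spawned per chunk, which is typically where the OOMs and pool exhaustion creep in.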

Conclusions?
The results that came in slower than the baseline seemed to suffer from memory availability problems, and although most of the generated code was faster than the baseline, comparing it to the competition results showed how poor the showing was: the winner was 100x faster.
I’m cool with that, because the point is not whether the AIs are good at generating fast code. It shows that the AIs can generally produce roughly average code. Given that all these LLMs are trained on ‘the internet’, we can see where the low bar sits.
We know what an LLM can do with minimal guidance. What’s essential to understand is what we can do to improve the code they generate. Many of the LLMs gave me code and documentation. Zencoder and JetBrains AI Assistant could create test cases.
Remember, LLMs are statistical reasoning engines: they rely on probabilities and patterns. Symbolic reasoning, the ‘value add’ we bring as humans, involves manipulating symbols and rules. Because they are trained on vast amounts of data, LLMs are good at the former but poor at the latter. Unfortunately, while commercial large-scale LLMs can mimic symbolic reasoning well enough to be often correct, there are nuances at every level, leading to hallucinations that can be glaringly obvious or incredibly subtle.
LLMs and programming are not a perfect fit.
Next Steps
With AI dev tools that use LLMs (i.e. all of them at the moment), the value comes in the assistance we can get beyond code generation.
The bottom line is context, parsing, and integration. By that, I mean the three things that can separate a developer AI assistant from a run-of-the-mill (am I jaded already?) AI chatbot.
Part 2 – Curiouser and Curiouser
Context
All the tools I’ve used can generate code, help design systems and critique existing code. If you stay small, provide snippets of code to the AI and ask for limited assistance, the AI will do a reasonable job. To expand beyond that, though, the tool needs more context. You need to be able to use the AI with retrieval-augmented generation (RAG) over your code: the chatbot/code suggester has to know your existing code and codebase. Otherwise, you’re relying on the LLM’s general training. Without access to your code and a specific, unique context to connect to, any answer comes from the training data, not your codebase. Note that even with this context, there is still a chance that the LLM’s training data overwhelms your additional information.
A good AI dev tool will help you connect relevant code bases to the LLM and allow you to include guidance you want to apply to the project and/or the organisation.
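To show the shape of that idea, here is a deliberately naive, hypothetical sketch. Real tools use proper embeddings and a vector store; simple word overlap stands in for similarity here, just to illustrate retrieving relevant source and prepending it to the prompt. The `src` path, the scoring and the prompt wording are all assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Crude "retrieval" over a codebase: score each Java file against the question,
// keep the top few, and build the augmented prompt the LLM would actually see.
public class NaiveCodeRetriever {

    static Set<String> tokens(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(t -> t.length() > 2)
                .collect(Collectors.toSet());
    }

    // Fraction of the query's tokens that also appear in the file.
    static double overlap(Set<String> fileTokens, Set<String> queryTokens) {
        long shared = queryTokens.stream().filter(fileTokens::contains).count();
        return queryTokens.isEmpty() ? 0 : (double) shared / queryTokens.size();
    }

    static String readQuiet(Path p) {
        try { return Files.readString(p); } catch (IOException e) { return ""; }
    }

    public static void main(String[] args) throws IOException {
        String question = "Where do we parse the measurements file?";
        Set<String> queryTokens = tokens(question);

        Map<Path, Double> scores = new HashMap<>();
        try (Stream<Path> files = Files.walk(Path.of("src"))) {
            for (Path p : files.filter(f -> f.toString().endsWith(".java")).toList()) {
                scores.put(p, overlap(tokens(readQuiet(p)), queryTokens));
            }
        }

        String context = scores.entrySet().stream()
                .sorted(Map.Entry.<Path, Double>comparingByValue().reversed())
                .limit(3)
                .map(e -> "// File: " + e.getKey() + "\n" + readQuiet(e.getKey()))
                .collect(Collectors.joining("\n\n"));

        String prompt = "Using this code as context:\n" + context
                + "\n\nAnswer: " + question;
        System.out.println(prompt); // hand this to whichever LLM you are using
    }
}
```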
Parsing
Even with a RAG-like approach, the tool has to be designed to maximise context. Reading source code as plain text is quite possible, but the better approach is to parse the source code intelligently and create suitable vectors for the AI to work with. Without this level of code analysis, the AI is hampered and limited in what it can do for you in code-specific ways. Programming languages like Java often have subtle edges that LLMs will struggle with. Implicit behaviours, where some small but essential context needs to be understood, can easily be drowned out by the LLM’s general training.
Look for AI dev tools that specifically parse the code into language-aware vectors. Tools that treat code as plain text are suboptimal.
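As a sketch of what ‘parsing rather than plain text’ can look like, here is a hypothetical example using the open-source JavaParser library (`com.github.javaparser:javaparser-core`). It splits a source file into method-level chunks that keep the class name, signature and Javadoc attached; each chunk is what you would then embed, rather than the raw file. This is one possible approach, not how any particular tool does it.

```java
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.body.MethodDeclaration;

import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Split a Java source file into method-level chunks suitable for embedding.
public class MethodChunker {

    record Chunk(String id, String text) {}

    static List<Chunk> chunk(Path javaFile) throws IOException {
        CompilationUnit cu = StaticJavaParser.parse(javaFile);
        String className = cu.getPrimaryTypeName().orElse(javaFile.toString());

        List<Chunk> chunks = new ArrayList<>();
        for (MethodDeclaration m : cu.findAll(MethodDeclaration.class)) {
            String javadoc = m.getJavadocComment()
                    .map(c -> c.getContent())
                    .orElse("");
            String body = m.getBody().map(Object::toString).orElse("");
            // Keep class + method name as the chunk id so retrieval results
            // can point back to a specific place in the codebase.
            chunks.add(new Chunk(
                    className + "#" + m.getNameAsString(),
                    javadoc + "\n" + m.getDeclarationAsString() + "\n" + body));
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        for (Chunk c : chunk(Path.of(args[0]))) {
            System.out.println("=== " + c.id() + " ===");
            System.out.println(c.text());
        }
    }
}
```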
Integration
How do you interact with the AI in the IDE? Where are the buttons and controls, and what do they do? If all you have is a Chatbot panel that leads back to a cloud-based LLM, then you’re missing out on the potential of having an AI pair programmer that can improve your productivity.
With an AI that is tightly integrated, easy to use, and armed with good context on your codebase, the magic can happen. Whether it’s creating unit tests, generating docs or helping you grok the code, a seamless experience is essential. Clunky integrations are at best frustrating and at worst disruptive.
Take time to explore the AI integration with your IDE. How easy is it to generate code and ask code-specific questions? Does the AI learn your style, etc? Could you imagine this tool being your pair programmer? Can you use keyboard shortcuts and be productive, or do you have to switch contexts to interact?
Final thoughts
As I said at the beginning, I’m done with trying this code-gen approach. I plan to use AI dev tools as assistants for some new projects, and I’ll let you know how that goes.
Which Tools to Use?
I have used LLMs directly and tried Amazon Q Developer, Copilot and other AI dev tools, and it’s a mixed bag as far as the criteria above are concerned. Two stand out, though: Zencoder and JetBrains AI Assistant.
In a later article, I also plan to talk to at least JetBrains and Zencoder about their AI IDE tools, what these companies are working to achieve, and where the space is heading. What can we expect from AI dev tools over 2025?
In the meantime, all I can suggest is you give them a try. Both have good context awareness and good integration. They are not identical in capabilities, so it’s worth trying both.
More from me here