
1 Billion Record Challenge – AI Style

Steve Poole

Previously …

I recently wrote about my experiences with getting AI coding tools to find bugs and write tests.  I picked five different tools. The selection criteria were simple – free or trial with no credit card needed.  I chose three tools designed to help developers and two general-purpose LLMs.  

I gave them a 1000-line Java program and asked them to do three things.

  • Add Javadoc to explain what the code did
  • Analyse the code for bugs and design weaknesses
  • Create JUnit test cases that would provide reasonable code coverage.

The results were disappointing but, in hindsight, predictable.  Overall, the tools were better at creating documentation than finding bugs and, apart from one, were pretty poor at writing test cases.  

This is understandable because general-purpose LLMs like ChatGPT or Claude.ai (the two I chose for the experiment) can read the code as text and are very good at producing human-readable (and understandable) text. They may make a mistake in the wording and even hallucinate some details. But at this first stage, they are not being asked to create precise, compilable code. I may forgive them for the English; the compiler won’t forgive them for incorrect code.

Text is Better

Asking the tools to return textual interpretations works significantly better than asking for technical, compilable code. Whether it’s helping to understand the code, find bugs, or identify design weaknesses, LLMs make great code reviewers – but so far, not great coders.

Odd one out

Quite amazingly, one tool, JetBrains AI Assistant, actually produced compilable unit tests. The code it produced wasn’t particularly useful, but it compiled and looked like a test case! Given the failure of all the others to produce compilable code, it was a massive win for JetBrains. 

Not giving up 

These tools showed the initial strengths and weaknesses of AI tools in this space. The LLMs are certainly worth using as code reviewers and code explainers right now, but I want to see what it takes to get closer to my original concept: write docs, find bugs, and write tests. If JetBrains AI Assistant is any measure, then it may be possible, but I’ll have to take a different approach.

Next Steps

First, I’m going to use AI to its full advantage. That means I’ll interact with AIs and work with them to achieve the goal. Rather than treating AIs as a black box, single-shot mechanism, I’ll try to have conversations with them about how to get the needed output, and if I learn something from talking to one AI, I’ll revisit earlier ones.  I’ll also set a more straightforward goal from the code point of view.  Hence, the 1BRC reference.  Intrigued?  Read on.

Billion Records Challenge 

If you’re a developer, especially a Java dev, I’m sure you saw this from Gunnar Morling: a challenge to write a program to read one billion records and produce a simple statistical summary.  Gunnar’s challenge is to create the fastest code you can. I aim to get the AIs to create code that compiles and works. Performant code will be an unlooked-for bonus. 

Getting the Prompt right

I started with ChatGPT and iterated a prompt. It’s mostly from Gunnar’s blog, but with a few extras at the end to be specific about a few things.

Your mission, should you decide to accept it, is deceptively simple: write a Java program for retrieving temperature measurement values from a text file and calculating the min, mean, and max temperature per weather station. There’s just one caveat: the file has 1,000,000,000 rows!

The text file has a simple structure with one measurement value per row:

Hamburg;12.0

Bulawayo;8.9

Palembang;38.8

St. John's;15.2

Cracow;12.6

…

Note that records that begin with a hash (‘#’) are to be treated as comments and ignored.

The program should print out the min, mean, and max values per station, alphabetically ordered like so:

{Abha=5.0/18.0/27.4, Abidjan=15.7/26.0/34.1, Abéché=12.1/29.4/35.6, Accra=14.7/26.4/33.1, Addis Ababa=2.1/16.0/24.3, Adelaide=4.1/17.3/29.7, ...}

The goal of the 1BRC challenge is to create the fastest implementation for this task, and while doing so, explore the benefits of modern Java and find out how far you can push this platform. So grab all your (virtual) threads, reach out to the Vector API and SIMD, optimize your GC, leverage AOT compilation, or pull any other trick you can think of.

There are a few simple rules of engagement for 1BRC:

Any submission must be written in Java

No external dependencies may be used

There must be a comment to identify the version of Java being used.

The input file name will be provided as the first argument on the command line

Make sure all Java code includes all necessary import statements.

Give the class a one-word package name based on your name

Include Javadoc and other comments to explain what the program does
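
For reference, here is roughly the shape of program the prompt above is asking for. This is my own minimal, single-threaded sketch (assuming Java 21, with the hypothetical package name “poole”), not one of the AI outputs, and it makes no attempt at 1BRC-level performance:

package poole;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/**
 * Reads "station;temperature" rows from the file named as the first
 * command-line argument, skips '#' comment lines, and prints min/mean/max
 * per station, alphabetically, e.g. {Abha=5.0/18.0/27.4, ...}.
 * Assumes Java 21.
 */
public class CalculateAverage {

    /** Running aggregate for one weather station. */
    static final class Stats {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0;
        long count = 0;

        void add(double v) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
            count++;
        }

        @Override
        public String toString() {
            return String.format(Locale.ROOT, "%.1f/%.1f/%.1f", min, sum / count, max);
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, Stats> stats = new TreeMap<>(); // TreeMap keeps stations alphabetical
        try (var lines = Files.lines(Path.of(args[0]))) {
            lines.filter(line -> !line.isBlank() && line.charAt(0) != '#') // skip comments
                 .forEach(line -> {
                     int sep = line.indexOf(';');
                     String station = line.substring(0, sep);
                     double value = Double.parseDouble(line.substring(sep + 1));
                     stats.computeIfAbsent(station, k -> new Stats()).add(value);
                 });
        }
        System.out.println(stats.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(", ", "{", "}")));
    }
}

Compile with javac and run as java poole.CalculateAverage measurements.txt. The AIs’ job was essentially to produce something of this shape, and ideally something much faster.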

If at first you don’t succeed…

The first set of code generated by ChatGPT was almost good enough, but I added the last five items to the prompt to get the code into the form I wanted: a package name to help me keep the AI output separate, a nod to providing doc, a reminder to make sure all the imports for classes were present (a bit hit and miss at the beginning), and a comment to identify which version of Java was being used. That last item was because the AI told me the version info on each loop round, and it would change: I had code using Java 15 and then Java 22.

After a while, I had this from ChatGPT: functional code that compiled. I had a large test file to hand, so I created a quick test case and ran it. I got reasonable output.

The next AI on the list, Codeium, produced even better results. Bonus: while ChatGPT interpreted the request to specify a package based on “your name” literally, Codeium was inventive and called itself “Kumar”!

Next was Amazon Q Developer. The first time I gave it the prompt, I received this back:

 “I apologize, but your request seems to be outside my domain of expertise. However, I'm happy to try discussing related topics that I may have more information on. How can I help further our conversation productively?”  

Since Amazon Q Developer obviously didn’t want to accept the mission, I dropped the first few sentences from the prompt and tried again.  The result was, yet again, compilable and runnable code.

My free licence with JetBrains AI Assistant had expired, so I looked around for something else. I had just got access to Google’s Gemini, so I gave it a go. The results were initially disappointing: uncompilable code. 🙁 Eventually, though, we got some useful results. I tried Microsoft CoPilot, and AI honour was restored: compilable, runnable code with output first go! Though MS CoPilot has no sense of humour and used “yourname” as a package name.

Down the rabbit hole

Unlike my first foray, this time the AIs generated all the code, and by interacting with them, I could get it to compile and produce the right answers. Revisiting my original sample code and attempting to get better doc or JUnit test cases, however, still failed repeatedly.

Why did this second attempt work better?

The base problem here is quite simple, which makes the code more straightforward. So is the issue more about scale? Did I exceed the AI’s memory, or is a simple one-line prompt as a starter just too little guidance for an AI, even when it might seem a reasonable request to you and me? I did take another look at Google Gemini and tried to get it to fix its code problems, but I couldn’t make any headway against its fixation with a non-existent class. Much as this was a failure, it again highlighted the power and weaknesses of these tools, and that they are not all the same. I fixed the code myself to get it compiled.

Before we draw conclusions, let me complete this story.

I used ChatGPT to generate a small number of test records and predict the output format (plus I did the sums myself). It was correct, so another win for the AI.
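
For illustration (these are my own made-up numbers, not the actual records ChatGPT produced), a five-line test file such as:

Hamburg;12.0
Hamburg;8.0
Bulawayo;8.9
# comment lines like this one should be ignored
Bulawayo;10.9

should, once you do the sums by hand, produce:

{Bulawayo=8.9/9.9/10.9, Hamburg=8.0/10.0/12.0}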

Using these test records against the generated code, I was pleased to see the following:

Codeium               100% Accurate
Amazon Q Developer    100% Accurate
Microsoft CoPilot     100% Accurate
OpenAI ChatGPT        100% Accurate
Google Gemini         100% Accurate

As I’ve said, I’m comfortable with ChatGPT (and others) as code reviewers. They are a great help and spot edge cases you might miss: a great second pair of eyes. I wouldn’t be happy with them as the sole reviewer, though. With the right sort of detailed prompt, I could easily see using these tools to automatically assess and filter PRs for completeness and accuracy to guidelines.  

When I gave ChatGPT all the code generated by the AIs and asked for comparisons in terms of likely performance, bugs, efficiency, etc., the analysis was reasonable, though I’ll leave you to decide how good/bad/baroque the generated code really is.

Does AI win a 1BRC Prize?

Beyond accuracy, the real question is whether the AIs can create performant code. 

The 1BRC challenge repo has all the submitted code and the scores.  I generated a 1 Billion Record file and used it against the AI-generated code. I also picked a couple of the top-scoring solutions from the challenge and the baseline test case too. Note that I did not compile any of this code to native.
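
Generating the input file itself is straightforward. The 1BRC repository ships its own data generator drawing on a much larger station list; a quick home-grown equivalent (my own sketch, using a hypothetical cut-down set of stations) looks something like this:

package poole;

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.concurrent.ThreadLocalRandom;

/** Writes N random "station;temperature" rows for benchmarking. Assumes Java 21. */
public class CreateMeasurements {
    // Hypothetical cut-down station list; the real 1BRC generator uses far more stations.
    private static final String[] STATIONS = {
            "Hamburg", "Bulawayo", "Palembang", "St. John's", "Cracow", "Abha", "Accra"
    };

    public static void main(String[] args) throws IOException {
        long rows = Long.parseLong(args[0]);   // e.g. 1000000000
        Path out = Path.of(args[1]);           // e.g. measurements.txt
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        try (BufferedWriter w = Files.newBufferedWriter(out)) {
            for (long i = 0; i < rows; i++) {
                String station = STATIONS[rnd.nextInt(STATIONS.length)];
                double temp = rnd.nextDouble(-30.0, 50.0); // plausible temperature range
                // Slow but simple; the file only needs generating once.
                w.write(String.format(Locale.ROOT, "%s;%.1f\n", station, temp));
            }
        }
    }
}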

Unfortunately, the Microsoft CoPilot, Google Gemini, and ChatGPT code could not handle the quantity of data. They all failed. I went back to the AIs and attempted to improve the code. That was a flat bust with Microsoft CoPilot and Google Gemini, while the new ChatGPT version did complete.

Elapsed timings

Testcase              Source                Run 1      Run 2      Run 3
Thomas Wue            1BRC Winner           16.45s     16.41s     16.33s
Gonix                 1BRC top performer    18.29s     17.36s     16.92s
Claude.AI             AI-Generated          25.61s     27.10s     26.90s
ChatGPT V2            AI-Generated          80.16s     83.06s     79.50s
Codeium               AI-Generated          168.35s    170.03s    167.74s
Baseline              1BRC Baseline         173.69s    174.55s    173.28s
Amazon Q Developer    AI-Generated          199.00s    197.39s    198.74s

Here are the results from three runs of the code. Thomas Wue is shorthand for the winning 1BRC entry, while Gonix is shorthand for a laudable runner-up. Baseline, as you can guess, is the 1BRC entry for ‘average.’

Summary

Based on these results, it’s hard not to draw conclusions.

The obvious takeaway is that a general-purpose LLM, Claude.AI, produced some quite respectable code, while a Coding AI, Amazon Q Developer, produced code that was less performant than the baseline. However, this is not the complete story. It’s not that simple.

This has been a real voyage of discovery. The naive challenge I placed on the AIs at the start, and the subsequent search for something of value from them (at least in the coding space), have shown me that answering the initial question, “Are AI coding tools worth it?”, is quite a nuanced challenge.

While I’ve been using AIs to try to generate code, documents, and test cases, I’ve also been using them successfully as advisers, code-completers, and analysts.

To get the most out of AI tools, we need to think differently about how we interact with them. In part 1 of this series, I discussed my experience with early speech-to-text software, how I infused it with sentience, and how I became angry when it didn’t behave like a person.

The challenge is that the way I wanted to interact with these AIs doesn’t work. It’s not how they operate. I have to change my thinking and work with the tools.

Using AIs to help with coding inside the IDE works well because the AI has good context and can offer advice you can ignore. It’s hard to ignore uncompilable code if it’s presented to you all in one go, but small snippets presented in real time are valuable and improve your productivity. The key is that, as a developer, you’ll assess the AI output piecemeal and can disregard anything you don’t want. Some of the code snippets produced by Amazon Q Developer have been breathtaking in their usefulness and accuracy.

Code Hallucinations?

Once we look at AIs at a broader level, at code generation at scale or code understanding across a project, it’s clear we’re not there yet. Tweaking and guiding the generation of code or doc can be challenging. Asking for doc is much more successful than requesting code, but in both cases there is still a significant element of error to deal with – at some point, you have to decide to give up on adjusting the AI prompt and dive in and fix the output.

Continuous incremental adjustment of the prompt doesn’t seem feasible (come back to the AI in a new session, and you’re likely to get a totally new set of code). In some cases, such as the generation of test cases, I wonder if it is sensible to expect the AI to be able to do this.

A significant amount of reasoning must be done to create complex code or comprehensive test cases, and I don’t see that happening yet. We might forgive the wordplay in the doc – we can’t forgive coding errors.

Are AI coding tools worth it?

The AI agents in the IDE are helpful but can get in my way. I love their more comprehensive awareness of context and the code snippets offered. I don’t want to turn them off as I expect (hope) they will get better. Interacting with these Code AI tools like I interact with ChatGPT feels wrong.

Maybe I’m missing some point-and-click elements – can I click on a class in IntelliJ and get nice, accurate Javadoc generated? Can I click and get test cases generated that won’t hallucinate methods and fields? If that’s possible, I’d like to know.

Otherwise, I’m going to judiciously continue using the free Code AI helpers while relying on ChatGPT and Claude.AI to help with everything else.

AI Coding tools do offer value but not necessarily (yet) in all the ways we might want.

As an aside, I’m not currently worried that AI is going to take my job, but when and if it can generate code that compiles, writes good documentation, produces good test cases, and has all the other production values, well, even then, I won’t worry—I’ll become a prompt engineer instead. 
