
Are AI Coding tools worth it – RESULTS

Steve Poole
Background

In Part 1, I examined the value of AI tools for developers, especially those that help them create or understand code. The tools selected are all free to use and a mix of general-purpose ones and those that declare they are specifically for coding purposes.

In Part 2, I examined these tools’ ability to find bugs and weaknesses. The results were spotty and not particularly useful or time-saving.

In this part, we’ll look at the results from asking these tools to create JUnit tests for the code and decide who wins (and why).

Tools being evaluated

1: ChatGPT https://chatgpt.com/

2: Claude https://claude.ai/

3: Amazon Q Developer https://aws.amazon.com/q/developer/

4: Codeium https://codeium.com/

5: JetBrains AI Assistant https://www.jetbrains.com/ai/

Testcases de Jure

After asking the AIs to explain the code and assess it for bugs, the plan was to see how well they could create test cases. Assuming they were good at reading code and understanding its objectives, surely writing test cases was feasible?

So far, the general-purpose LLMs, ChatGPT and Claude, have been as good as, if not better than, the focused tools. Good, of course, is relative. From a developer’s point of view, it’s not only about the accuracy of the generated material; it’s about how much value it adds versus the time spent evaluating and correcting the output.

The tools generated between 8 and 14 test cases each. I’m not going to publish them all, because the failures are repetitive.

Overall, the results were largely unsurprising. I got what I expected given the results so far: all the tools can generate good JUnit boilerplate (including setup code), but almost all of them were prone to writing code that tried to access methods and fields that were out of scope or didn’t exist.

Code hallucinations

Code hallucinations include accessing methods that don’t exist. In the code below, the set and get methods don’t exist, and there is no standalone ConstantPool class to instantiate (it’s an inner class).

    @Test
    void testFieldName() {
        // Assuming we can set the name index and constant pool
        hsClassFile.setConstantPool(new ConstantPool()); // Mock ConstantPool
        hsClassFile.getConstantPool().setUTF8(1, "testField");
        assertEquals("testField", field.name());
    }

Note also the “// Assuming” comment. Many of the tools take that approach, as in this setup method:

    @BeforeEach
    void setUp() {
        hsClassFile = new HSClassFile(); // Assuming there's a constructor
        field = hsClassFile.new Field(); // Assuming we can create Field instances this way
        method = hsClassFile.new Method();
    }

Surprise

My code includes a builder. It’s impossible to create any of the main classes without using it. All the tools spotted this, but they used it inconsistently. Amazon Q Developer and ChatGPT created spot-on Javadoc but failed to produce test cases that used the builder. Claude mentioned the builder but created no Javadoc example. However, it did generate test code using the builder. Codeium’s Javadoc was awful and just wrong. Its JUnit code tried to use the builder pattern but failed by calling methods that didn’t exist.

Surprise! JetBrains AI Assistant generated JUnit code that compiled and worked. Its Javadoc output was mostly wrong, but hey, it generated useful test code.
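To make the builder point concrete, here is a minimal sketch of a builder-only class. The names SketchClassFile, builder(), major() and build() are hypothetical stand-ins chosen for illustration; they are not the real HSClassFile API. The idea is simply that a private constructor forces every caller, including generated tests, to go through the builder.

```java
// Hypothetical stand-in for a builder-only class; not the real HSClassFile.
final class SketchClassFile {
    private final int major;

    // Private constructor: `new SketchClassFile(...)` does not compile
    // anywhere outside this top-level class.
    private SketchClassFile(int major) {
        this.major = major;
    }

    int major() {
        return major;
    }

    // The only public way to start creating an instance.
    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private int major = 45; // arbitrary default chosen for this sketch

        Builder major(int major) {
            this.major = major;
            return this; // return the builder so calls can be chained
        }

        SketchClassFile build() {
            // Nested classes may call the enclosing class's private constructor.
            return new SketchClassFile(major);
        }
    }
}
```

Under these assumptions, a test has to write something like SketchClassFile.builder().major(52).build(); a setup method that calls a constructor directly would fail at compile time, which is the kind of failure most of the generated tests ran into.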

Code Understanding

My efforts have been to understand these tools’ capabilities as developer aids beyond the auto-complete style we see in IDEs. Mostly, the results have been underwhelming. However, the JetBrains AI Assistant test cases show that there is promise here.

Both ChatGPT and JetBrains AI Assistant give a glimmer of more potential. Both created test cases similar to this:

    @Test
    public void testBuildClassFile() {
        HSClassFile classFile = builder
                .major(52)
                .build();

        assertEquals("Java 8", classFile.javaVersion());
    }

The code they had access to doesn’t contain the mapping from class file major version to Java version, and the output from my code wouldn’t be “Java 8”. Yet somehow they picked up the relationship that major version 52 corresponds to Java 8.
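For reference, the relationship the tools inferred can be sketched as below. The class ClassFileVersions and its javaVersion method are hypothetical helpers written for this illustration, and the "Java N" output format is an assumption; none of it comes from the code under test. The mapping itself is the standard one: major versions 45 to 48 correspond to JDK 1.1 to 1.4, and from major version 49 (Java 5) onwards the release number is the major version minus 44.

```java
// Hypothetical helper for illustration; not part of the code under test.
final class ClassFileVersions {
    private ClassFileVersions() {}

    /** Maps a class file major version to a readable Java release name. */
    static String javaVersion(int major) {
        if (major < 45) {
            throw new IllegalArgumentException(
                    "Unknown class file major version: " + major);
        }
        if (major <= 48) {
            // 45 -> 1.1 (also used by 1.0.2), 46 -> 1.2, 47 -> 1.3, 48 -> 1.4
            return "Java 1." + (major - 44);
        }
        // From Java 5 (major version 49) onwards: release = major - 44
        return "Java " + (major - 44);
    }
}
```

With this sketch, javaVersion(52) yields "Java 8", the value the generated assertion expected even though nothing in the supplied code said so.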

Summary: The winners are …

Well, that’s sort of hard to decide. In this particular exercise, there are at least two. In other circumstances, the winners might be different. For this exercise, we’ll declare ChatGPT and JetBrains AI Assistant the winners (ChatGPT for promising doc generation and code understanding, JetBrains AI Assistant for test code that compiles).

It’s likely that, in practice, we’ll need multiple tools to cover the requirements I placed on them; using one tool to do everything is probably not optimal. It’s also sensible to spend time creating a much better and more comprehensive prompt.

It’s pretty amazing that we got this far given three one-line prompts. I’m somewhat in awe of these tools, and it’s unfair to make any judgements based on this one example.

One-liner prompts don’t cut it, and although the results of this exercise have not been pretty, that’s as much my fault as the tools’.

Next time, my requirements must be more precise and detailed. I also want to see if having the AI generate the code (rather than trying to understand it) will make a difference, and I have to be prepared to have a conversation with the AI and iterate.
