LangChain4j Production-ready features with Quarkus

Mohamed Ait Abderrahman

Have you already read the article “Build AI Apps and Agents in Java: Hands-On with LangChain4j” by Lize Raes in the “30 Years of Java” edition (part 1) of this magazine? If so, you’re already familiar with what LangChain4j is for.

If not, don’t worry: the link is included at the end of this article. I recommend reading it before continuing with this one.

In this article, we’ll explore and demonstrate important LangChain4j production-ready features using Quarkus, a lightweight Java framework that integrates seamlessly with LangChain4j.

What features will we cover?

We’ll focus on implementing “guardrails” (an integral part of maintaining secure and reliable interactions with LLMs), testing our AI-infused applications, and setting up proper observability.

Let’s get started!

Setting Up the Project

Before we dive in, let’s prepare our Quarkus project.

1- Clone the reference project

Let’s check out this repo:

git clone https://github.com/anunnakian/langchain4j-new-features
cd langchain4j-new-features

The repository contains every snippet shown in this article, wired for a Maven build. You can also check out the start branch to follow along and code each step with us.

git checkout start

2- Target platform versions

To ensure compatibility with the examples in this article, make sure your local environment uses these versions:

  • Java 25
  • Maven

3- Add the LangChain4j Quarkus extension

This Quarkus extension bundles the LangChain4j 1.5.0 core library, provides CDI integration, and configures models, metrics, and tracing without requiring additional code.

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-openai</artifactId>
    <version>1.3.0</version>
</dependency>

4- Run the application

./mvnw quarkus:dev

Now, try accessing http://localhost:8080. You should see the following UI:

Note: I’m using Qute, Quarkus’ templating engine, to render the (cute) UI. The codegen.data() method sends data to the view.

With the project setup complete, we’re ready to explore our defense against hallucinations and unsafe inputs, Guardrails!

Guardrails

LangChain4j Guardrails work like safety checks that make sure what goes into and comes out of LLMs meets certain rules. They catch made-up information, stop harmful inputs, and make sure responses follow the right format. This helps keep your applications reliable, trustworthy, and aligned with your business rules.

Guardrails come in two main types:

  • Input Guardrails: These validate user inputs before they are sent to the LLM
  • Output Guardrails: These validate the responses returned by the LLM

Now, let’s explore practical examples of implementing them.

Guardrails in Action

AI will never completely replace developers; that’s a fundamental truth.

Nevertheless, let’s create an application that explores this possibility. Why not! If we don’t do it, someone else will.

First, let’s create our Java Developer AI service in Quarkus:

// Java Developer v1
// src/main/java/com/javapro/langchain4j/guardrails/v1/JavaDeveloper.java
import dev.langchain4j.service.SystemMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

@RegisterAiService
@SystemMessage("""
    You are an expert Java developer.
    You Generate Java code as a raw code without any markdown or code fences or code block.
    """)
public interface JavaDeveloper {

    String writeCode(String request);
}

To create an AI service with LangChain4j in Quarkus, annotate your interface with @RegisterAiService. By default, this creates a request-scoped bean; in later versions of this service we add @ApplicationScoped to make it available application-wide.

Next, define methods that will invoke the LLM. In this example, the writeCode method takes a request as a string and returns generated code as a string. The @SystemMessage annotation provides essential context to guide the LLM’s response.

Now, let’s configure our model in the application.properties file by setting up OpenAI’s GPT as our chat model:

# src/main/resources/application.properties
quarkus.langchain4j.openai.api-key=${OPENAI_API_KEY}

As you can see, we declare OPENAI_API_KEY as an environment variable, which is more secure than hardcoding the key. You can also specify the model you want to use and the model provider. The default GPT model used here is gpt-4o-mini, which at the time of writing costs $0.55 per 1M input tokens, $0.138 per 1M cached tokens, and $2.20 per 1M output tokens.
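If you prefer to pin the model explicitly instead of relying on the default, the extension exposes configuration properties for that. Here is a minimal sketch; the property names are taken from the Quarkus LangChain4j configuration reference, so double-check them against your extension version:

```properties
# src/main/resources/application.properties
quarkus.langchain4j.openai.api-key=${OPENAI_API_KEY}
# Optional: pin the chat model explicitly instead of relying on the default
quarkus.langchain4j.openai.chat-model.model-name=gpt-4o-mini
```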

You can add this configuration to display the LLM interactions in your logs:

# src/main/resources/application.properties
quarkus.langchain4j.log-requests=true
quarkus.langchain4j.log-responses=true

Alright, let’s test it!

Modify the DeveloperController.generate() method to use our AI service:

// src/main/java/com/javapro/langchain4j/DeveloperController.java
@POST
@Consumes(MediaType.APPLICATION_FORM_URLENCODED)
@Produces(MediaType.TEXT_HTML)
public TemplateInstance generate(@FormParam("prompt") String prompt) {
   try {
       String result = ai.writeCode(prompt);
       return codegen.data("prompt", prompt, "result", result, "error", null);
   } catch (GuardrailException e) {
       return codegen.data("prompt", prompt, "error", extractUserMessage(e), "result", null);
   }
}

Let’s try this request:

Generate a Java function that uses the ls command to display a list of files

It works. Our AI Java Developer is ready to use! Woohoo!

Examining the output, we can see that the generated code executes the “ls” command using the ProcessBuilder class. This is a potential issue: your team may have rules prohibiting the use of system commands like “ls” in your project, since such commands can introduce significant security vulnerabilities into your application.

Since we’re using AI, we should expect intelligent responses that adhere to security best practices. It should recognize potential security risks and avoid generating vulnerable code.

This is where guardrails become essential.

Output Guardrails

Output guardrails, as we said before, can catch potentially problematic text after the LLM generates a response but before it reaches the user.

In our case, they function as security checkpoints that filter out risky code and ensure that generated code adheres to our security protocols and coding standards.

Here’s how to implement our output guardrails:

// src/main/java/com/javapro/langchain4j/guardrails/v2/CodegenOutputGuardrail.java
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.guardrail.OutputGuardrail;
import dev.langchain4j.guardrail.OutputGuardrailResult;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class CodegenOutputGuardrail implements OutputGuardrail {

    @Override
    public OutputGuardrailResult validate(AiMessage aiMessage) {
        String code = aiMessage.text();

        if (code.contains("ProcessBuilder") || code.contains("exec(")) {
            return fatal("Generated code must not use ProcessBuilder or Runtime class.");
        }

        return OutputGuardrailResult.success();
    }
}

Simple, isn’t it ?

We implemented the OutputGuardrail interface and defined our validation logic in the validate() method. This method examines the LLM’s response for risky patterns and returns an appropriate OutputGuardrailResult. In our implementation, we check whether the generated code uses the ProcessBuilder class or an exec() call (such as the forbidden Runtime.exec()). If either condition is true, the guardrail blocks the response by calling the fatal() method with an explanatory error message as its argument.

Now, let’s update our AI Java Developer service to use the output guardrail:

// Java Developer v2
// src/main/java/com/javapro/langchain4j/guardrails/v2/JavaDeveloper.java
import dev.langchain4j.service.SystemMessage;
import io.quarkiverse.langchain4j.RegisterAiService;
import dev.langchain4j.service.guardrail.OutputGuardrails;

@RegisterAiService
@SystemMessage("""
    You are an expert Java developer.
    You Generate Java code as a raw code without any markdown or code fences or code block.
    """)
public interface JavaDeveloper {

    @OutputGuardrails(CodegenOutputGuardrail.class)
    String writeCode(String request);
}

To implement this, simply annotate your method with the @OutputGuardrails annotation and specify your OutputGuardrail class as a parameter. That’s all you need to do!

Let’s test the same prompt as before with our output guardrails in place:

./mvnw clean quarkus:dev

Now, the system prevents code generation that uses risky code and returns an error message instead!

Should we consider our job complete?

Not quite, and here’s why:

While we’ve successfully blocked non-compliant code from reaching users, we’re still invoking the LLM each time someone attempts to generate code that uses system commands like “ls” or “ps”. This means we’re paying for tokens with every request, regardless of whether it succeeds or fails.

This creates a financial vulnerability. If someone discovers this weakness, they could deliberately drain your resources by repeatedly requesting code generations where the corresponding generated code will ultimately be rejected.

Input Guardrails offer a solution to this problem.

Input Guardrails

Input guardrails, just like output guardrails, help us catch potentially dangerous requests, but this time before the LLM is even invoked.

Here is how to implement one:

// src/main/java/com/javapro/langchain4j/guardrails/v3/CodegenInputGuardrail.java
import dev.langchain4j.guardrail.InputGuardrail;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.guardrail.InputGuardrailResult;
import jakarta.enterprise.context.ApplicationScoped;

import java.util.List;

@ApplicationScoped
public class CodegenInputGuardrail implements InputGuardrail {

	// Forbidden system commands, matched as whole words: a plain contains("ls")
	// would also flag harmless words like "also" or "tools"
	private static final List<String> commands = List.of("ls", "ps", "top", "whoami");

	@Override
	public InputGuardrailResult validate(UserMessage message) {
		String text = message.singleText();
		for (String cmd : commands) {
			if (text.matches("(?s).*\\b" + cmd + "\\b.*")) {
				return fatal("Detected use of system command: " + cmd);
			}
		}
		return InputGuardrailResult.success();
	}
}

The implementation checks whether the user’s prompt contains Linux system commands such as “ls”. When one of these is detected, the guardrail immediately returns a fatal result with an appropriate error message, preventing the request from ever reaching the LLM.

This approach saves on token consumption costs and strengthens our application’s security posture by failing fast.

Let’s add it to our AI service:

// Java Developer v3
// src/main/java/com/javapro/langchain4j/guardrails/v3/JavaDeveloper.java
import com.javapro.langchain4j.guardrails.v2.CodegenOutputGuardrail;
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.guardrail.InputGuardrails;
import io.quarkiverse.langchain4j.RegisterAiService;
import dev.langchain4j.service.guardrail.OutputGuardrails;

@RegisterAiService
@SystemMessage("""
        You are an expert Java developer.
        You Generate Java code as a raw code without any markdown or code fences or code block.
        """)
public interface JavaDeveloper {

    @InputGuardrails(CodegenInputGuardrail.class)
    @OutputGuardrails(CodegenOutputGuardrail.class)
    String writeCode(String request);
}

Let’s try our preferred prompt request:

Generate a Java function that uses the ls command to display a list of files

At this point, we can say that we’re happy!

Let’s go further. If a user of our AI Java Developer (or should I say a human developer) passes personal information or secrets like a password to our generator, we should avoid sending that to the LLM, right? The first solution that comes to mind is to modify our input guardrail to implement this rule. But wait, what about the SOLID principles? Specifically, the Single Responsibility Principle: we shouldn’t give more than one responsibility to a class. So we need another input guardrail!

Hear me out: LangChain4j supports multiple input and output guardrails. Let’s dive in.

Multiple Guardrails

Let’s implement an additional input guardrail to ensure the prompt doesn’t include passwords:

// src/main/java/com/javapro/langchain4j/guardrails/v4/PasswordProtectorInputGuardrail.java
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.guardrail.InputGuardrail;
import dev.langchain4j.guardrail.InputGuardrailResult;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class PasswordProtectorInputGuardrail implements InputGuardrail {

    @Override
    public InputGuardrailResult validate(UserMessage message) {
        String request = message.singleText();

        if (PasswordDetector.detectPasswordsInText(request)) {
            return fatal("You have tried to send sensitive data to the LLM. Be careful!");
        }

        return InputGuardrailResult.success();
    }
}

N.B.: You can find the PasswordDetector class source code in the repo; it helps us detect passwords in the prompt.

In this code, we validate that the prompt doesn’t contain a password. If it does, we raise an error and don’t call the LLM.
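The article doesn’t show PasswordDetector itself, so here is a minimal, hypothetical sketch of what such a helper might look like, using a simple regex heuristic. The repository’s actual implementation may be more sophisticated:

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a PasswordDetector; see the repository for the real implementation.
public class PasswordDetector {

    // Heuristic: the word "password" followed by a token,
    // e.g. "password javapro120049@d" or "password: secret"
    private static final Pattern PASSWORD_PATTERN =
            Pattern.compile("(?i)\\bpassword\\b\\s*[:=]?\\s*\\S+");

    public static boolean detectPasswordsInText(String text) {
        return PASSWORD_PATTERN.matcher(text).find();
    }
}
```

A real detector would likely also look for API keys, tokens, and other secret-like patterns.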

Let’s add it to our Java Developer :

// Java Developer v4
// src/main/java/com/javapro/langchain4j/guardrails/v4/JavaDeveloper.java
import com.javapro.langchain4j.guardrails.v2.CodegenOutputGuardrail;
import com.javapro.langchain4j.guardrails.v3.CodegenInputGuardrail;
import dev.langchain4j.service.SystemMessage;
import io.quarkiverse.langchain4j.RegisterAiService;
import dev.langchain4j.service.guardrail.InputGuardrails;
import dev.langchain4j.service.guardrail.OutputGuardrails;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
@RegisterAiService
public interface JavaDeveloper {

	@InputGuardrails({CodegenInputGuardrail.class, PasswordProtectorInputGuardrail.class})
	@OutputGuardrails(CodegenOutputGuardrail.class)
	@SystemMessage("""
		You are a helpful code generator. Generate Java code and return it as raw code
		without any markdown or code fences or code block.
		"""
	)
	String writeCode(String request);
}


Let’s try a risky prompt: Create a class repository to access to my database, here is the username john.doe and the password javapro120049@d

How do multiple guardrails work together?

They execute as a chain of calls. When a guardrail calls the fail() method, its error message is recorded and execution continues to the next guardrail in the chain.

When a guardrail calls fatal(), the system immediately throws an exception, halting execution.

When a guardrail passes validation, the next one in the chain executes. If any subsequent guardrail fails, you’ll still get an exception.

Below is a summary of possible guardrail return values, which are described in detail in the Guardrails article listed in the references section:

  • pass: Validation successful; processing continues.
  • fail: Invalid input/output; the next guardrail is still executed.
  • fatal: Halts the chain immediately and throws an error.
  • fatal with retry: Retries the same prompt.
  • fatal with reprompt: Sends a new prompt with guidance.
  • pass with rewrite: The rewritten output is used and passed to the next guardrail.

Possible outcomes of the guardrails’ validate() method and their effects (LangChain4j documentation): https://docs.langchain4j.dev/tutorials/guardrails/#input-guardrail-outcomes
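To make the chain semantics concrete, here is a small, illustrative pure-Java model of the dispatch logic described above. This is not LangChain4j’s code, just a sketch of the behavior: fail() outcomes are recorded and the chain continues, while a fatal() outcome halts the chain immediately:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of guardrail chain semantics (not the library's implementation).
public class GuardrailChainModel {

    public enum Kind { PASS, FAIL, FATAL }

    public record Outcome(Kind kind, String message) {}

    // Returns the accumulated error messages; stops at the first FATAL outcome.
    public static List<String> run(List<Outcome> outcomes) {
        List<String> errors = new ArrayList<>();
        for (Outcome o : outcomes) {
            switch (o.kind()) {
                case PASS -> { /* validation passed, move to the next guardrail */ }
                case FAIL -> errors.add(o.message());        // recorded, chain continues
                case FATAL -> { errors.add(o.message()); return errors; }  // halt the chain
            }
        }
        return errors;
    }
}
```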

Now, let’s consider a different approach. Rather than encoding every remaining team coding-convention rule myself, I want to use an AI service to guide my AI Java Developer. Is this possible?

Yes, it is.

How would that work?

Let me show you. In your guardrails, you can inject an instance of an AI service; this is the magic of Quarkus 🪄

Guardrails using AI Service

We need a new AI service that validates whether users clearly specify their code generation requests, to avoid unnecessary back-and-forth.

We want to delegate prompt verification to the LLM, checking whether the prompt clearly describes the class name and method names for the code we want to generate.

We want to avoid this kind of code:

Let’s create our new Quarkus AI service:

// src/main/java/com/javapro/langchain4j/guardrails/v5/CompletenessValidator.java
import dev.langchain4j.service.SystemMessage;
import io.quarkiverse.langchain4j.RegisterAiService;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
@RegisterAiService
public interface CompletenessValidator {

    @SystemMessage("You are a code generation prompt reviewer. " +
            "Check if the user prompt specifies a valid Java class or method name, " +
            "and describes what it should do. Reply with 'VALID' or 'INVALID'.")
    String isRequestComplete(String request);
}

Now, let’s implement this in our input guardrail:

// src/main/java/com/javapro/langchain4j/guardrails/v5/CompletenessInputGuardrail.java
import dev.langchain4j.guardrail.InputGuardrail;
import dev.langchain4j.guardrail.InputGuardrailResult;
import dev.langchain4j.data.message.UserMessage;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class CompletenessInputGuardrail implements InputGuardrail {

    CompletenessValidator completenessValidator;

    public CompletenessInputGuardrail(CompletenessValidator completenessValidator) {
        this.completenessValidator = completenessValidator;
    }

    @Override
    public InputGuardrailResult validate(UserMessage userMessage) {
        // Ask the model to verify prompt completeness
        var conversation = completenessValidator.isRequestComplete(userMessage.singleText());

        if (!"VALID".equalsIgnoreCase(conversation)) {
            return fatal(
                "Prompt is incomplete: please specify a class or method name and describe desired behavior."
            );
        }

        return InputGuardrailResult.success();
    }
}

The CompletenessInputGuardrail uses an injected CompletenessValidator service to validate user requests. In its validate() method, it asks the AI service to determine whether the request contains sufficient detail. When the validator returns anything other than VALID, the guardrail immediately halts processing, preventing incomplete requests from reaching the code generation phase.

This approach leverages the AI model itself for robust input validation, ensuring that your code generator service consistently receives well-formed requests that can produce useful results.

Let’s add this guardrail to our AI Java Developer:

// Java Developer v5
import com.javapro.langchain4j.guardrails.v2.CodegenOutputGuardrail;
import com.javapro.langchain4j.guardrails.v3.CodegenInputGuardrail;
import com.javapro.langchain4j.guardrails.v4.PasswordProtectorInputGuardrail;
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.guardrail.InputGuardrails;
import dev.langchain4j.service.guardrail.OutputGuardrails;
import io.quarkiverse.langchain4j.RegisterAiService;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
@RegisterAiService
public interface JavaDeveloper {

	@InputGuardrails({CodegenInputGuardrail.class, PasswordProtectorInputGuardrail.class, CompletenessInputGuardrail.class})
	@OutputGuardrails(CodegenOutputGuardrail.class)
	@SystemMessage("""
		You are a helpful code generator. Generate Java code and return it as raw code
		without any markdown or code fences or code block.
		"""
	)
	String writeCode(String request);
}

Now try it with a vague request: generate a java function that adds two number

This isn’t possible anymore, because our AI service doesn’t approve the request as a complete prompt.

Let’s modify it like this: Generate a class named Math that contains a function named “sumInteger” that adds two numbers

It works just fine !

However, our sophisticated AI Java Developer requires comprehensive testing to ensure it continues to function correctly when changes are made. Testing is essential to verify that our applications perform as expected and to prevent regressions. While testing our input and output guardrails is valuable, we must also determine how to effectively test the AI services themselves.

So how can we test non-deterministic applications?

Testing non-deterministic applications

AI-infused applications behave fundamentally differently from traditional Java applications: even with identical inputs, language models can produce different responses. Without robust testing, these variations can lead to incorrect or unexpected outputs that often only become apparent in production.

The quarkus-langchain4j-testing-scorer-junit5 Quarkus extension effectively addresses this challenge. By annotating your test class with @QuarkusTest and @AiScorer, you receive a configured Scorer object. This tool enables you to evaluate sample inputs against expected outputs, choose appropriate comparison methods (such as semantic similarity checks or AI-based evaluation), and run comprehensive tests. The extension provides clear scoring metrics and highlights successful and failed test cases.

This testing approach helps detect AI regressions early, ensuring our AI Java Developer remains reliable, continues to write clean code, and helps the team stay productive.

Let’s practice !

To test your AI-infused applications, you’ll need to add the following extension dependency to your project:

Maven:

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-testing-scorer-junit5</artifactId>
    <version>1.3.1</version>
    <scope>test</scope>
</dependency>

Simple test

Let’s begin with a simple test case. We’ll create a test to verify if our CompletenessValidator works as expected:

import com.javapro.langchain4j.guardrails.v5.CompletenessValidator;
import io.quarkiverse.langchain4j.scorer.junit5.AiScorer;
import io.quarkiverse.langchain4j.scorer.junit5.SampleLocation;
import io.quarkiverse.langchain4j.scorer.junit5.ScorerConfiguration;
import io.quarkiverse.langchain4j.testing.scorer.EvaluationReport;
import io.quarkiverse.langchain4j.testing.scorer.Samples;
import io.quarkiverse.langchain4j.testing.scorer.Scorer;
import io.quarkus.test.junit.QuarkusTest;
import jakarta.inject.Inject;
import org.junit.jupiter.api.Test;

import static org.assertj.core.api.Assertions.assertThat;

@AiScorer
@QuarkusTest
class CompletenessValidatorTest {

    @Inject
    CompletenessValidator completenessValidator;

    @Test
    void test_01_CompletenessValidator(
            @ScorerConfiguration(concurrency = 2) Scorer scorer,
            @SampleLocation("src/test/resources/prompt-samples.yaml") Samples<String> samples) {

        EvaluationReport<String> report = scorer.evaluate(
                samples,
                prompt -> completenessValidator.isRequestComplete(prompt.get(0)),
                (sample, verdict) -> sample.expectedOutput().equalsIgnoreCase(verdict)
        );

        assertThat(report.score()).isEqualTo(100);
    }
}

For Quarkus AI testing, you need two key annotations: @QuarkusTest starts your application context and handles dependency injection for your application components, while @AiScorer provides the scoring framework. This setup injects a configured Scorer object and enables YAML-based test samples. If you prefer a more lightweight approach, you can replace @AiScorer with @ExtendWith(ScorerExtension.class).

As you can see, our test method has two extra parameters:

  • @ScorerConfiguration(concurrency = 2)

Configures the Scorer to process up to 2 samples in parallel, which significantly speeds up tests containing many samples.

  • @SampleLocation("…/prompt-samples.yaml")

Specifies the location of the YAML file containing your test cases. The entries look like this (don’t forget to create this file):

# src/test/resources/prompt-samples.yaml
- name: "Valid prompt"
  parameters:
    - "Generate a Java class UserService with CRUD methods."
  expected-output: "VALID"

- name: "invalid prompt"
  parameters:
    - "Make a utility."
  expected-output: "INVALID"

These samples are loaded as Sample<String> objects that your test can process.

The scorer.evaluate(samples, function, strategy) method takes three parameters: the samples (as explained above), a function to call, and a strategy for evaluating the function’s output.

When executed, it performs these steps:

  1. The evaluate() method runs the test in parallel for each sample.
  2. It calls your function with the sample’s parameters (prompt.get(0)) and treats the result as a verdict.
  3. It compares each verdict with its expected output.
  4. It gathers all results into an EvaluationReport instance that tracks pass/fail statistics.
  5. report.score() gives you the percentage of passed tests (0–100).

The assertion requires a perfect 100% score. All samples must pass for the test to succeed.
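Conceptually, the scoring boils down to something like this tiny sketch (illustrative only, not the extension’s actual code): run each sample through the function under test, check the verdict with the evaluation strategy, and report the pass percentage:

```java
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.Function;

// Illustrative sketch of what the scorer computes: run each sample through the
// function, check the result with the strategy, and report the pass percentage.
public class MiniScorer {

    public static <I, O> double score(List<I> samples,
                                      Function<I, O> fn,
                                      BiPredicate<I, O> strategy) {
        long passed = samples.stream()
                .filter(s -> strategy.test(s, fn.apply(s)))
                .count();
        return 100.0 * passed / samples.size();
    }
}
```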

The test is green after running it ✅

Great feature!

We can take this further by using more sophisticated strategies to evaluate samples against verdicts. The Quarkus extension comes with two powerful built-in evaluation strategies that cover most testing needs.

Using built-in Strategies

Similarity strategy

Quarkus LangChain4j’s SemanticSimilarityStrategy provides fast, embedding-based comparisons. It converts both your expected and actual outputs into dense embeddings (vectors) and calculates their cosine similarity. When this similarity meets or exceeds your defined threshold, the test passes.
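As a refresher, cosine similarity between two embedding vectors is just dot(a, b) / (|a| · |b|); a value close to 1.0 means the two texts are semantically similar. The strategy computes this for you, but here is the underlying math as a small illustrative snippet:

```java
// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|).
// Assumes non-zero vectors of equal length.
public class CosineSimilarity {

    public static double of(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```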

To implement this, add both the semantic-similarity test module and an embedding model library to your project:

Maven:

<dependency>
  <groupId>io.quarkiverse.langchain4j</groupId>
  <artifactId>quarkus-langchain4j-testing-scorer-semantic-similarity</artifactId>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>dev.langchain4j</groupId>
  <artifactId>langchain4j-embeddings-bge-small-en-v15</artifactId>
  <version>1.8.0-beta15</version>
  <scope>test</scope>
</dependency>

Let’s test our AI Java Developer service using the semantic similarity approach. We’ll provide a SemanticSimilarityStrategy instance to the scorer with a specific threshold:

import com.javapro.langchain4j.guardrails.v5.JavaDeveloper;
import dev.langchain4j.model.openai.OpenAiChatModel;
import io.quarkiverse.langchain4j.scorer.junit5.AiScorer;
import io.quarkiverse.langchain4j.scorer.junit5.SampleLocation;
import io.quarkiverse.langchain4j.scorer.junit5.ScorerConfiguration;
import io.quarkiverse.langchain4j.testing.scorer.EvaluationReport;
import io.quarkiverse.langchain4j.testing.scorer.Samples;
import io.quarkiverse.langchain4j.testing.scorer.Scorer;
import io.quarkiverse.langchain4j.testing.scorer.judge.AiJudgeStrategy;
import io.quarkiverse.langchain4j.testing.scorer.similarity.SemanticSimilarityStrategy;
import io.quarkus.test.junit.QuarkusTest;
import jakarta.inject.Inject;
import org.junit.jupiter.api.Test;

import static org.assertj.core.api.Assertions.assertThat;

@AiScorer
@QuarkusTest
class JavaDeveloperTest {

	@Inject
	JavaDeveloper javaDeveloper;

	@Test
    void test_02_Semantic_Similarity_Strategy(
            @ScorerConfiguration(concurrency = 5) Scorer scorer,
            @SampleLocation("src/test/resources/codegen-samples.yaml") Samples<String> samples) {


		EvaluationReport<String> report = scorer.evaluate(
				samples,
				prompt -> javaDeveloper.writeCode(prompt.get(0)),
				new SemanticSimilarityStrategy(0.85)  // 85% similarity threshold
		);

        assertThat(report.score()).isGreaterThanOrEqualTo(100);
    }
}

Here are our samples for this scenario:

- name: "Singleton"
  parameters:
    - "Generate a thread-safe singleton class named ConfigManager in Java."
  expected-output: |
    public class ConfigManager {
        private static volatile ConfigManager instance;
        private ConfigManager() {}
        public static ConfigManager getInstance() {
            if (instance == null) {
                synchronized (ConfigManager.class) {
                    if (instance == null) {
                        instance = new ConfigManager();
                    }
                }
            }
            return instance;
        }
    }
  tags: ["singleton"]

- name: "Builder"
  parameters:
    - "Create a Builder pattern for a Person class with fields name (String) and age (int)."
  expected-output: |
    public class Person {
        private final String name;
        private final int age;

        private Person(Builder builder) {
            this.name = builder.name;
            this.age = builder.age;
        }
    
        public String getName() {
            return name;
        }
    
        public int getAge() {
            return age;
        }

        public static class Builder {
            private String name;
            private int age;

            public Builder name(String name) {
                this.name = name;
                return this;
            }

            public Builder age(int age) {
                this.age = age;
                return this;
            }

            public Person build() {
                return new Person(this);
            }
        }
    }
  tags: ["builder"]

- name: "Factory"
  parameters:
    - "Implement a Factory class named VehicleFactory with method that returns a Vehicle instance based on a type parameter ('car' or 'truck')."
  expected-output: |
    public interface Vehicle { void drive(); }

    public class Car implements Vehicle {
        @Override public void drive() { System.out.println("Driving a car"); }
    }

    public class Truck implements Vehicle {
        @Override public void drive() { System.out.println("Driving a truck"); }
    }

    public class VehicleFactory {
        public static Vehicle create(String type) {
            return switch (type.toLowerCase()) {
                case "car"   -> new Car();
                case "truck" -> new Truck();
                default      -> throw new IllegalArgumentException("Unknown type: " + type);
            };
        }
    }
  tags: ["factory"]

- name: "UnitTestSkeleton"
  parameters:
    - "Generate a JUnit 5 test skeleton for a service class MyService with a single test stub."
  expected-output: |
    import org.junit.jupiter.api.Test;
    import static org.junit.jupiter.api.Assertions.*;

    public class MyServiceTest {

        private final MyService service = new MyService();

        @Test
        void testDoWork() {
            // TODO: arrange inputs

            // TODO: act
            var result = service.doWork();

            // TODO: assert
            assertNotNull(result);
        }
    }
  tags: ["unittest"]

Under the hood, each pair of texts is embedded and compared. A threshold of 0.85 means the similarity strategy requires at least 85% similarity between the expected and actual outputs.

At the end of the evaluation, the scorer should reach 100%, meaning all samples must meet the 85% similarity threshold.

Let’s run it….✅

AI Judge Strategy

When dealing with AI services that produce open-ended or highly variable text like code, simple similarity checks often fail to capture important nuances such as tone, completeness, or style. The AiJudgeStrategy addresses this limitation by using an LLM as a judge to evaluate whether each actual response appropriately matches the expected output.

To implement this, add this dependency to your project:

Maven:

<dependency>
  <groupId>io.quarkiverse.langchain4j</groupId>
  <artifactId>quarkus-langchain4j-testing-scorer-ai-judge</artifactId>
  <scope>test</scope>
</dependency>

Here is the test using the AI-as-judge approach:

@Test
void test_03_Ai_Judge_Strategy(
		@ScorerConfiguration(concurrency = 5) Scorer scorer,
		@SampleLocation("src/test/resources/codegen-samples.yaml") Samples<String> samples
	) {
	var judgeModel = OpenAiChatModel.builder()
			.apiKey(System.getenv("OPENAI_API_KEY"))
			.modelName("gpt-4o")
			.build();

	String evaluationPrompt = """
            You are an AI assistant evaluating a generated Java code snippet against an expected implementation.
            Determine whether the generated code follows the same Java concepts as the expected code.
            Return true if the generated code and the expected code are similar, false otherwise.

            Generated code: {response}
            Expected code: {expected_output}

            Respond with only 'true' or 'false'.
            """;

	EvaluationReport<String> report = scorer.evaluate(
			samples,
			prompt -> javaDeveloper.writeCode(prompt.get(0)),
			new AiJudgeStrategy(judgeModel, evaluationPrompt)
	);

	// We expect at least 70% of samples to be judged correct
	assertThat(report.score())
			.as("At least 70% of generated code snippets should correctly implement the pattern")
			.isGreaterThanOrEqualTo(70);
}

We reuse the same samples file (codegen-samples.yaml) that we used with the semantic-similarity strategy.

Notice that when we instantiate AiJudgeStrategy, we provide a chat model instance built with the LangChain4j API and a prompt that guides the judge in evaluating whether the generated Java code correctly implements the expected pattern. This strategy is particularly valuable in our case because:

  • Our code outputs are open-ended and can vary significantly in structure while still being correct.
  • We need a human-like judgment that can assess code correctness beyond simple text matching, considering factors like implementation approach, completeness, and adherence to Java patterns.

Let’s run it as well …✅

Choosing the Right Strategy

  • Go Semantic when your app’s responses follow a clear template or you need to guard against major meaning drift. It’s fast, cheap, and easy to configure.
  • Go AI Judge when you care about nuances like punctuation, politeness, or when “correctness” extends beyond keyword matching. It’s more flexible but requires an extra LLM call per test.

Custom Evaluation Strategies

If the built-in strategies don’t meet your specific evaluation needs, you can easily create your own custom strategy.

Let’s implement a custom strategy based on Levenshtein distance as an alternative to semantic similarity. This approach measures the edit distance between strings, calculating how many character changes are needed to transform one string into another. For example:

kitten → sitting   // distance = 3
k→s (substitution), e→i (substitution), insert g at the end (insertion)
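To make the arithmetic concrete, here is a small, dependency-free sketch of the distance and the normalized similarity we will build on (SimilarityDemo is an illustrative name; the strategy class itself will delegate to Apache Commons Text rather than this hand-rolled DP):

```java
public class SimilarityDemo {

    // Classic two-row dynamic-programming Levenshtein distance
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Normalized similarity in [0, 1]: 1 - distance / maxLength
    public static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) distance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));            // 3
        System.out.printf("%.3f%n", similarity("kitten", "sitting")); // 0.571
    }
}
```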

Add this dependency to your project to use the LevenshteinDistance class from Apache Commons Text:

Maven:

<dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-text</artifactId>
      <version>1.12.0</version>
</dependency>

Gradle:

implementation("org.apache.commons:commons-text:1.12.0")

The LevenshteinDistanceStrategy class below calculates a normalized edit-distance score:

import io.quarkiverse.langchain4j.testing.scorer.EvaluationSample;
import io.quarkiverse.langchain4j.testing.scorer.EvaluationStrategy;
import org.apache.commons.text.similarity.LevenshteinDistance;

public class LevenshteinDistanceStrategy implements EvaluationStrategy<String> {

    private final double threshold;
    private final LevenshteinDistance levenshtein;

    public LevenshteinDistanceStrategy(double threshold) {
        if (threshold < 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("Threshold must be between 0.0 and 1.0");
        }
        this.threshold = threshold;
        this.levenshtein = new LevenshteinDistance();
    }

    @Override
    public boolean evaluate(EvaluationSample<String> sample, String output) {
        String expected = sample.expectedOutput();

        // compute raw edit distance
        int dist = levenshtein.apply(expected, output);
        int maxLen = Math.max(expected.length(), output.length());
        // avoid division by zero
        if (maxLen == 0) {
            return true;
        }

        // normalized similarity: 1 - (distance / maxLen)
        double similarity = 1.0 - ((double) dist / maxLen);
        return similarity >= threshold;
    }
}

The following tests check whether the similarity score meets our defined threshold:

@Test
void test_04_Custom_Evaluation_Strategy(
		@ScorerConfiguration(concurrency = 5) Scorer scorer,
		@SampleLocation("src/test/resources/codegen-samples.yaml") Samples<String> samples) {

	LevenshteinDistanceStrategy levenshteinStrategy = new LevenshteinDistanceStrategy(0.4);

	EvaluationReport<String> report = scorer.evaluate(
			samples,
			topic -> javaDeveloper.writeCode(topic.get(0)),
			levenshteinStrategy
	);

	assertThat(report.score()).isGreaterThanOrEqualTo(70);
}
	
@Test
void test_04_Custom_Evaluation_Strategy_Not_Valid(
		@ScorerConfiguration(concurrency = 5) Scorer scorer,
		@SampleLocation("src/test/resources/invalid-codegen-samples.yaml") Samples<String> samples) {

	LevenshteinDistanceStrategy levenshteinStrategy = new LevenshteinDistanceStrategy(0.4);

	EvaluationReport<String> report = scorer.evaluate(
			samples,
			topic -> javaDeveloper.writeCode(topic.get(0)),
			levenshteinStrategy
	);

	assertThat(report.score()).isLessThan(30);
}

What’s happening in our Custom Evaluation Test?

  • We create our strategy with a threshold of 0.4 (40% similarity required to pass).
  • We run it against both valid and invalid code samples to verify it correctly identifies good and bad outputs.
  • For valid samples, we expect at least 70% to pass our similarity threshold.
  • For invalid samples, we expect less than 30% to pass, confirming our strategy identifies poor outputs.

This pattern demonstrates how you can create custom evaluation logic by implementing EvaluationStrategy<T>, allowing you to validate AI outputs using domain-specific criteria beyond what’s available in the built-in strategies.

With our application now protected by guardrails and thoroughly tested, we can confidently deploy to production with peace of mind. 🎉

Fast forward a month, and you open your OpenAI API invoice—yikes—your GPT bill has skyrocketed. Every code generation burns through tokens, much to your model provider’s delight.

Before your wallet surrenders, it’s time to get serious about monitoring resource consumption and performance. How many tokens are your services consuming? Which endpoints are the most resource-intensive? Are response times slowing during peak traffic? In the next chapter, we’ll explore observability for your Quarkus LangChain4j application: tracking usage, tracing requests, and helping you optimize costs while keeping your AI services running smoothly.

Observability

When putting AI-infused applications into production, we need to know what’s happening inside them. Observability lets us see how our app talks to AI models in real time. We can track what’s happening, measure performance, and fix problems, just like we do with regular Java applications.

Using tools like OpenTelemetry and Micrometer with Quarkus, we can collect important data like how many tokens we’re using and how long requests take. This works whether you’re calling OpenAI directly or using more complex AI patterns like A2A (Agent to Agent).

This isn’t just for fixing bugs! It also helps catch AI mistakes like hallucinations or inappropriate outputs. Combined with guardrails, observability turns AI from a mysterious black box into something we can understand and trust.

As your AI applications grow, good observability keeps everything accountable and explainable.

In Quarkus, observability is automatically integrated into services created with @RegisterAiService in two ways:

  • Metrics: These reveal the “what” and “how often” of your application’s behavior – measuring the frequency and quantity of actions.
  • Traces: These show the “why” and “what happened where” – providing detailed pathways of execution through your system.

Since this is a hands-on article, let’s set up the environment to monitor our AI-infused application in action.

Setting up

We’re using the Quarkus Observability Dev Services with the LGTM stack. This is an all-in-one Docker image that sets up a complete monitoring system. It contains an OpenTelemetry Collector that gathers telemetry data and sends it to three specialized systems: Prometheus (for metrics), Tempo (for traces), and Loki (for logs). You can then view all this information through Grafana dashboards.

Here’s what each letter in LGTM stands for:

  • L for Loki: Thor’s brother and son of Odin… I’m kidding. It handles your application logs.
  • G for Grafana: Creates beautiful visualizations of your metrics.
  • T for Tempo: Manages distributed traces to track request flows.
  • M for Mimir: Provides long-term storage for your Prometheus metrics (we won’t need it here).

Let me walk you through the setup:

1- Add the dependency to your project

First, add the Grafana OTel-LGTM Dev Service extension to your project.

Maven:

<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-observability-devservices-lgtm</artifactId>
    <scope>provided</scope>
</dependency>

Gradle:

implementation("io.quarkus:quarkus-observability-devservices-lgtm")

2- Run the application

# Maven
./mvnw quarkus:dev

# Gradle
./gradlew quarkusDev

3- Access the Grafana UI

When reviewing your application startup logs, you’ll find the Grafana URL there.

2025-07-20 18:47:41,400 INFO  [io.qua.obs.tes.LgtmContainer] (docker-java-stream--1666611896) [LGTM] STDOUT: The OpenTelemetry collector and the Grafana LGTM stack are up and running. (created /tmp/ready)
2025-07-20 18:47:41,400 INFO  [io.qua.obs.tes.LgtmContainer] (docker-java-stream--1666611896) [LGTM] STDOUT: Open ports:
2025-07-20 18:47:41,400 INFO  [io.qua.obs.tes.LgtmContainer] (docker-java-stream--1666611896) [LGTM] STDOUT:  - 4317: OpenTelemetry GRPC endpoint
2025-07-20 18:47:41,401 INFO  [io.qua.obs.tes.LgtmContainer] (docker-java-stream--1666611896) [LGTM] STDOUT:  - 4318: OpenTelemetry HTTP endpoint
2025-07-20 18:47:41,401 INFO  [io.qua.obs.tes.LgtmContainer] (docker-java-stream--1666611896) [LGTM] STDOUT:  - 3000: Grafana. User: admin, password: admin
2025-07-20 18:47:47,193 INFO  [tc.doc.io/.11.0] (build-64) Container docker.io/grafana/otel-lgtm:0.11.0 started in PT13.6357992S
2025-07-20 18:47:47,193 INFO  [io.qua.obs.dep.ObservabilityDevServiceProcessor] (build-64) Dev Service Lgtm started, config: {grafana.endpoint=http://localhost:55786, quarkus.otel.exporter.otlp.endpoint=http://localhost:55788, quarkus.otel.exporter.otlp.protocol=http/protobuf, otel-collector.url=localhost:55788}

Keep in mind that Grafana uses an ephemeral port each time you restart your application, so check the logs to see which port is currently in use. In this example, the Grafana endpoint is http://localhost:55786.

Now, let’s explore the specific observability features available for LangChain4j components: traces and metrics.

Traces

Traces are detailed narratives of individual requests: they let you follow a single call from input to output.

LangChain4j integrates with OpenTelemetry through Quarkus, so each AI method, guardrail, or tool invocation creates a span. These spans form a trace, letting you:

  • Pinpoint where time is spent
  • Diagnose why a specific call failed
  • Understand dependencies in complex agent workflows

The parent span handles the HTTP request, but the real insight comes from the AI service spans, which show OpenAI calls, guardrail executions, and more.

Traces are invaluable for debugging, post-mortems, and uncovering the root cause of failures even when metrics appear normal.

Time to practice! We need to add the OpenTelemetry dependency to our project to make traces available in Grafana:

Maven:

<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-opentelemetry</artifactId>
</dependency>

Gradle:

implementation("io.quarkus:quarkus-opentelemetry")
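With the Dev Service, no exporter configuration is required: Quarkus wires the OTLP endpoint from the startup logs automatically. Purely as an illustration, if you later point the application at an external collector, the configuration might look like this (the endpoint value is hypothetical):

```properties
# application.properties: hypothetical external collector (not needed with the LGTM Dev Service)
quarkus.application.name=ai-java-developer
quarkus.otel.exporter.otlp.traces.endpoint=http://otel-collector.example.com:4317
```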

Rerun the application and go to the Grafana UI.

Let’s demonstrate this by navigating to the Grafana UI and clicking on “Explore.” This section allows you to query data across all available data sources. Select the “tempo” data source to run trace queries.

Enter the following TraceQL query:

{ true }

This will display all traces from previous executions of our AI Java Developer application.

You can see your traces. If you click on a Trace ID (in blue), you can expand the row to view more detailed information.

What’s going on here? You can see the POST endpoint was called and returned a 200 OK response. The request took 3.76 seconds to complete, and you can view all spans related to it.

If we zoom in on the trace visualization, we can clearly see the sequence of guardrail executions, exactly as explained in the Guardrails chapter, along with the duration of each one:

When we invoke our POST endpoint, writeCode() is called. Quarkus first runs CodegenInputGuardrail, followed by CompletenessInputGuardrail, which itself uses another AI service (I know this because that guardrail calls the OpenAI API endpoint, POST /chat/completions). After these input validations pass, the main OpenAI call generates our Java code. Finally, CodegenOutputGuardrail and CodeStyleOutputGuardrail run sequentially to verify the generated code.

NOTE: You can see logs for each call by clicking on the “LOG” icon to the right of each span.

Metrics

Metrics provide quantifiable, aggregated data about what’s happening over time in your system. They function as your application’s vital signs—like heartbeat, pulse, and blood pressure. In LangChain4j with Quarkus, key metrics include:

  • Call durations per AI method
  • Number of successful or failed invocations
  • Performance bottlenecks and throughput

These metrics help you monitor important trends: Are response times increasing? How frequently are methods called? Which prompts are failing consistently?

With metrics, you can build effective dashboards and set up alerts, perfect for monitoring performance in real time and optimizing your system.

Let’s see this in action using our AI Java developer application!

First, add the Micrometer Prometheus registry dependency to our project to make metrics available to Grafana through Prometheus:

Maven:

<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-micrometer-registry-prometheus</artifactId>
</dependency>

Gradle:

implementation("io.quarkus:quarkus-micrometer-registry-prometheus")

And, you know the drill: rerun the application and open the Grafana UI.

Let’s create a Gauge visualization to monitor our application’s OpenAI estimated usage costs in real time.

Based on OpenAI’s pricing for gpt-4o:

  • $5.00 per 1M input tokens
  • $20.00 per 1M output tokens

This translates to $0.000005 per input token and $0.00002 per output token.
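The arithmetic behind the gauge is simple enough to sketch in a few lines of Java (the class name and the example token counts are illustrative; the prices are the list prices quoted above):

```java
public class OpenAiCostEstimator {

    // gpt-4o list prices quoted above: $5 / 1M input tokens, $20 / 1M output tokens
    static final double INPUT_PRICE_PER_TOKEN = 5.00 / 1_000_000;   // $0.000005
    static final double OUTPUT_PRICE_PER_TOKEN = 20.00 / 1_000_000; // $0.00002

    public static double estimateCost(long inputTokens, long outputTokens) {
        return inputTokens * INPUT_PRICE_PER_TOKEN + outputTokens * OUTPUT_PRICE_PER_TOKEN;
    }

    public static void main(String[] args) {
        // Hypothetical single code generation: 1,200 input tokens, 800 output tokens
        System.out.printf("$%.4f%n", estimateCost(1_200, 800)); // $0.0220
    }
}
```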

Now let’s implement this cost tracking with a Grafana gauge, using Prometheus as our data source:

In the top right corner, click on “Time series” to change the visualization type and select “Gauge.” Next, in the queries section, switch from “Builder” to “Code” and paste the following query:

sum(gen_ai_client_token_usage_total{gen_ai_token_type="output"} * 0.00002) + sum(gen_ai_client_token_usage_total{gen_ai_token_type="input"} * 0.000005)

After clicking “Run queries,” you’ll see a gauge visualization like this:

If not, try changing the date range of your visualization to “Last 1 hour” or longer.

You can experiment with different types of Grafana visualizations using the various LangChain4j metrics, which is a great way to analyze your data.
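As another illustration, since the AI-service timings are exposed as Micrometer timers, an average-call-duration panel could use a query roughly like the following (the metric names are assumptions; check your application’s /q/metrics output for the exact names your versions expose):

```
rate(langchain4j_aiservices_timed_seconds_sum[5m]) / rate(langchain4j_aiservices_timed_seconds_count[5m])
```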

TIP: Grafana’s new Drilldown feature allows you to explore your Prometheus-compatible metrics without writing queries. Pretty cool, isn’t it? Let’s see how it looks:

In the Grafana UI side menu, click on “Drilldown”:

You can explore all metrics exposed by your application’s Prometheus endpoint (http://localhost:8080/q/metrics).

For example, searching for “langchain4j” yields these visualizations:

Choose langchain4j_aiservices_timed_seconds_total to explore it in more detail.

I’ll leave you to explore this topic further as it falls outside our current scope 🤓

Conclusion

LangChain4j combined with Quarkus provides a production-ready foundation for building AI applications based on three essential pillars:

Guardrails act as safety mechanisms that validate both inputs and outputs when interacting with LLMs. As demonstrated in our application, they can be implemented as first-class beans with input guards checking for unsafe or inappropriate content before it reaches the model, and output guards ensuring generated code meets quality standards.

Testing transforms unpredictable AI responses into measurable, reliable outcomes. The testing extensions we explored allow you to evaluate AI behavior through semantic-similarity scoring, AI-judge assessment, or custom evaluation strategies. This systematic approach catches potential issues during CI/CD pipelines rather than in production environments.

Observability provides crucial insights into your AI application’s performance. Through Micrometer metrics and OpenTelemetry traces integrated with Grafana, you can monitor key indicators like token usage, response times, and costs, and identify bottlenecks through detailed trace spans.

Together, these capabilities transform LangChain4j from a simple development library into a robust, enterprise-ready framework. By implementing proper guardrails, comprehensive testing strategies, and detailed observability tools as shown throughout this article, you can deploy AI-infused Java applications to production with confidence, security, and complete auditability.

REFERENCES

“Build AI Apps and Agents in Java: Hands-On with LangChain4j” by Lize Raes: https://javapro.io/2025/04/23/build-ai-apps-and-agents-in-java-hands-on-with-langchain4j/

GitHub hands-on repository: https://github.com/anunnakian/langchain4j-new-features

Quarkus LangChain4j docs: https://docs.quarkiverse.io/quarkus-langchain4j/dev/index.html

Community incubator repo for new integrations: github.com/langchain4j/langchain4j-community

Examples repo: github.com/langchain4j/langchain4j-examples

Awesome LangChain4j examples: github.com/langchain4j/awesome-langchain4j

LangChain4j Discord: discord.com/invite/JzTFvyjG6R
