The Gen AI Iceberg – Java Tooling Edition

Artur Skowronski

From @Autowired ChatClient to Compiling GPU Kernels From Pure Java Bytecode: A Lovecraftian Descent Through Everything a JVM Developer Needs (or Fears) to Build AI-Powered Applications

The “Iceberg” meme is an internet phenomenon that humorously, and sometimes unsettlingly, illustrates levels of knowledge or initiation into a given topic – from simple, widely known facts at the tip of the iceberg to the dark, esoteric depths comprehensible only to the most battle-hardened veterans. Picture an iceberg floating on water: what’s visible on the surface is just the beginning, while the real magic (or nightmare) lurks beneath, in increasingly inaccessible layers.

Personally, I love it. I’ve already done icebergs for the JVM ecosystem, performance, and even AI-on-Java from the platform perspective. But this time, let’s shift the lens. Instead of asking “how does the JVM support AI?”, let’s ask the question every Java developer is actually Googling at 11 PM: “What tools do I actually use to build Gen AI stuff in Java?”

The common narrative says Java developers just slap spring-ai-starter-model-openai in their pom.xml and call it a day. And for many, that’s true – and that’s fine! But under that single Maven dependency lies a sprawling, chaotic, beautiful ecosystem of tools, SDKs, inference engines, and protocols that go from “my PM approved this” all the way down to “I’m running an LLM inside my JVM process and my heap dump weighs 14 gigabytes.”

The deeper you go, the harder it is to explain what you do at stand-ups.

PS: For using the phrase “it’s just a REST call” unironically at this point, you’d be thrown overboard on this ship! Consider yourself warned.

Buckle up! 🚀


Level 1: The Tip of the Iceberg (The “@Autowired ChatClient” Zone)

Welcome to the surface! The sun is shining, the water is warm, and your biggest drama is choosing between spring-ai-openai and spring-ai-anthropic starters.

This is the world 90% of Java AI developers see – a world where AI is just another bean to inject and every second Spring blog post starts with “Building a RAG application in 5 minutes with…”

Spring AI

TLDR: Spring Magic meets LLMs. @Autowired your way into the AI revolution. If Spring Boot is your religion, this is your AI gospel.

Spring AI is the ultimate expression of “if it exists, Spring will eventually have an abstraction for it.” Launched as a proper 1.0 GA in May 2025, it brings the full power of Spring’s “convention over configuration” philosophy to the world of language models. You don’t build AI pipelines; you inject them. ChatClient, EmbeddingModel, VectorStore – they’re all just beans now, auto-configured and ready to roll.

The beauty is in the portability: switch from OpenAI to Anthropic to Ollama by changing a property, not your code. Add a VectorStore backed by PGVector, Milvus, or Pinecone with a starter dependency. Plug in document ETL, structured output mapping to POJOs (yes, your AI now returns proper Java objects), and observability via Micrometer out of the box. With version 2.0 now on the horizon (built on Spring Boot 4 and Spring Framework 7), the team is pushing into MCP support, agent orchestration, and memory management. Mark Pollack even managed to get AI-generated background music for the release notes. Peak Spring Energy.

LangChain4j

TLDR: The scrappy, LLM-native alternative for those not married to Spring. Explicit composition, Java-first idioms, and surprisingly snappy performance.

If Spring AI says “inject it,” LangChain4j says “build it.” Created by Dmytro Liubarskyi in 2023 and hitting 1.0 GA in May 2025 (the same month as Spring AI – coincidence? I think not), LangChain4j is the Java port of the LangChain philosophy, but with a twist: it actually feels like Java, not like Python code forcibly crammed into type safety.

The killer feature? AI Services – declarative interfaces where you define what you want, and LangChain4j figures out the plumbing. RAG, tool calling, memory, MCP – it’s all there, with broader provider support and vector store integrations than you can shake a pom.xml at. Microsoft and Red Hat both back the project, with hundreds of production deployments. And with the langchain4j-agentic and langchain4j-agentic-a2a modules landing in version 1.3, multi-agent orchestration became a first-class citizen.
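To make the “declarative interface” idea less magical, here is a minimal plain-JDK sketch of how such a thing can be wired with a dynamic proxy. Everything in it is hypothetical: the @Prompt annotation, the TinyAiServices class, and the echoing stand-in model are invented for illustration and are not LangChain4j’s API (its real AiServices does far more, including memory, tools, and output parsing).

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Proxy;
import java.util.function.Function;

/** Hypothetical annotation carrying a prompt template; "{{it}}" marks the argument slot. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Prompt {
    String value();
}

/** The declarative contract: what we want, with no plumbing in sight. */
interface Summarizer {
    @Prompt("Summarize in one sentence: {{it}}")
    String summarize(String text);
}

final class TinyAiServices {
    /** Builds an implementation of the interface that renders the template and calls the model. */
    static <T> T create(Class<T> type, Function<String, String> model) {
        Object proxy = Proxy.newProxyInstance(
                type.getClassLoader(),
                new Class<?>[] {type},
                (instance, method, args) -> {
                    Prompt prompt = method.getAnnotation(Prompt.class);
                    if (prompt == null) {
                        throw new UnsupportedOperationException("No @Prompt on " + method.getName());
                    }
                    String argument = (args == null || args.length == 0) ? "" : String.valueOf(args[0]);
                    // Render the template, then hand the finished prompt to the model function.
                    return model.apply(prompt.value().replace("{{it}}", argument));
                });
        return type.cast(proxy);
    }

    public static void main(String[] args) {
        // Stand-in "model" that just echoes the prompt; a real one would call an LLM.
        Summarizer summarizer = create(Summarizer.class, p -> "[model saw] " + p);
        System.out.println(summarizer.summarize("Java is fine, actually."));
    }
}
```

The point of the design is that the interface stays the only thing your business code sees; swapping the model function (or the whole framework behind it) does not touch the callers.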

The real kicker? Both Spring AI and LangChain4j are production-ready, and the community debates between them are surprisingly civil (by Java standards, at least).

OpenAI Java SDK

TLDR: When you just want to call the API. No frameworks, no magic – just HTTP and JSON, the way Gosling intended.

Sometimes you don’t need a framework. Sometimes you just need to call GPT-4o from your service layer, get a response, and move on with your life. The official OpenAI Java SDK (and its community counterparts) gives you exactly that: typed request/response objects, streaming support, and function calling without any framework opinions about how you should structure your application.

It’s the HttpURLConnection of Gen AI – nobody brags about using it, but everybody’s done it at 2 AM to get a demo working. Pair it with your existing dependency injection framework (or, God forbid, static factory methods), and you’ve got AI integration without a single new annotation on your classpath.
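For a sense of what any such SDK ultimately puts on the wire, here is a rough sketch that builds (but does not send) a chat completion request with nothing but java.net.http. The RawChatRequest class and its hand-rolled JSON escaping are my own illustration, not part of the OpenAI SDK, and the request covers only the bare minimum of the chat completions payload.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class RawChatRequest {
    /** Escapes the characters that would break a JSON string literal. */
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Minimal request body: one model name, one user message. */
    static String body(String model, String userMessage) {
        return "{\"model\":" + jsonString(model)
                + ",\"messages\":[{\"role\":\"user\",\"content\":" + jsonString(userMessage) + "}]}";
    }

    /** Builds the POST request; sending it is left to an HttpClient of your choice. */
    static HttpRequest build(String apiKey, String model, String userMessage) {
        return HttpRequest.newBuilder(URI.create("https://api.openai.com/v1/chat/completions"))
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(model, userMessage)))
                .build();
    }
}
```

To actually fire it, pass the result to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). What an SDK adds on top is the typed response objects, streaming, and retries you would otherwise write yourself.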

Quarkus LangChain4j Extension

TLDR: LangChain4j, but supersonic, subatomic, and with Dev Mode. Quarkus does AI the Quarkus way.

The Quarkus team, in collaboration with the LangChain4j project, built a first-class extension that wraps LangChain4j in the Quarkus experience: CDI-based @RegisterAiService, Dev UI for testing prompts in real-time, hot reload that actually works during AI development, and native compilation support for those who want their RAG pipeline to start in 50 milliseconds.

The really interesting part is the guardrails system – input/output validation, observability via OpenTelemetry, and semantic testing strategies built right in. Quarkus doesn’t just help you build AI apps; it helps you build production-ready AI apps. And with the MCP server extension turning CDI beans into MCP-compliant tools with a single annotation, the “developer joy” factor is genuinely high.
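Conceptually, a guardrail is just a check that runs before the prompt goes out and another that runs before the answer comes back. The sketch below shows that shape in plain Java; GuardedModel and its rule lists are hypothetical names of mine, not the Quarkus extension’s API, which has its own richer abstractions.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

final class GuardedModel {
    private final Function<String, String> model;
    private final List<Predicate<String>> inputRules;
    private final List<Predicate<String>> outputRules;

    GuardedModel(Function<String, String> model,
                 List<Predicate<String>> inputRules,
                 List<Predicate<String>> outputRules) {
        this.model = model;
        this.inputRules = List.copyOf(inputRules);
        this.outputRules = List.copyOf(outputRules);
    }

    /** Runs every input rule, calls the model, then runs every output rule. */
    String chat(String prompt) {
        for (Predicate<String> rule : inputRules) {
            if (!rule.test(prompt)) {
                throw new IllegalArgumentException("Input rejected by guardrail");
            }
        }
        String answer = model.apply(prompt);
        for (Predicate<String> rule : outputRules) {
            if (!rule.test(answer)) {
                throw new IllegalStateException("Output rejected by guardrail");
            }
        }
        return answer;
    }
}
```

A production version would likely retry or reprompt on a rejected output rather than throw, but the two checkpoints are the essence of it.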

Microsoft Semantic Kernel for Java

TLDR: Microsoft’s Java AI framework. Enterprise pedigree, cross-language ambitions, slightly trailing in the Java-specific race.

Semantic Kernel, Microsoft’s entry into the AI orchestration space, supports Java alongside C# and Python. It brings concepts like Plugins, Planners, and Memory, and integrates naturally with Azure OpenAI services. The multi-agent architecture ambitions are big.

But here’s the honest truth: in the Java world, Semantic Kernel feels like it’s still warming up. The C# and Python implementations are further ahead, and the Java community has largely rallied around Spring AI and LangChain4j. That said, if you’re in an Azure-first shop, or if your team already uses Semantic Kernel in C#, the Java SDK provides a consistent cross-platform experience. Just don’t expect it to keep pace with LangChain4j’s weekly releases.


Level 2: Just Below the Surface (The “Wait, What’s Under That Starter?” Zone)

Here, you start asking questions. “But how does ChatClient actually call the model?” “What is this MCP thing everyone keeps talking about?” “My architect says we need a vector store – what even is that in Java terms?”

The water is getting colder 🥶. Welcome to the infrastructure layer, where you realize every framework is just orchestrating a set of building blocks you should probably understand.

MCP Java SDK (Model Context Protocol)

TLDR: The USB-C of AI tooling. One protocol to connect any LLM to any tool, and Java has the official SDK.

MCP – the Model Context Protocol – started as an Anthropic initiative, and the Spring AI team jumped on it early, eventually contributing work that became the official Java SDK. The idea is brilliant in its simplicity: standardize how AI agents discover and call external tools. No more bespoke function-calling glue code for every provider.

In practice, you define MCP servers (which expose tools, resources, and prompts) and MCP clients (which consume them). Both Spring AI and LangChain4j have native MCP support. The Java SDK supports STDIO, HTTP SSE, and the newer Streamable HTTP transports. Quarkus even has a dedicated MCP server extension that turns your CDI beans into MCP tools with annotations.

Why does this matter? Because MCP is becoming the lingua franca for AI-tool integration. Your Java backend exposing a database query as an MCP tool can be consumed by Claude, by a Spring AI agent, by a LangChain4j service, or by any MCP-compliant client. Write once, integrate everywhere – now that’s a Java promise worth keeping.
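Under the hood, MCP speaks JSON-RPC 2.0, with methods such as tools/list and tools/call. The toy below mimics those two operations in memory and prints the shape of a tools/call request; TinyToolServer and the count_orders tool are made up for illustration, and the real Java SDK handles transports, schemas, and sessions that this sketch ignores.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class TinyToolServer {
    private final Map<String, Function<Map<String, String>, String>> tools = new LinkedHashMap<>();

    void register(String name, Function<Map<String, String>, String> handler) {
        tools.put(name, handler);
    }

    /** Rough analogue of the protocol's "tools/list" method: which tools exist? */
    List<String> listTools() {
        return new ArrayList<>(tools.keySet());
    }

    /** Rough analogue of "tools/call": look the tool up by name and run it. */
    String callTool(String name, Map<String, String> arguments) {
        Function<Map<String, String>, String> handler = tools.get(name);
        if (handler == null) {
            throw new IllegalArgumentException("Unknown tool: " + name);
        }
        return handler.apply(arguments);
    }

    /** Shape of the JSON-RPC 2.0 request a client sends (assumes values need no escaping). */
    static String callRequest(int id, String tool, String argName, String argValue) {
        return "{\"jsonrpc\":\"2.0\",\"id\":" + id
                + ",\"method\":\"tools/call\",\"params\":{\"name\":\"" + tool
                + "\",\"arguments\":{\"" + argName + "\":\"" + argValue + "\"}}}";
    }

    public static void main(String[] args) {
        TinyToolServer server = new TinyToolServer();
        server.register("count_orders", a -> "42 orders for " + a.get("customer"));
        System.out.println(server.listTools());
        System.out.println(callRequest(1, "count_orders", "customer", "acme"));
        System.out.println(server.callTool("count_orders", Map.of("customer", "acme")));
    }
}
```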

Ollama (from the JVM Side)

TLDR: Local LLMs without leaving localhost. Both Spring AI and LangChain4j have starters for it. Your dev environment just got an AI backend.

Ollama is the darling of the local LLM movement – ollama run llama3 and you’ve got a model running on your machine. But for Java developers, the story is even better: both Spring AI (spring-ai-starter-model-ollama) and LangChain4j (langchain4j-ollama) have first-class integration, meaning you can develop your AI application entirely offline.

The real power move? Testcontainers. Spin up an Ollama container in your integration tests, load a small model, and actually test your RAG pipeline against a real LLM. No API keys, no rate limits, no “works on my machine” excuses (well, fewer of them). For dev/test environments in enterprises that can’t send data to external APIs, Ollama via Docker is the answer that makes compliance teams stop twitching.

Deep Java Library (DJL)

TLDR: Amazon’s engine room for AI inference in Java. Spoiler: some of your Spring AI dependencies are DJL in a Spring Boot trench coat.

You peek into the transitive dependencies of your Spring AI project (say, one that runs a local ONNX embedding model) and find ai.djl. Surprise! DJL (Deep Java Library), Amazon’s open-source project, is the actual workhorse doing inference in many Java AI setups. It’s an “agnostic engine” – an abstraction layer that lets you write one codebase and swap backends between PyTorch, TensorFlow, MXNet, and ONNX Runtime.

DJL has its own Spring Boot Starter, a clean Java API for model loading and inference, and native support for Hugging Face model repositories. It handles the dirty work of tensor manipulation, model format conversion, and memory management so that higher-level frameworks don’t have to. If Spring AI and LangChain4j are the face of Java AI, DJL is the backbone nobody talks about at conferences.

Vector Stores (PGVector, Milvus, Chroma, Qdrant…)

TLDR: Where your embeddings live. The “database” in RAG, but nobody taught this in your university’s database course.

RAG without a vector store is like Spring Boot without a database — technically possible but existentially pointless. The Java AI ecosystem has converged on a surprisingly rich set of vector store integrations: PGVector (for the “just add an extension to Postgres” crowd), Milvus (for the “we need billion-scale vector search” crowd), Chroma (for the “I saw it in a Python tutorial and want the Java equivalent” crowd), and at least a dozen more.

Both Spring AI and LangChain4j provide portable abstractions with metadata filtering, so switching between stores is mostly configuration. The real gotcha? Understanding that embedding quality matters more than store choice. You can have the fanciest Milvus cluster in the world, but if your embeddings are garbage, your RAG is garbage too. GIGO transcends paradigms.
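If “vector store” still sounds abstract, the core operation is small enough to fit on a napkin: keep (text, embedding) pairs and return the ones whose embedding is closest to the query by cosine similarity. Here is an in-memory sketch under that assumption; TinyVectorStore is a name I made up, and real stores add approximate indexes, persistence, and metadata filtering on top.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TinyVectorStore {
    private final List<String> texts = new ArrayList<>();
    private final List<float[]> vectors = new ArrayList<>();

    void add(String text, float[] embedding) {
        texts.add(text);
        vectors.add(embedding.clone());
    }

    /** Cosine similarity: dot product divided by the product of the vector lengths. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0; // a zero vector is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Brute-force search: score every entry, return the topK most similar texts. */
    List<String> search(float[] query, int topK) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < texts.size(); i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingDouble((Integer i) -> cosine(query, vectors.get(i))).reversed());
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < Math.min(topK, order.size()); i++) {
            hits.add(texts.get(order.get(i)));
        }
        return hits;
    }
}
```

Notice that the store never judges the embeddings it is given, which is exactly why bad embeddings produce confident-looking but useless hits.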

Hugging Face Tokenizers (via DJL)

TLDR: The tokenization layer nobody thinks about until their prompt mysteriously gets truncated at 4,096 tokens.

Here’s a dirty secret: every time you send text to an LLM, it gets broken into tokens by a tokenizer. The tokenizer determines how your text maps to the model’s vocabulary. Use the wrong one and your carefully crafted prompt becomes numerical nonsense.

DJL provides HuggingFaceTokenizer – a Java binding for the Hugging Face tokenizers library – letting you tokenize locally before sending requests, count tokens accurately, and avoid the “your prompt was too long” surprise. It’s not glamorous work, but if you’re building production RAG systems, accurate tokenization is the difference between “works in demo” and “works in production.”
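Once you can count tokens, the usual next step is budgeting: keep adding retrieved chunks to the prompt until the next one would blow the limit. A small sketch of that loop follows; TokenBudget is a hypothetical helper, and roughCount is a deliberately crude whitespace-based stand-in that you would replace with a counter backed by the model’s real tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

final class TokenBudget {
    /** Keeps adding chunks, in order, until the next one would exceed the budget. */
    static List<String> fit(List<String> chunks, int maxTokens, ToIntFunction<String> tokenCounter) {
        List<String> kept = new ArrayList<>();
        int used = 0;
        for (String chunk : chunks) {
            int cost = tokenCounter.applyAsInt(chunk);
            if (used + cost > maxTokens) {
                break;
            }
            kept.add(chunk);
            used += cost;
        }
        return kept;
    }

    /** Crude stand-in counter: whitespace-separated words. Real tokenizers split differently. */
    static int roughCount(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}
```

Passing the counter in as a function keeps the budgeting logic independent of whichever tokenizer matches your model.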


Level 3: The Depths (The “My pom.xml Has An ONNX Dependency” Zone)

The descent is complete. Sunlight is gone. You start caring about “inference latency” and “model format compatibility.” You understand why running a model inside your JVM process might actually be a good idea. You’ve read the ONNX Runtime Java docs and thought “this isn’t that bad, actually.”

Here, you stop asking “Which framework should I use?” and start asking: “Can I skip the HTTP call entirely?”

ONNX Runtime Java

TLDR: The Rosetta Stone of model formats. Train in Python, run natively in Java, skip the REST middle-man.

ONNX (Open Neural Network Exchange) is the “Esperanto” of machine learning – a universal model format that lets you train in PyTorch or TensorFlow and run inference in any language that has an ONNX Runtime. Java happens to have an excellent one.

The onnxruntime Java library lets you load .onnx model files and run inference directly in your JVM process. No external servers, no HTTP latency, no serialization overhead. For use cases like text classification, toxicity detection, embedding generation, or named entity recognition, ONNX in Java is blazingly fast and terrifyingly efficient. Combined with DJL’s tokenizer support, you can build a complete inference pipeline – from raw text to model output – entirely in Java.

The workflow? Data scientist trains model in Python → exports to ONNX → you load it in Java. Clean handoff, no Python in production. This is the bridge between worlds that actually works.

Jlama

TLDR: A full LLM inference engine written in 100% pure Java. Yes, really. No JNI, no C++, no Python. Just Java 20+, the Vector API, and audacity.

Jlama is the project that makes people do a double-take. Created by Jake Luciani, it’s a complete LLM inference engine written entirely in Java, leveraging the Vector API (Project Panama) for SIMD-level performance. It loads GGUF and SafeTensors model files, supports Llama, Mistral, Gemma, Qwen2, and more, and runs them directly in your JVM.

The implications are wild: embed a model inside your Java application. Same process, same classloader, same monitoring. No Ollama server, no Docker container, no REST call. Just new LlamaModel(...) and go. Jlama integrates with LangChain4j, has a Quarkus extension, supports quantized models (Q4, Q8), and even does distributed inference across multiple JVM nodes.

Is it as fast as llama.cpp? No – C++ with CUDA will always win the raw benchmark game. But for Java-native deployment, edge computing, air-gapped environments, and the sheer developer experience of “everything is Java” – Jlama is genuinely revolutionary. This is the llama.cpp of the JVM world, and the Vector API in JDK 25 will only make it faster.

Testcontainers + Ollama (AI Testing)

TLDR: Integration tests that actually talk to an LLM. Testcontainers makes it real, reproducible, and CI-friendly.

Here’s a question that haunts every AI application team: “How do you test this thing?” Mocking LLM responses is fragile and doesn’t catch prompt regressions. Calling a real API in CI is expensive and flaky. Enter Testcontainers with Ollama.

Spin up an OllamaContainer, pull a small model (like tinyllama or phi), and run your RAG pipeline against a real model in your JUnit test. Pin the temperature to 0 (and a seed, where the model supports one) and the result is deterministic enough to catch regressions, real enough to validate your prompt engineering, and cheap enough to run on every PR. Both Quarkus (via Dev Services) and Spring Boot (via Testcontainers auto-configuration) support this pattern natively.

The Quarkus LangChain4j extension even has built-in semantic testing strategies — compare actual LLM output against expected output using embedding similarity or an “AI judge.” It’s not perfect (nothing with LLMs is), but it’s light-years better than “I clicked it manually and it looked fine.”

LangChain4j-CDI (Jakarta EE Integration)

TLDR: AI for the Jakarta EE world. Because not everything runs on Spring Boot, and that’s okay.

The JVM ecosystem isn’t just Spring. There are WildFly installations, Payara clusters, and Open Liberty instances quietly running the world’s banking systems. LangChain4j-CDI (originally SmallRye-LLM, donated to the LangChain4j umbrella) brings AI capabilities to Jakarta EE via CDI.

MicroProfile Config for model configuration, MicroProfile Fault Tolerance for retries and circuit breakers on LLM calls, MicroProfile Telemetry for OpenTelemetry-based observability — it’s the full MicroProfile treatment for AI workloads. @RegisterAiService works just like in Quarkus, but on any CDI 4.x runtime. For enterprises that can’t (or won’t) migrate to Spring Boot, this is the on-ramp to Gen AI that doesn’t require rewriting the platform.
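If you have never looked at what a @Retry-style annotation buys you on a flaky LLM endpoint, it boils down to a loop like the one below. This is a hand-written approximation with exponential backoff, using names of my own (RetryingCall, withRetry); it is not how MicroProfile Fault Tolerance is implemented, and it leaves out circuit breakers and timeouts entirely.

```java
import java.util.function.Supplier;

final class RetryingCall {
    /** Calls the supplier up to maxAttempts times, doubling the delay after each failure. */
    static <T> T withRetry(Supplier<T> call, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // out of attempts, rethrow below
                }
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", ie);
                }
                delay *= 2; // exponential backoff
            }
        }
        throw last;
    }
}
```

The declarative version moves exactly this policy out of your code and into configuration, which is the whole appeal for teams already living in MicroProfile.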

DevoxxGenie / JetBrains AI Assistant

TLDR: AI coding assistance inside your Java IDE. When you want AI to help you write the code that calls AI.

Meta moment: you’re using an AI tool to help you build an AI-powered application. DevoxxGenie (the IntelliJ IDEA plugin created by Stephan Janssen) and JetBrains’ own AI Assistant bring LLM-powered code generation, explanation, and refactoring directly into the IDE where Java developers live.

DevoxxGenie is particularly interesting because it supports local models via Ollama and Jlama — meaning your code completions can run entirely on your hardware. No data leaves your machine, no API costs, just you and a local model arguing about whether that Stream pipeline is readable. For enterprises with strict data governance, this is the AI coding assistant that legal actually approves.


Level 4: The Abyss (The “I’m Running CUDA From My JVM” Zone)

Here, you lose all sense of normal development workflows. You know what MemorySegment does and you have opinions about Arena lifecycle management. You’ve read JEP 489 and know why Float16 matters for AI. Your colleagues think you’re building a REST API; you’re actually wiring GPU kernels.

The depth crushes your mind, yet you can’t stop descending – something is calling you, something closer to the essence of the virtual machine than yet another CRUD (or another chatbot wrapper).

TornadoVM

TLDR: Write standard Java. Run it on a GPU. No CUDA, no JNI, no native code. Just annotations and black magic.

TornadoVM, from the University of Manchester’s Beehive Lab, is an approach from another dimension. It extends OpenJDK/GraalVM with a compiler that takes standard Java code – annotated for parallelism and wired into task graphs – and compiles it just-in-time, at run time, to GPU kernels (OpenCL, NVIDIA PTX, or SPIR-V). You write matrix multiplication in Java; TornadoVM runs it on your GPU.

For AI workloads, this means: write your inference logic, your embedding similarity calculations, your attention mechanism in pure Java, and run it on heterogeneous hardware without ever touching JNI or C++. It’s “write once, accelerate everywhere” taken to its logical extreme.
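The kind of code such a compiler wants as input looks thoroughly boring, which is the point. Below is a plain, sequential matrix multiplication over flat row-major arrays. It uses no TornadoVM API at all; as I understand the project, you would mark the two outer loops with its @Parallel annotation and register the method in a task graph, and it may also expect its own array types, so treat this as the “before” picture only.

```java
final class MatMul {
    /** C = A x B for square n-by-n matrices stored as flat, row-major arrays. */
    static void multiply(float[] a, float[] b, float[] c, int n) {
        // The two outer loops are independent per (i, j) cell, which is what
        // makes them candidates for GPU parallelization. Here they run sequentially.
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                float sum = 0f;
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
    }
}
```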

Is it production-ready for all workloads? Not yet. But the trajectory is clear, and the research is genuinely impressive. When combined with the Vector API and Project Panama, TornadoVM represents a possible future where Java doesn’t just call AI – it is the AI runtime.

GPULlama3.java

TLDR: LLM inference on a GPU, written in pure Java. The TornadoVM team decided not to wait for the OpenJDK committees.

While Project Babylon and HAT are being designed in OpenJDK committee meetings (more on those in a moment), the TornadoVM team shipped GPULlama3.java – the first Java-native implementation of Llama3 with automatic GPU acceleration. Built on top of Alfonso Peterssen’s original Llama3.java (a brilliant CPU-only implementation) and TornadoVM’s GPU compilation, it proves the concept: you can run LLM inference on a GPU from pure Java.

It’s a proof of concept, not a production engine. But it’s the kind of proof of concept that makes JVM architects sit up straight and think “wait, maybe we don’t need that Python sidecar after all.” The performance gap with C++ is still real, but it’s narrowing with every JDK release.

Project Panama FFM for Native AI Libraries

TLDR: The new JNI. Call CUDA, cuDNN, and ONNX Runtime C APIs directly from Java without writing a single line of C.

The old evil was JNI (Java Native Interface) – slow, unsafe, and requiring C glue code that made everyone miserable. Project Panama’s Foreign Function & Memory (FFM) API, finalized in JDK 22 (JEP 454), is its designated successor: JNI still ships with the JDK, but new native bindings no longer need it.

For AI specifically, FFM is critical: it lets you allocate off-heap memory (MemorySegment, Arena) for huge tensors and model weights that can’t (and shouldn’t) live on the GC-managed heap. It lets you call cuda.h functions directly from Java via jextract-generated bindings. It lets you share memory buffers between Java and native libraries without copying.

ONNX Runtime Java, DJL, and Jlama all cross the Java-to-native boundary somewhere, and FFM is the modern way to do that. It’s not a tool most AI developers “use” directly; it’s the foundation that makes everything else possible. Like garbage collection, you don’t think about it until it’s not there. And then you think about it a lot.

Vector API (JEP 508, Incubator Round 10)

TLDR: SIMD in pure Java. Tell the CPU “do eight float operations at once” and watch your embeddings fly.

AI math is 99% matrix operations. Java loops are naturally scalar. The JIT compiler tries to auto-vectorize them, but it often fails on complex patterns. The Vector API is your manual override: explicitly tell the JVM to use SIMD (Single Instruction, Multiple Data) CPU instructions for parallel numeric computation.

For AI workloads, this means: embedding distance calculations, cosine similarity, dot products, softmax – all dramatically faster. Jlama is built on top of it. DJL benefits from it. Any Java code doing numeric computation for AI gets a massive boost.
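For reference, here is what one of those kernels looks like in ordinary scalar Java: a numerically stable softmax. It deliberately uses no Vector API calls (the incubator module needs extra flags to compile); the Softmax class is just a baseline sketch showing the max reduction, the element-wise exponentiation, and the final scaling pass that a SIMD version would process several lanes at a time.

```java
final class Softmax {
    /** Numerically stable softmax: subtract the max before exponentiating. */
    static float[] softmax(float[] logits) {
        float max = Float.NEGATIVE_INFINITY;
        for (float v : logits) {
            max = Math.max(max, v); // reduction: find the largest logit
        }
        float[] out = new float[logits.length];
        float sum = 0f;
        for (int i = 0; i < logits.length; i++) {
            out[i] = (float) Math.exp(logits[i] - max); // element-wise exp
            sum += out[i];                              // reduction: running total
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum; // element-wise scale so the outputs sum to 1
        }
        return out;
    }
}
```

Subtracting the maximum first is what keeps large logits from overflowing to infinity; the result is mathematically the same distribution.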

The catch? It’s been in incubator since JDK 16, and we’re now at the tenth round (JEP 508). The API is waiting on Project Valhalla: it is meant to stay in incubation until the value-class features it depends on are available as preview features, and that keeps pushing the timeline. But the JDK 25 release brings it closer than ever, and when it finally goes GA, the performance ceiling for pure Java AI inference will jump significantly.


Level 5: The Void (The “I Read JEP Drafts For Fun” Zone)

You are beyond help now. You attend Project Babylon mailing list discussions. You know what “Code Reflection” means and why it matters for GPU compilation. Your stand-up updates sound like research papers. “What did you do yesterday?” “I was analyzing the implications of Valhalla’s Float16 value class on Vector API specialization for transformer attention head computation.” Silence in the room.

Welcome to the bottom. No light reaches here. Only TRUE Machine Learning stuff.

Project Valhalla’s Float16

TLDR: A true 16-bit float type for Java, with zero overhead. The missing primitive that AI workloads have been begging for.

AI loves float16. It’s faster, uses half the memory of float32, and most model weights don’t need full precision anyway. But Java doesn’t have a native float16 type. The current workaround? Wrap a short and pray. No type safety, no operator overloading, no Vector API specialization.
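Here is roughly what “wrap a short and pray” looks like in practice. The sketch assumes JDK 20 or newer, where Float.floatToFloat16 and Float.float16ToFloat are available; HalfWeights is a hypothetical helper of mine, and nothing stops a caller from treating those shorts as plain integers, which is precisely the missing type safety.

```java
final class HalfWeights {
    /** Packs float32 weights into 16-bit storage, halving the memory footprint. */
    static short[] pack(float[] weights) {
        short[] packed = new short[weights.length];
        for (int i = 0; i < weights.length; i++) {
            packed[i] = Float.floatToFloat16(weights[i]); // rounds to the nearest float16
        }
        return packed;
    }

    /** Widens one stored weight back to float32 for arithmetic. */
    static float get(short[] packed, int index) {
        return Float.float16ToFloat(packed[index]);
    }
}
```

Every read pays for a widening conversion and every write for a rounding step, and the compiler cannot tell a weight from a loop counter.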

Project Valhalla – Java’s decade-long quest to fix the primitive/object schism – will deliver value classes (called “primitive classes” in earlier drafts) that allow a Float16 type wrapping a short with zero performance overhead and full type safety. When this lands, the Vector API can specialize on it, Jlama gets faster, ONNX Runtime interop becomes cleaner, and every numeric AI computation in Java becomes more efficient.

The catch? It’s Valhalla. We’ve been waiting since 2014. Tonight, we (rather not) dine in Valhalla — but this time, the appetizers are closer than they’ve ever been.

Project Babylon (Code Reflection)

TLDR: Java gets the ability to inspect and transform its own code at a semantic level. The GPU compilation endgame.

Project Babylon is the new mythical OpenJDK project that could change everything for Java AI. It introduces “Code Reflection” – not the old java.lang.reflect that looks at classes and methods, but a deep, structured representation of what your Java code actually does. Method bodies, control flow, data flow – all available as an analyzable tree.

The AI implication is mind-blowing: write your model (a neural network, an attention mechanism, a loss function) in standard Java. Babylon’s Code Reflection analyzes that code and automatically translates it into optimized GPU kernels. No Python. No external model files. No JNI. Just Java → GPU.

It’s the endgame: if Valhalla gives us the right data types, and Panama gives us the hardware bridge, Babylon gives us the compiler that turns Java code into accelerator code. The Holy Trinity of Java’s AI future.

HAT (Heterogeneous Accelerator Toolkit)

TLDR: Babylon’s chariot into battle. The concrete implementation that takes Code Reflection and turns it into GPU kernels via Panama.

HAT shouldn’t be considered a separate project – it’s the chariot Babylon rides into battle. It takes Babylon’s Code Reflection output and, using Panama’s FFM API as the transport layer, compiles Java methods into OpenCL or PTX kernels that run on GPUs, FPGAs, and other accelerators.

No JNI, no ad-hoc glue: Babylon describes the computation, HAT turns it into kernels, and Panama wires the whole thing into whatever silicon you have underneath. It’s three once-separate OpenJDK efforts snapping together into a single pipeline that lets plain Java reach all the weird and wonderful hardware in your data center.

Production-ready? Not yet. But the research implementations exist, the JEPs are progressing, and the trajectory is unmistakable. The JVM is building toward a future where “Java for AI” doesn’t mean “Java calling Python” – it means “Java being the AI runtime.”

Llama3.java / Pure Java Model Implementations

TLDR: When someone says “can you implement LLM inference from scratch in Java?” and the answer is “hold my espresso.”

Alfonso Peterssen’s Llama3.java proved something important: you can implement the entire Llama3 inference stack in a single Java file. No frameworks, no dependencies, just the Vector API, some clever memory mapping, and a deep understanding of the transformer architecture. It’s educational, it’s surprisingly performant, and it spawned GPULlama3.java.
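The memory-mapping trick is worth seeing in miniature: instead of reading gigabytes of weights onto the heap, you map the file and let the operating system page it in on demand. The sketch below uses the older ByteBuffer-based mapping for brevity and is not taken from Llama3.java; MappedWeights, its little-endian float32 layout, and the writeTemp helper are assumptions of mine, and a single mapping like this is limited to 2 GB.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedWeights {
    /** Maps a file of little-endian float32 values without copying it onto the heap. */
    static FloatBuffer map(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
                    .order(ByteOrder.LITTLE_ENDIAN)
                    .asFloatBuffer();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: writes the given floats to a temporary file in the same layout. */
    static Path writeTemp(float... values) {
        try {
            ByteBuffer buf = ByteBuffer.allocate(values.length * Float.BYTES)
                    .order(ByteOrder.LITTLE_ENDIAN);
            for (float v : values) {
                buf.putFloat(v);
            }
            Path file = Files.createTempFile("weights", ".bin");
            Files.write(file, buf.array());
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        FloatBuffer weights = map(writeTemp(0.25f, -1.5f, 3f));
        System.out.println(weights.remaining() + " weights mapped, second = " + weights.get(1));
    }
}
```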

These “from scratch” implementations matter because they demonstrate that the JVM’s modern features – Vector API, Panama, value types – are sufficient to express AI workloads natively. They’re not production tools; they’re proof that the platform is ready. When Valhalla, Babylon, and HAT converge, these experiments will have been the rehearsals.


And that’s our iceberg, folks. From @Autowired ChatClient all the way down to hand-crafting GPU kernels from Java bytecode, the Gen AI tooling ecosystem in the JVM world is deeper, richer, and more exciting than the “Java can’t do AI” crowd would have you believe.

The surface is friendly and productive — you can build a production RAG application in an afternoon with Spring AI or LangChain4j, and that’s genuinely impressive. But the depths? The depths are where the JVM is quietly reinventing itself as a platform that doesn’t just call AI services, but runs AI workloads natively.

Python built the research lab. Java is building the factory. And the factory is starting to build its own machines.

Just don’t say I didn’t warn you about the pom.xml file length.

Now you’ve got enough material on Gen AI tooling in Java to keep you busy for the next month – consider it your honorary PhD on the topic! 
