Greener Code: Sustainable Java Deployments with Native Builds and Knative Serverless on Kubernetes

Marius Stein and Vishal Shanbhag

Any techie who has spent a significant amount of time in Information Technology working with large enterprises has heard the term Data Center Migration. Quite likely you have been part of a migration project to the cloud. Data Center Migration projects have been around for a long time, long before the term Cloud Data Center burst onto the scene.

Migration projects mostly talk about cost optimization or cost savings; cost savings is what drives most of these projects among decision makers. There is, however, a second aspect as well: sustainability. Data centers consume a lot of electricity. So much, in fact, that the worldwide electricity consumption of data centers was estimated at about 415 TWh in 2022. For context, the total electricity consumption of the entire nation of France in the same year was about 426 TWh. And this was before the AI boom led by ChatGPT and the like. This usage is only slated to increase.

So anything we as technologists can do to reduce the energy consumption of applications is of interest. It will, of course, also help reduce the running costs of those applications.

In this article we will shed light on how three crucial technologies have evolved: Java, virtualization and containerization. We will start with a review of the early days of Java and virtualization and end with modern concepts like GraalVM native builds and running serverless applications on Kubernetes. On this journey we will show how a simple (yet enterprise-ready) Java application deployed as a plain Kubernetes Deployment can be turned into a native executable deployed as a serverless container on Kubernetes with Knative. We will fine-tune this deployment to achieve sub-second cold start times when scaling up from zero and ultimately conclude how such an architecture can lead to more sustainability.

History – The 2000s

In the early 2000s, Java was hailed as a language that allowed the creation of portable applications. One could write code once and compile it into a Java bytecode representation. This bytecode (i.e. .class files) could be packaged into a compressed file format (a .jar) and run anywhere, so long as the system had a Java Runtime Environment (JRE) available.

This meant that developers could write code and run builds on Windows or Mac, the operating systems more popular amongst end users, and the same package could still be used to run the final program on Linux or Unix, so long as the target OS had a compatible JRE available.

This, along with many other factors such as the availability of libraries and language features around object-oriented programming and memory management, allowed the language to reach large-scale adoption. Developers could now describe software data models in terminology similar to the real world through the use of object-oriented features. They could write complex programs without having to keep track of each and every byte of memory allocated to the program: so long as one followed some memory management best practices, the runtime would take care of allocation and cleanup.

While Java offered platform independence, enterprise software was still commonly deployed on physical servers following a one application – one machine approach. The benefits of this approach were strong isolation between applications running on different servers and the ability to have different OS-level dependencies per application. The obvious drawbacks were scalability and resource overprovisioning: running one application on one physical server naturally meant that all resources available on the server could only be used by the application hosted on it. Resource sharing was not simple with this approach.

Virtual Machines – JRE and Hypervisors 

But how did Java achieve platform independence? Put simply, the Java application does not interact directly with the OS. Any interaction with the OS is abstracted through the Java Virtual Machine (JVM). For all practical purposes the programmer uses the JVM as if it were an actual machine: it offers access to machine resources such as the file system, CPU and memory through the Java standard libraries. There is no need to perform any OS-specific actions; one only needs to know how to work with the Java system libraries.

Java programs are compiled into a platform-agnostic lower-level representation known as bytecode. The bytecode by itself cannot be run directly on any system. However, a JRE will spin up a JVM at runtime which knows how to interpret the bytecode instructions.

This is where class loading and Just-In-Time (JIT) compilation come into play. Not all classes needed by the application are linked into the bytecode up front; any class the application requires is only loaded and linked into memory when it is used for the very first time. On top of that, the JIT compiler translates frequently executed bytecode into optimized machine code while the program runs.

This basically means that a Java package is a loose assembly of the different pieces of bytecode required by the program, and the runtime does the stitching together and optimization on the fly.
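
To illustrate the lazy part (with a hypothetical class, not taken from the demo application): a class is only loaded, and its static initializer only runs, the first time the class is referenced.

public class LazyLoadingDemo {

    static class ReportGenerator {
        // The static initializer runs only when the class is first used.
        static {
            System.out.println("ReportGenerator loaded and initialized");
        }

        static String generate() {
            return "report";
        }
    }

    public static void main(String[] args) {
        System.out.println("Application started");       // printed first
        System.out.println(ReportGenerator.generate());  // loading and initialization happen here
    }
}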

In a time when hardware procurement was time-consuming (taking months), this portability of Java was helpful because applications could be written irrespective of the hardware they would be deployed on. So long as the target data center had spare hardware available, one could deploy the application. As a result of this portability, the decade between 2000 and 2010 saw rapid adoption of Java.

Another technology that found broad adoption in this timeframe was the hypervisor. Hypervisors build on the same concept of virtualization as Java does; they just apply it at the OS and kernel level rather than at the application level. Through virtualization, one physical server could be divided into multiple virtual ones, sharing CPU and memory resources between those virtual servers and thus allowing for a more efficient allocation of resources between workloads.

Virtualization works by simulating the hardware interface of a physical server through a software component called a hypervisor. This hypervisor provides a virtualized hardware interface for multiple virtual machines running on one physical server. Each virtual server has its own encapsulated kernel for the interaction between the operating system and the virtualized hardware. This allows for efficient resource allocation while strictly separating processes running on different virtual machines, maintaining a level of security and encapsulation similar to physical servers. Although virtualization has a positive impact on resource utilization, it still introduces overhead, since each virtual machine runs its own OS and kernel, which requires resources.

Virtualization also introduces overhead when applied at the application level. Anything running inside a JVM:

  • consumes more resources than a natively compiled executable, since the JVM itself requires a certain amount of memory;
  • has a slower startup time, since the JVM itself takes time to start;
  • runs any application feature more slowly the first time than on subsequent runs, due to class loading and JIT compilation.

To tackle this warm-up problem, frameworks like Enterprise JavaBeans (EJB) and Spring were introduced. They allowed preloading of all necessary components through the creation of “beans”: basically, any component the application was going to use often would be preloaded into memory as a runtime instance called a “bean”.

While this improved the execution time of each code path once the application had finished starting, it wasn’t without drawbacks. The main drawback was even slower application startup, since in addition to the normal JVM startup the application also required a “warm-up” phase to load its necessary resources. The runtime performance, however, is superior: every class the application needs at runtime is already warmed up and ready to use.
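
The sketch below (a hypothetical component, not from the demo application) shows the idea: a Spring singleton bean is created eagerly during startup, so its first use at request time pays no loading cost.

import org.springframework.stereotype.Component;

// Hypothetical example: a singleton bean instantiated eagerly at application startup.
@Component
public class PriceCalculator {

    public PriceCalculator() {
        // Runs during the "warm-up" phase at startup, not on the first request.
        System.out.println("PriceCalculator bean created");
    }

    public double applyVat(double netPrice) {
        return netPrice * 1.19;
    }
}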

Containers, Serverless and GraalVM native images 

In 2013, Docker popularized a technology called containerization, which has since revolutionized the way applications are built, shipped, and deployed. Unlike virtual machines, containers do not simulate an entire hardware environment. Instead, they share the host operating system’s kernel, allowing for lightweight and efficient isolation of processes. Each container packages the application along with its dependencies and libraries, ensuring that it runs consistently across different environments, from development to production. This approach eliminates the need for a full operating system within each instance, resulting in faster startup times and lower resource overhead compared to traditional virtual machines.

Containers also enable greater scalability and flexibility. With orchestration tools like Kubernetes, organizations can deploy and manage thousands of containers across clusters, dynamically scaling resources based on demand. This makes container technology ideal for microservices and cell-based architectures, where applications are broken into smaller, independently deployable components. Additionally, the portability of containers simplifies workflows, enabling developers to focus on writing code without worrying about compatibility issues across different environments.

Another modern development in software engineering is the emergence of serverless computing, a paradigm that abstracts infrastructure management and allows developers to focus solely on writing and deploying code. Although the term “serverless” can have varying definitions, we will define it here as a deployment model for workloads that satisfies the following criteria:

  1. Dynamic Scalability: A serverless workload automatically scales to accommodate demand, ensuring optimal performance without requiring manual intervention.
  2. Zero-Scaling: In periods of no demand, serverless workloads scale down to zero, eliminating idle resource costs.
  3. Abstraction of Server Management: Developers are freed from managing the underlying infrastructure, including provisioning, patching, and scaling servers.

Through Dynamic Scalability and Zero-Scaling we can improve the resource utilization of deployed applications. Applications experiencing little or no traffic can release their allocated CPU and memory resources, making them available for other workloads in the deployment environment. As traffic increases, a serverless deployment environment scales applications up horizontally to meet the increased demand, ensuring resilience in high-workload scenarios without overprovisioning.

While serverless computing was initially popularized by public cloud platforms like AWS Lambda, Azure Functions, and Google Cloud Functions, the paradigm has also made its way into on-premises and hybrid environments through open-source solutions. Tools like Knative or OpenFaaS enable serverless deployment models on Kubernetes clusters, extending the benefits of serverless to organizations already invested in container orchestration.

However, for applications built within the JIT paradigm, this means that the advantages offered by JIT are no longer relevant. In fact, in the serverless world they are counterproductive.

For one, with Docker and other containerization options, the application developer controls the full application stack, including the OS, without needing to worry about the underlying hardware. This means the portability advantage of the JVM is no longer necessary, since developers can simply choose one target OS runtime inside the container.

Secondly, Java still runs within a JVM, and therefore the time cost of JVM startup is a real thing to contend with. And if one uses frameworks like Spring to create a large number of beans in monolithic applications, the applications become too heavy to be started and stopped at will. Startup times can range from a couple of seconds to sometimes even minutes.

Refactoring such applications into several smaller microservices is well established, and Java developers around the world have been adopting microservices for a while. Despite such refactoring, sub-second startup times are still a challenge.

It is not all gloomy, though, and the Java community is up to the challenge. One of the solutions has in fact existed for a long time in older compiled languages like C and C++, which produce a “native” executable binary for the target OS platform.

Unlike Java-to-bytecode compilation, these languages compile the code into lower-level instructions for the target platform. The resulting binary contains the developer’s program, its associated libraries and the required system libraries all linked together into a single optimized executable. This process is called Ahead-of-Time (AOT) compilation: every library the program needs is known “ahead of time”, so at runtime there is nothing extra to be done (like class loading).

In essence, the solution to reducing startup and warm-up time lay in going back to history and coming up with a “Java” take on AOT. We will not dive too deeply into the inner workings of AOT, but simply mention that GraalVM offers such a solution.

For the most part, GraalVM can work with existing Java code without any modifications. It performs static code analysis to identify all dependencies used by a Java program and then generates an optimized native library or executable, including all necessary classes from the program's classpath dependencies and the required parts of the runtime. Anything that is not reachable from the Java program is simply left out of the compiled version.

A native binary generated this way can start much faster than a regular bytecode-compiled application, because there is no need to start a JVM, perform class loading and so on. All of the linkage is already present in the native binary, and it can get straight to business.

Of course, there are some limitations. GraalVM relies on static analysis, so if the program being compiled uses dynamic features such as reflection, JNI or serialization, the analysis might not pick up all relevant dependencies. In such cases, explicit configuration is required to inform GraalVM about the additional dependencies that will be used at runtime. The GraalVM documentation offers tools and support for this, but that detail is beyond the scope of this article.

Suffice it to say that with the availability of AOT, Java applications can be moved to the cloud more easily and run inside Kubernetes or even AWS Lambda runtimes.

In the remainder of this article we will demonstrate how an enterprise-ready Spring Boot Java application running on Kubernetes can be migrated to a serverless application with Knative. We will outline in which scenarios such a migration is feasible and which steps to take. Moreover, we will demonstrate common pitfalls and how to overcome them. The demo application used here can be found on GitHub: https://github.com/stein-solutions/java-knative-demo.

The Demo Application

Our demo application is a Java Spring Boot application consisting of a simple REST API allowing us to manage Product and Category entities. Although the application is stateless, meaning that it does not persist any state on a hard drive attached to the application, it initializes an H2 in-memory database with a product and a category table through JPA. Having a data layer leads to startup times that are more akin to real-world scenarios. We have used H2 here purely to demonstrate database connectivity; in real applications H2 would likely be replaced by a persistent database like MySQL or PostgreSQL.
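
The relevant parts of the application look roughly like the following sketch (class, package and endpoint names are simplified here; the actual code is in the GitHub repository linked above):

import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.List;

// JPA entity backing the "product" table in the in-memory H2 database
@Entity
class Product {
    @Id
    @GeneratedValue
    Long id;
    String name;
    Long categoryId;
}

// Spring Data JPA repository; the implementation is generated by Spring
interface ProductRepository extends JpaRepository<Product, Long> {}

// Simple REST endpoint for managing products
@RestController
@RequestMapping("/products")
class ProductController {

    private final ProductRepository repository;

    ProductController(ProductRepository repository) {
        this.repository = repository;
    }

    @GetMapping
    List<Product> findAll() {
        return repository.findAll();
    }

    @PostMapping
    Product create(@RequestBody Product product) {
        return repository.save(product);
    }
}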

Besides that, the application integrates with the typical cloud-native monitoring ecosystem by exposing application metrics in the Prometheus format. These can be scraped by Prometheus at regular intervals. The required Prometheus configuration is applied automatically through the Prometheus Operator for Kubernetes and its ServiceMonitor custom resource.
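
A ServiceMonitor for this setup could look roughly like the following (names, labels and the metrics path are assumptions that depend on how the application and its Kubernetes Service are configured):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: java-knative-demo
spec:
  selector:
    matchLabels:
      app: java-knative-demo   # must match the labels of the application's Service
  endpoints:
    - port: http               # named port of the Service
      path: /actuator/prometheus
      interval: 30s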

The demo application is deployed to Kubernetes as a Deployment and is accessible via an HTTP endpoint configured through a Kubernetes Ingress resource. There is no autoscaling configured for the demo application.
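
A stripped-down version of this baseline setup could look like the following sketch (names, image tag and hostname are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-knative-demo
spec:
  replicas: 1                  # no autoscaling configured
  selector:
    matchLabels:
      app: java-knative-demo
  template:
    metadata:
      labels:
        app: java-knative-demo
    spec:
      containers:
        - name: app
          image: ghcr.io/stein-solutions/java-knative-demo:latest   # placeholder tag
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: java-knative-demo
spec:
  selector:
    app: java-knative-demo
  ports:
    - name: http
      port: 8080
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: java-knative-demo
spec:
  rules:
    - host: demo.example.com   # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: java-knative-demo
                port:
                  number: 8080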

Creating a native Build on our local machine

After installing the required GraalVM dependencies, creating a native executable from a Spring Boot application is straightforward. Recent Spring Boot versions already ship a native Maven profile and the native-maven-plugin (org.graalvm.buildtools) as part of their parent POM. With the following command our application is built into a native GraalVM executable:

mvn -Pnative native:compile

The resulting executable is an ahead-of-time compiled native executable of our application that runs on our host operating system and architecture. Ideally, all uses of reflection or JNI are registered by the libraries implementing them. For libraries where this has not been done (e.g. the H2 database), there is a community project (https://github.com/oracle/graalvm-reachability-metadata) offering the required runtime hints. Where neither community nor library-provided runtime hints are available, they need to be supplied manually by the application developer. One example is the runtime hints required for code dynamically generated by Project Lombok (https://projectlombok.org/).
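
In a Spring Boot application, such manual hints can be contributed through Spring's RuntimeHints API. The following sketch registers a hypothetical class that is only reached via reflection (the class name is an assumption for illustration):

import org.springframework.aot.hint.MemberCategory;
import org.springframework.aot.hint.RuntimeHints;
import org.springframework.aot.hint.RuntimeHintsRegistrar;
import org.springframework.aot.hint.TypeReference;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.ImportRuntimeHints;

@Configuration
@ImportRuntimeHints(NativeHintsConfiguration.DemoHints.class)
class NativeHintsConfiguration {

    static class DemoHints implements RuntimeHintsRegistrar {
        @Override
        public void registerHints(RuntimeHints hints, ClassLoader classLoader) {
            // Register a (hypothetical) class that is only reached via reflection,
            // so GraalVM's static analysis does not strip it from the native image.
            hints.reflection().registerType(
                TypeReference.of("com.example.demo.ProductDto"),
                MemberCategory.INVOKE_DECLARED_CONSTRUCTORS,
                MemberCategory.INVOKE_DECLARED_METHODS,
                MemberCategory.DECLARED_FIELDS);
        }
    }
}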

Dockerized builds and execution environments

Next we dockerize the native executable. We suggest a multi-stage build: the first stage builds the native executable, and the second stage produces the final container image containing just the built executable and possibly its dependencies. For the build stage we recommend a base image provided by GraalVM, which already includes all dependencies required to create a native executable of our application. We only need to instruct the container build to create the native executable by calling the above-mentioned Maven goal. The complete build stage can be seen below:

FROM ghcr.io/graalvm/native-image-community:23 AS builder


WORKDIR /build


# Copy the source code into the image for building
COPY . /build


# Build
RUN ./mvnw --no-transfer-progress -e native:compile -Pnative


RUN chmod +x /build/target/demo

The second stage of our container build prepares the runtime environment of the native executable. If we use a base image that offers the same OS-level dependencies as the build stage base image, we only need to copy over the created executable and define an entrypoint command that tells the container runtime what to execute when starting the container. In this example we use “ubuntu:jammy” as the base image. It is also good practice to declare the ports our application listens on; this does not directly influence how the container is executed and serves purely descriptive purposes. The second stage of the Dockerfile can be seen below.

FROM ubuntu:jammy


# Copy the native executable into the container
COPY --from=builder /build/target/demo /usr/local/bin/app


EXPOSE 8080
ENTRYPOINT ["/usr/local/bin/app"]

The resulting container image is around 220 MB “heavy”.
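
Assuming the Dockerfile above sits in the project root, the image can be built and its size inspected as follows (the image name is arbitrary):

docker build -t java-knative-demo:native .
docker images java-knative-demo:native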

Optimizing away

The container image still carries a lot of OS-level dependencies through the use of ubuntu:jammy as the base image. We can further reduce the size by creating a fully statically linked executable of our application. GraalVM supports this out of the box: we just need to pass the parameters --static and --libc=musl to the build command, and the resulting executable is statically linked with musl as the C standard library implementation baked into the executable.

It is advisable to create a separate Maven profile for statically linked builds. The profile definition is shown below:

   <profiles>
       <profile>
           <id>nativelinked</id>
           <build>
               <plugins>
                   <plugin>
                       <groupId>org.graalvm.buildtools</groupId>
                       <artifactId>native-maven-plugin</artifactId>
                       <configuration>
                           <buildArgs combine.children="append">
                               <buildArg>--verbose</buildArg>
                               <buildArg>--static</buildArg>
                               <buildArg>--libc=musl</buildArg>
                           </buildArgs>
                       </configuration>
                   </plugin>
               </plugins>
           </build>
       </profile>
   </profiles>

This profile extends the native-maven-plugin configuration with the required parameters. To execute the build we run mvn -Pnative,nativelinked native:compile, which results in an executable file that contains all required dependencies and runs on one specific CPU architecture. Note that this command will fail on ARM Macs, since statically linked musl builds are only supported on Linux.

With all OS-level dependencies linked into the static executable, we can get rid of the base image for the application container and instead build the container image from scratch. The resulting Docker image contains only our executable and no other files or folders. For the embedded Tomcat inside Spring Boot to work, we need a /tmp directory; since the scratch base image does not have any tools to create directories, we simply copy over an empty directory from the build stage.

FROM ghcr.io/graalvm/native-image-community:23-muslib AS builder


WORKDIR /build


# Copy the source code into the image for building
COPY . /build


# Build
RUN ./mvnw --no-transfer-progress -e native:compile -Pnative,nativelinked


RUN chmod +x /build/target/demo
RUN mkdir /custom-tmp-dir


# The deployment Image
FROM scratch


EXPOSE 8080


# Copy the native executable into the container
COPY --from=builder /build/target/demo /usr/local/bin/app
# Spring embedded Tomcat fails to start if /tmp is not present
COPY --from=builder /custom-tmp-dir /tmp


ENTRYPOINT ["/usr/local/bin/app"]

The resulting container image is around 90 MB “light”.

Deploying on Kubernetes with Knative

Knative is an open-source solution for Kubernetes that allows the deployment of serverless applications that can scale to zero. To deploy an application onto a Knative-enabled Kubernetes cluster, the following YAML configuration can be used:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: test-java-knative-demo
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/window: "6s"  # Set a custom stable window
    spec:
      automountServiceAccountToken: false
      containers:
        - image: "ghcr.io/stein-solutions/java-knative-demo:nativelinked-f5b208448747baf69e9afe06f0cac1d0b86a265e" 
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          resources:
            limits:
              cpu: 125m
              memory: 256Mi
            requests:
              cpu: 125m
              memory: 256Mi

This creates a Knative Service that scales dynamically based on the number of HTTP requests sent to the application. If there is no HTTP traffic directed at our application, Knative scales it down to zero replicas, freeing all CPU and memory requests from the environment. In our configuration, the window after which Knative scales down if no requests are received is 6 seconds (the minimum possible value); in production environments it is probably reasonable to choose a longer window to avoid unnecessary cold starts.
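
Assuming the manifest above is saved as knative-service.yaml, it can be applied and the scale-to-zero behaviour observed like this (the service URL printed by Knative differs per cluster):

kubectl apply -f knative-service.yaml
kubectl get ksvc test-java-knative-demo   # shows the URL assigned by Knative
kubectl get pods --watch                  # pods appear on incoming traffic and disappear after the idle window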

Cold starts – the bane of serverless and how to solve it

For serverless applications to be feasible, reasonable cold start times are a necessity. When operating user-facing APIs, response times of many tens of seconds are unacceptable. This is where native builds come into play.

“Normal” (read: non-native) Spring Boot applications have quite slow startup times. Running our demo application on an M2 Pro MacBook, startup takes around 1.6 seconds; in more resource-constrained environments, however, startup easily exceeds 10 seconds. When restricting the available CPU for our containerized Spring Boot application to 200 millicores (20% of one CPU core), the application needs around 30 seconds until it is fully started with Knative. Even when allowing one full CPU it still takes around 10 seconds, and when not restricting CPU usage at all (on an Azure Standard_D2s_v3 worker node with 2 CPUs available), we achieve startup times of around 5 seconds.

The problem with assigning one full CPU or more to our demo application is that those resources are blocked for the entire lifecycle of the application, not only during startup when they are actually needed. After startup, only around 200 millicores or less are needed to answer HTTP requests, and during idle time there is almost no need to reserve resources for our application at all. Yet our application still blocks 100% of the requested CPU core.

To address the issue of resource overprovisioning, Kubernetes provides the ability to configure resource requests and limits for containers. Resource requests define the minimum guaranteed CPU and memory that a container will have access to, ensuring stability and performance. Limits, on the other hand, cap the maximum resources a container can use, preventing it from consuming more than its allocated share and affecting other workloads.

By setting a lower CPU request (e.g. 200 millicores) and a higher CPU limit (e.g. one full core) for our Spring Boot application, we can strike a balance. This ensures the application gets sufficient resources during startup while reducing idle resource usage: it can briefly utilize additional resources when needed, without permanently blocking them for other workloads. The main drawback is that the application is not guaranteed the full CPU core; if not enough CPU time is available, the application will get less (but at least 200 millicores). Moreover, during times with no requests, the 200 millicores remain reserved for our application and unavailable to others.
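
Applied to the Knative Service shown earlier, such an asymmetric configuration replaces the resources section of the container spec and would look roughly like this (the exact values are illustrative):

          resources:
            requests:
              cpu: 200m        # guaranteed baseline, enough to serve traffic after startup
              memory: 256Mi
            limits:
              cpu: "1"         # allows bursting up to a full core during startup
              memory: 256Mi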

By optimizing resource allocation with Kubernetes and dynamically adjusting CPU requests and limits, we can reduce the need for overprovisioning, thereby limiting the number of CPUs required in the cluster. This approach supports a more sustainable IT infrastructure by minimizing the environmental impact of maintaining excess hardware.

We ran tests with our demo application deployed through Knative on an Azure Kubernetes cluster, using Azure Standard_D2s_v3 instances (2 cores, 6 GB of memory) as worker nodes. The table below shows metrics for the request duration and app start times of cold starts. For each configuration, defined by the CPU and memory allocated, we measured 80 cold starts. In all measurements, the container image of our application was already present on the respective worker node. Since not only the application needs to start but Knative also needs to route the request to the started pod, the full request takes longer than the app start alone.

Config | Avg. Req. Duration | Avg. App Start | Median Req. Dur. | Median App Start | 2 sigma Req. Dur. | 2 sigma App Start
50m / 128Mi | 3.989 s | 2.765 s | 2.961 s | 2.758 s | 3.291 s | 2.908 s
75m / 128Mi | 2.013 s | 1.887 s | 2.005 s | 1.877 s | 2.170 s | 1.993 s
125m / 256Mi | 1.177 s | 1.021 s | 1.163 s | 1.019 s | 1.32 s | 1.115 s
250m / 256Mi | 1.002 s | 808.958 ms | 1.006 s | 798.205 ms | 1.079 s | 924.783 ms
750m / 512Mi | 988.207 ms | 575.536 ms | 1.003 s | 554.031 ms | 1.085 s | 692.777 ms
1000m / 512Mi | 990.191 ms | 530.500 ms | 1.005 s | 524.964 ms | 1.060 s | 594.912 ms

When giving one full CPU core to the application, a cold start takes 990 ms on average, although it only takes around 530 ms for our application to start and be ready to serve traffic. During the remaining time the request is held by the Knative activator component before being redirected to the actual application pod.

One remarkable finding is that we can reduce the CPU and memory allocation to about 125 millicores (one eighth of a CPU) and 256 MiB of RAM before the request duration starts to degrade noticeably. With this configuration we still achieve an average request duration of 1.177 seconds, only around 180 ms slower than with one full core. The app start time, however, is significantly higher: the app takes 1.021 s on average to start, almost 500 ms more than with a full core. This gives us as application developers some wiggle room: an app start of around one second leads to similar request durations as an app start of about 500 ms.

In the configuration with 75 millicores, requests take significantly more time. This increase in request duration is mainly driven by an increase in app start time; only an overhead of about 100 ms is introduced by Knative request routing.

Limitations

The architecture behind Knative only allows autoscaling based on HTTP requests, not on raw TCP connections. Because of this, WebSocket connections can be established but are not considered in Knative's scaling decisions.

Moreover, for cold start times to be reasonable, the application needs to be stateless. Although possible, mounting persistent volumes is not recommended: the mounting process usually takes time (at least a couple of seconds), which leads to unacceptable cold start times. Even just mounting the Kubernetes service account token into the application increases the average request time of a cold start by about 200 ms, which is why the example above sets automountServiceAccountToken to false.

From a sustainability perspective, multiple applications need to be deployed onto the infrastructure before a reasonable improvement materializes. Running Knative itself on a Kubernetes cluster requires at least 6 CPUs and 6 Gi of memory according to the official documentation. So for this setup to pay off, we need to deploy a critical mass of Knative services onto our cluster.

Additionally, the load pattern handled by our application should be volatile, so that longer periods without requests actually exist. Knative shines for applications that are requested rarely; the more constant the load, the fewer resources can be saved through this approach.

Outlook

In this article we have shown that it is possible to run stateless, enterprise-ready Java applications as serverless containers on top of Kubernetes. The cold starts of those containers are comparable to other serverless technologies like AWS Lambda. This approach can lead to a more sustainable IT system in certain load scenarios: applications that are invoked only rarely but require swift response times benefit most from this architecture.

However, as with everything in IT, this architecture is not a silver bullet for all application and architecture needs.

References:

https://www.linkedin.com/pulse/managing-forecasting-your-cloud-consumption-prakash-a/

https://www.cesarsotovalero.net/blog/aot-vs-jit-compilation-in-java.html

https://www.statista.com/chart/32689/estimated-electricity-consumption-of-data-centers-compared-to-selected-countries/

https://www.graalvm.org/jdk21/reference-manual/native-image/dynamic-features/Reflection/
