
Building a better Mouse Trap – web crawling with Temporal

Steve Poole

For various reasons, I’ve been looking at Temporal lately. It’s one of those open-source projects with some great users – think Netflix, Uber, etc. – yet it has less visibility in the Java community than you’d expect.

I hope to raise its profile here.

Before I talk about Temporal, I’ll cover my particular use case.  Stay with me; this needs to be detailed.

Background

A side project of mine is a Java API comparison tool that checks whether moving from version A to version B of a dependency will cause issues. I’m looking beyond simple class-to-class comparison: I want a more holistic view that finds the runtime and compile-time problems that might occur within the dependency tree when you upgrade something.

To test this tool, I need binaries – lots and lots of binaries. If I know the coordinates of a dependency, I can get them from Maven Central, but as you can imagine, it’s all manual work – discovering dependencies, versions, etc., is laborious. So, like all developers, I wrote a tool.

Scanning a Java Repo 

There are many components on Maven Central – MvnRepository says 14,904,773 at the time of writing – and many of these haven’t been updated in a while. Published stats suggest only around 25% of open-source Java projects are actively maintained.

My tool scans Maven Central and finds all the maven-metadata.xml files to create an index. That requires some HTML parsing, some local caching, and so on. There are three killer considerations:

  • I’m always a very good citizen and never inadvertently carry out a denial-of-service attack.
  • The indexing process is optimal. I need to focus the scanning on the components that are being updated and only visit the older ones occasionally.
  • The process is restartable. The little Linux machine I run this on has regular patch updates, and my internet connection can fail. I might also lose power. Ultimately, I don’t want to start from the beginning every time I need to restart.

Current Approach

Being a good citizen is straightforward. I control the number of HTTP calls per minute and adjust the delay between each to match the cadence. For example:

Instant now = Instant.now();
long timeSinceLastRequest = Duration.between(lastRequestTime, now).toMillis();
if (timeSinceLastRequest < minimumIntervalMillis) {
    long waitTime = minimumIntervalMillis - timeSinceLastRequest;
    Thread.sleep(waitTime);
}
lastRequestTime = Instant.now();

Fortunately, the HTML from the repo includes a last-updated timestamp, but everything else is challenging. Initially, I scan the entire repo to establish the current state, then revisit components over time, starting every three months and extending to a year or more, based on release patterns and the likelihood of updates.

Remember, like all web crawlers, I’m looking to optimise for both spotting updates to existing content and finding new content.

Failure is not an option – it’s a certainty. 

How do I store the state for a system that’s going to be running continuously and that will inevitably fail? How do I code for failure? How do I create all the various tasks that run at different times and have different jobs to do?

The simple answer is that I started with a database. There are a few tables. One contains the list of components and their last-updated dates; the others contain partial paths on the repo and the date each was last updated. The first table gives me a list of actual components; the second helps with discovering broader updates, like new components being added in a new namespace.
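As a rough illustration of what those rows hold (the field names below are mine, not the actual schema):

import java.time.Instant;

// Rough illustration only; field names are illustrative, not the real schema.
record ComponentRecord(
        String groupId,            // e.g. "org.apache.commons"
        String artifactId,         // e.g. "commons-lang3"
        String latestVersion,
        Instant repoLastUpdated,   // last-updated timestamp parsed from the repo HTML
        Instant lastVisited,       // when the crawler last checked this component
        Instant nextVisitDue) {}   // when it should be checked again

record RepoPathRecord(
        String partialPath,        // e.g. "com/foo/" on the repo
        Instant lastUpdated) {}    // when this path last changed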

Now all I need is the code.  

Building and maintaining the index involves the following tasks.

Build Top Level Domain Info: Visit the repo at the top level and collect the root links – these (mostly) match top-level internet domains: com, org, dev, tv, etc. Then, for every root link found, grab the next level of directories and do the same thing. Now I have a record in my database for every two-level domain, i.e. com.foo, org.bar, etc., and when it was discovered.

This task is run at setup but afterwards takes a back seat and runs maybe once every three months.  
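The repo listing is plain HTML, so collecting those root links is mostly link extraction. Here’s a rough sketch of the idea using jsoup – this is illustrative, not the exact code the tool uses:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class TopLevelScanner {

    // Collect the directory links from a repository index page.
    // "https://repo1.maven.org/maven2/" is the public Maven Central root.
    static List<String> fetchDirectoryLinks(String url) throws IOException {
        Document page = Jsoup.connect(url).get();
        List<String> links = new ArrayList<>();
        for (Element anchor : page.select("a[href]")) {
            String href = anchor.attr("href");
            // Directory entries end with '/'; skip the "../" parent link.
            if (href.endsWith("/") && !href.equals("../")) {
                links.add(url + href);
            }
        }
        return links;
    }
}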

Expand domains to three levels: Finding entries like com.foo isn’t very useful—most of the components have longer group IDs. So, this task scans the two-level domain table (sorted by discovery date) and creates a third level of detail. 

This task runs more often, maybe once a month.  

Deep dive scan: This task takes the first 100 records from the three-level domain data and, based on the last-updated info, looks for actual components (especially new ones) by looking for links to ‘maven-metadata.xml’. While doing so, it keeps track of the location (not every component has one of these files, and there are instances where the metadata is for Maven plugins, not components). Rule of thumb: if you find a directory with a POM file listed, you’ve gone too far. The task parses the metadata file, pulls out the coordinates and the version info, and updates the database, remembering to record the last-updated date from the HTML, the timestamp of this visit, and a timestamp for the next visit.
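Parsing the metadata file itself is straightforward with the JDK’s built-in DOM parser. A rough sketch (error handling omitted, and just one way of doing it):

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class MetadataParser {

    // Parse a maven-metadata.xml and print its coordinates, versions and
    // lastUpdated value. Element names match the standard metadata layout.
    static void parse(String metadataUrl) throws Exception {
        try (InputStream in = new URL(metadataUrl).openStream()) {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);
            String groupId = text(doc, "groupId");
            String artifactId = text(doc, "artifactId");
            String lastUpdated = text(doc, "lastUpdated");
            List<String> versions = new ArrayList<>();
            NodeList nodes = doc.getElementsByTagName("version");
            for (int i = 0; i < nodes.getLength(); i++) {
                versions.add(nodes.item(i).getTextContent());
            }
            System.out.printf("%s:%s versions=%s lastUpdated=%s%n",
                    groupId, artifactId, versions, lastUpdated);
        }
    }

    private static String text(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : "";
    }
}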

This task runs interleaved with the next one. 

Component Update: This task looks at the top 1000 component records in the database by last update and assesses whether they should be revisited. The objective is to bubble up those components that release more often (or at all) and guess when they will do so next.

This task runs every other day and is interleaved with the scan task.
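That ‘bubble up’ logic is essentially a guess at the next release date. A toy version of the heuristic might look like this – the interval thresholds below are purely for illustration:

import java.time.Duration;
import java.time.Instant;

public class RevisitScheduler {

    // Guess when to look at a component again: components that released
    // recently get checked sooner; quiet ones drift towards a yearly visit.
    // The thresholds here are illustrative values only.
    static Instant nextVisit(Instant lastRelease, Instant now) {
        Duration sinceRelease = Duration.between(lastRelease, now);
        Duration interval;
        if (sinceRelease.toDays() < 90) {
            interval = Duration.ofDays(7);    // actively released: check weekly
        } else if (sinceRelease.toDays() < 365) {
            interval = Duration.ofDays(90);   // quiet: check quarterly
        } else {
            interval = Duration.ofDays(365);  // dormant: check yearly
        }
        return now.plus(interval);
    }
}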

More complicated than you thought?

The system is designed this way because of how long it takes to run and the need to recover from failures without too much wasted time or unnecessary visits to the website. Being a good citizen when visiting certainly slows it down, but even so, we’re talking about keeping an eye on millions of binaries. That means any search task will take days or weeks to run. And then we do it again and again. My design is complex because it’s a trade-off between progress and resilience. I could do a crude, depth-first web crawl and then, in six months, do it again. But that means it would take three months, on average, to spot anything new.

My challenge is getting flexible, optimal throughput in a reliable, restartable way. Although my use case is pretty niche, the problem is not. Long-running processes are difficult to get right, and don’t get me started on multi-threaded, multi-client requirements. Luckily, I don’t have those issues, but others do.

What would be nice …

One big issue with my existing approach is that although each task is recoverable, it is not updateable. By this, I mean everything can restart, mostly picking up where it left off, but I can’t change the order in which a task processes its data once it has started. So, if I want to add a new component to scan by hand, I have to restart the relevant task.

I want something more sophisticated that can schedule tasks asynchronously based on changing heuristics. When my top-level task finds a new domain entry, I want it to trigger a deeper scan that can be interleaved with other tasks rather than waiting until the task is complete. It would be nice if it were more event- and data-driven. I’ve steered clear of mixing these tasks because of concerns about keeping track of the state and the inevitable restart process.

Let’s look at Temporal. 

I asked ChatGPT about the project. It said:

“Temporal is a durable, distributed workflow orchestration system designed to help developers build reliable, scalable, and fault-tolerant applications. For Java developers, Temporal provides a powerful framework to manage stateful, long-running workflows, handling complex business logic while taking care of retries, timeouts, failures, and state persistence in a simple and developer-friendly way.”

As you can imagine, my eyes lit up. 

Temporal has more moving parts than my original setup – Workflows, Activities, Workers, and Clients – but I saw the potential once I understood how they interact.

Conceptually, Workflows define long-running business processes that orchestrate activities and other workflows. Temporal guarantees that workflows are deterministic, durable, and can survive failures or restarts.  In my case, workflows will correspond to some of my original tasks. 

Temporal needs to know what’s going on to honour these promises, and it’s through these various entities that this is achieved. One thing to grasp immediately is that Temporal doesn’t run your code: your code interacts with Temporal. Temporal orchestrates what your code does and, through its various elements, keeps track of your application’s ‘business’ state. That’s key to its reliability and recovery.

In my case, this is keeping track of essential changes on Maven Central.

Workflows 

A workflow orchestrates the flow of activities by calling them, determining their execution order, handling failures, waiting for their completion, and even scheduling them for parallel execution. Activities are just building blocks, and they don’t communicate with each other directly. The workflow ensures that the activities are properly coordinated.

I have two workflows: one for the top-level scanning of the repo looking for new components, and a second that deep-dives into a particular component looking for updates.


@WorkflowInterface
public interface RepoScannerWorkflow {

    @WorkflowMethod
    void fetchLinks(String serverUrl);

    @SignalMethod
    void addExtraLinks(List<String> extraLinks);

    @QueryMethod
    String getStatus();
}

The annotations are from Temporal and help explain what I want to happen. The fetchLinks() method is the main trigger, while the addExtraLinks() method allows me to inject new prioritised links to scan while the workflow runs. This is a big plus for my use case!

I’ll let you guess what the getStatus() method is for 🙂
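For context, here’s a rough sketch of how a client might start this workflow and later send that signal. The task queue name, workflow ID, and URLs are placeholders, not the project’s actual values:

import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import java.util.List;

public class StartScan {
    public static void main(String[] args) {
        // Connect to a local Temporal service (placeholder connection settings).
        WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
        WorkflowClient client = WorkflowClient.newInstance(service);

        RepoScannerWorkflow workflow = client.newWorkflowStub(
                RepoScannerWorkflow.class,
                WorkflowOptions.newBuilder()
                        .setTaskQueue("repo-scanner")     // placeholder task queue
                        .setWorkflowId("repo-scan-main")  // placeholder workflow id
                        .build());

        // Kick off fetchLinks() asynchronously and return immediately.
        WorkflowClient.start(workflow::fetchLinks, "https://repo1.maven.org/maven2/");

        // Later, from any client, send the signal to prioritise an extra link
        // while the scan is still running.
        RepoScannerWorkflow running = client.newWorkflowStub(
                RepoScannerWorkflow.class, "repo-scan-main");
        running.addExtraLinks(List.of("https://repo1.maven.org/maven2/org/example/"));
    }
}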

The workflow implementation has the nuts and bolts of the process. fetchLinks() looks like this.

@Override
public void fetchLinks(String serverUrl) {
    // Fetch links from the remote server
    List<String> fetchedLinks = remoteLinkServiceActivity.fetchLinks(serverUrl);
    linkQueue.addAll(fetchedLinks);

    // Process the links in the queue
    processLinkQueue();
}

What is important to note is the use of an Activity to do the heavy lifting. More details in a moment, but here’s where Temporal adds real value: in the event of a failure, Temporal ‘knows’ what this activity returned last time and can replay it, so the call to the web server isn’t even needed (<boggle/>).

You can guess what remoteLinkServiceActivity does. 
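The activity itself is just an annotated interface, plus an implementation that does the real HTTP and parsing work. A sketch of what it could look like – the timeout value in the stub is a guess, not a recommendation:

import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityMethod;
import java.util.List;

@ActivityInterface
public interface RemoteLinkServiceActivity {

    // Fetch the links from one repo page. The implementation is where the
    // rate-limited HTTP call and the HTML parsing actually happen.
    @ActivityMethod
    List<String> fetchLinks(String serverUrl);
}

And, inside the workflow implementation, the stub that the fetchLinks() call above uses:

import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import java.time.Duration;

// Create the activity stub with a timeout (the five-minute value is illustrative).
private final RemoteLinkServiceActivity remoteLinkServiceActivity =
        Workflow.newActivityStub(
                RemoteLinkServiceActivity.class,
                ActivityOptions.newBuilder()
                        .setStartToCloseTimeout(Duration.ofMinutes(5))
                        .build());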

The processLinkQueue() code below takes that data and triggers the asynchronous processing via the second workflow. See how there is an indirect call to processLink() with the provided data. Note the PARENT_CLOSE_POLICY_ABANDON value, which means the child workflows will run to completion even if the top-level scan has finished.

// Start a child workflow to process the link
LinkProcessingChildWorkflow childWorkflow = Workflow.newChildWorkflowStub(
        LinkProcessingChildWorkflow.class,
        ChildWorkflowOptions.newBuilder()
                .setParentClosePolicy(ParentClosePolicy.PARENT_CLOSE_POLICY_ABANDON)
                .build());
Async.procedure(childWorkflow::processLink, link);
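For completeness, the child workflow’s interface is equally small – a sketch based on the processLink() call above:

import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

@WorkflowInterface
public interface LinkProcessingChildWorkflow {

    // Deep-dive into one link: find maven-metadata.xml files beneath it,
    // parse them, and update the database via this workflow's own activities.
    @WorkflowMethod
    void processLink(String link);
}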

Hopefully, you begin to understand how, by weaving your code through Temporal, you get the async nature and recovery built-in!

I still have my database, but the code to populate it now gets driven through Temporal. When I trigger the initial workflow, I get a level of reliability, recovery, and flexibility that I didn’t have before.
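The remaining piece is the Worker, which is what actually polls Temporal and executes the workflow and activity code. Wiring one up is only a few lines; the implementation class names and task queue here are placeholders:

import io.temporal.client.WorkflowClient;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.worker.Worker;
import io.temporal.worker.WorkerFactory;

public class ScannerWorker {
    public static void main(String[] args) {
        WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
        WorkflowClient client = WorkflowClient.newInstance(service);

        WorkerFactory factory = WorkerFactory.newInstance(client);
        Worker worker = factory.newWorker("repo-scanner"); // same task queue as the client

        // Implementation class names are placeholders for this sketch.
        worker.registerWorkflowImplementationTypes(
                RepoScannerWorkflowImpl.class, LinkProcessingChildWorkflowImpl.class);
        worker.registerActivitiesImplementations(new RemoteLinkServiceActivityImpl());

        factory.start(); // poll the task queue and run workflow/activity code
    }
}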

Summary

I obviously can’t do justice to something as sophisticated as Temporal in a short article, but I do now understand why companies like Uber and Netflix use it.

What impressed me most about Temporal is how easily I could get started. Initially, I thought the added complexity would be a hurdle, but I’m not writing more code—it’s just written differently. Thinking in asynchronous, incremental steps made it more accessible, and the built-in UI has already helped me when I made mistakes.

Overall, I’m excited to have found Temporal and look forward to exploring it further. I’ll keep you updated.
