A Two-Week Journey on EuroHPC Leonardo

Two weeks ago, we announced our access to the EuroHPC AI Factory Playground program. Today we want to share what we have accomplished and what we have learned.

First, we want to express our gratitude to the EuroHPC Joint Undertaking. The GPU hours on Leonardo allowed us to run experiments that would have been impossible on our own infrastructure. Beyond the compute resources, this experience introduced us to the EuroHPC ecosystem and showed us how European SMEs can participate in cutting-edge AI research. It also gave us the opportunity to contribute back to the open-source community. The improvements made during this project are now available in DKSplit v0.3.1. This is exactly what the Playground program is designed for: letting SMEs explore ideas at scale while building capabilities that benefit the broader community.

Our initial choice of the BiLSTM architecture came from earlier production tests. We needed to process hundreds of thousands of domain names daily on modest hardware, and BiLSTM offered the best balance of speed and accuracy within those constraints at the time. EuroHPC’s compute allocation gave us the freedom to explore alternatives we could not test before.

What we were looking for

Before we can analyze whether a domain poses a risk, we first need to understand what it says. That is why improving DKSplit’s accuracy matters to us. Our goal was never to replace BiLSTM with a large language model. We wanted something more specific: to borrow the LLM’s world knowledge without its tendency to improvise, and use that knowledge to enrich the training data our BiLSTM learns from. We also wanted to see how far a fine-tuned LLM could go as a direct segmentation tool, even though our main interest was in using it as a knowledge source.

What we did in the past two weeks

Over the past two weeks, we ran extensive experiments and validated many of our earlier hypotheses.

How we measure accuracy

Throughout this post, we report two accuracy numbers. Benchmark is a held-out subset of our internal training distribution (a validation set, not a standardized public benchmark). Real world is 1,000 domains randomly sampled from the Newly Registered Domains Database on April 12, 2026, published in our GitHub repo: https://github.com/ABTdomain/dksplit/tree/main/benchmark. The gap between these two numbers is what we care about most.
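
To make the metric concrete, here is a minimal sketch of how exact-match accuracy can be computed over (domain, gold segmentation) pairs. The function is illustrative and not part of DKSplit’s API:

```python
from typing import Callable, Iterable, Tuple

def exact_match_accuracy(
    pairs: Iterable[Tuple[str, str]],
    segment: Callable[[str], str],
) -> float:
    """Fraction of domains whose predicted segmentation matches the
    gold segmentation exactly; any boundary difference counts as wrong."""
    pairs = list(pairs)
    hits = sum(1 for domain, gold in pairs if segment(domain) == gold)
    return hits / len(pairs)

# A trivial baseline that never splits gets half of these right:
examples = [("bestbuy", "bestbuy"), ("cheapflights", "cheap flights")]
print(exact_match_accuracy(examples, lambda d: d))  # 0.5
```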

BiLSTM upgrades

We started by retraining our BiLSTM-CRF architecture on Leonardo, running systematic experiments across multiple configurations: different layer depths, hard negative mining, confidence filtering, and multilingual data balancing.
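
For readers unfamiliar with the architecture family, the sketch below shows a minimal character-level BiLSTM-CRF in PyTorch, using the pytorch-crf package. The hyperparameters and tag scheme are illustrative placeholders, not our production configuration:

```python
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class CharBiLSTMCRF(nn.Module):
    """Tags each character B (begins a word) or I (inside a word);
    word boundaries are recovered by splitting before every B tag."""

    def __init__(self, vocab_size, num_tags=2, emb_dim=64, hidden=128, layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=layers,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, chars, tags, mask):
        # Negative log-likelihood of the gold tag sequence under the CRF.
        emissions = self.proj(self.lstm(self.emb(chars))[0])
        return -self.crf(emissions, tags, mask=mask)

    def decode(self, chars, mask):
        # Viterbi decoding: best tag sequence per input string.
        emissions = self.proj(self.lstm(self.emb(chars))[0])
        return self.crf.decode(emissions, mask=mask)
```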

The result is DKSplit v0.3.1, now available on PyPI and GitHub.

On our real-world test set of 1,000 randomly sampled newly registered .com domains, the improvement looks modest: 85.0% versus 82.8% for the previous version. But the headline number hides qualitative gains. The new model recognizes more brand names correctly. It handles European languages better. It makes fewer embarrassing mistakes on common patterns.

Input        v0.2.x        v0.3.1
bestbuy      best buy      bestbuy
databricks   data bricks   databricks
instacart    insta cart    instacart
mailchimp    mail chimp    mailchimp
robinhood    robin hood    robinhood

These are exactly the cases where world knowledge matters. The BiLSTM learned these patterns from our training data, which was generated using the strongest available LLMs. In a sense, the LLM knowledge is already distilled into the small model.

While we saw clear improvements, obvious errors remained in certain scenarios. Brand names, multilingual phrases, and emerging terms still caused problems. So we moved forward with our plan to experiment with LLM-based approaches directly.


LLM experiments

The more ambitious part of our work involved fine-tuning Qwen3.5 9B for domain segmentation using LoRA.

Our first experiment used rank-64 LoRA with 95,000 training samples. The results were instructive but disappointing. On our benchmark, the model achieved 85.2% accuracy, comparable to our BiLSTM. But on real-world newly registered domains, it dropped to 82.8%.
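
For reference, a rank-64 adapter setup with Hugging Face peft looks roughly like the following sketch. The checkpoint id and target modules below are placeholders, not our exact configuration:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Placeholder checkpoint id; substitute the actual Qwen3.5 9B checkpoint.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-9B")

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=64,            # rank of the low-rank update (128 in our second run)
    lora_alpha=128,  # scaling factor; a common default is 2 * r
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction of the base weights train
```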

This gap between validation and real-world performance is expected: the validation set shares structure with the training data, while the real-world set contains daily drift the model has never seen. What we care about is how wide that gap is.

We scaled up to rank-128 LoRA with 5 million samples. Benchmark accuracy reached 90.1%, but real-world accuracy stayed at 85.0%, tied with BiLSTM.

Here is a summary of our experimental results:

Model                            Benchmark Accuracy   Real World Accuracy
DKSplit v0.3.1                   87.6%                85.0%
Qwen3.5 9B LoRA r128 (5M data)   90.1%                85.0%
Qwen3.5 9B LoRA r64 (95K data)   85.2%                82.8%
Gemma 4 31B zero-shot            72.8%                72.8%
Qwen3.5 9B zero-shot             58.1%                58.2%

Note: For zero-shot models, benchmark and real-world accuracy are nearly identical (58.1% vs 58.2% for Qwen3.5 9B, 72.8% vs 72.8% for Gemma 4 31B). This is expected: zero-shot models have not been trained on either set, so both distributions are equally unseen. The gap between benchmark and real-world accuracy only appears for fine-tuned models, where the benchmark shares more structure with the training distribution.

Two failure modes explain why the fine-tuned models did not pull ahead on real-world data.

Over-segmentation. Both r64 and r128 tend to split strings into too many pieces. In our real-world test set, carlitad becomes carl it ad. The models have learned that more splits are generally better, but lack the confidence to keep unfamiliar strings intact. This contrasts with BiLSTM, which defaults to preserving unknown patterns as whole tokens.

Input corruption. When given a misspelled word like rennaisance, a generative model may silently rewrite it, landing on the standard spelling or yet another misspelling. LoRA fine-tuning teaches the model where to split, but does not stop the underlying generative process from altering characters. For segmenting user-registered strings, any character change is wrong. These are architectural tendencies of how generative LLMs process text, not bugs that more data or better hyperparameters will fix.
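
Because any character change is wrong by definition, corrupted outputs are cheap to detect: stripping the separators from the model’s answer must reproduce the input exactly. A minimal filter along these lines (illustrative, not our exact pipeline code):

```python
def is_faithful_segmentation(domain: str, segmented: str, sep: str = " ") -> bool:
    """Accept a segmentation only if removing the separators reproduces
    the original string character for character."""
    return segmented.replace(sep, "") == domain

assert is_faithful_segmentation("rennaisance", "rennaisance")      # kept intact: pass
assert not is_faithful_segmentation("rennaisance", "renaissance")  # silently "fixed": reject
assert is_faithful_segmentation("carlitad", "carl it ad")          # over-segmented but faithful: passes
```

Note that this catches corruption only; over-segmented but character-faithful outputs slip through and need separate handling.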


What we learned

The experiments confirmed what we suspected: for pure segmentation, a specialized BiLSTM is hard to beat. But LLMs excel at understanding meaning, even when they fail at exact string splitting. “Where are the word boundaries?” and “What does this domain represent?” are different questions requiring different tools.

In our daily operations, with 200,000+ newly registered domains and more than 800,000 domain changes every day, every percentage point of accuracy improvement matters. The path forward is not replacement but combination: specialized models at each stage of the pipeline.


What comes next

We will continue to follow our plan while adapting as the AI landscape evolves.

During our experiments, Google released Gemma 4 31B. Without any fine-tuning, it achieved 72.8% accuracy on our benchmark, compared to 58.1% for Qwen3.5 9B. The jump suggests that larger models carry more world knowledge relevant to our task.

This does not mean a fine-tuned Gemma 4 31B would suddenly solve the architectural issues we observed. The same generative tendencies that caused over-segmentation and input corruption in Qwen3.5 9B are likely to appear in any generative model of any size. But a larger, more knowledgeable model may be a better teacher for our BiLSTM. If Gemma 4 31B recognizes more brand names, geographic terms, and multilingual patterns than Qwen3.5 9B, it can produce higher-quality training labels, provided we apply strict character-level alignment filters (like the one sketched above) to reject any corrupted outputs before feeding them to the BiLSTM. This is the direction we plan to explore next.

Segmentation alone is not enough to understand what a domain represents. Knowing that appleiphone splits into apple iphone tells us the words, but not what they mean together. We are also exploring multi-dimensional labeling: segmentation plus language detection plus category classification. For example:

appleiphone → apple iphone | English | Brand + Product
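
One way to carry such labels through a pipeline is a small structured record per domain. A sketch with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class DomainLabel:
    domain: str
    segmentation: list[str]  # words after splitting
    language: str            # e.g. an ISO 639-1 code
    categories: list[str]    # e.g. ["Brand", "Product"]

label = DomainLabel(
    domain="appleiphone",
    segmentation=["apple", "iphone"],
    language="en",
    categories=["Brand", "Product"],
)
```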

We will also test different prompting strategies on the same models, to see how much of the behavior we observed comes from the model itself and how much from how we asked it.
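
As an illustration, the same request can be phrased as a bare instruction or with explicit constraints against rewriting the input. The wording below is illustrative, not a tested prompt:

```python
domain = "rennaisance"

# Variant 1: a bare instruction.
prompt_plain = f"Split this domain name into words: {domain}"

# Variant 2: explicit constraints against altering characters.
prompt_constrained = (
    "Insert spaces at the word boundaries in the following string. "
    "Copy every character exactly as given; do not correct spelling "
    "or change any character.\n"
    f"String: {domain}"
)
```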

Finally, we plan to run a second round of experiments with modern pre-trained encoder architectures such as DeBERTa-V3 and XLM-RoBERTa with a CRF head. A proper comparison against pre-trained encoder models is the natural next step, and one we did not have time to include in our two weeks on Leonardo.
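
A minimal starting point for that comparison, assuming Hugging Face transformers and treating segmentation as boundary tagging with the same B/I scheme as the BiLSTM sketch above; a CRF head would sit on top of the emission scores:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Two candidate encoder backbones for boundary tagging. A CRF layer can be
# added on top of the per-token scores, as in the BiLSTM-CRF sketch above.
for checkpoint in ("microsoft/deberta-v3-base", "xlm-roberta-base"):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint,
        num_labels=2,  # B = begins a word, I = inside a word
    )
```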


Acknowledgements

This work uses Qwen3.5 9B (Alibaba Cloud, Apache 2.0) and Gemma 4 31B (Google DeepMind, Apache 2.0). DKSplit v0.3.1 is available now on PyPI and GitHub: https://github.com/ABTdomain/dksplit


Update, April 27, 2026. This post is part of an ongoing series on our EuroHPC Leonardo work. The next update covers a cleaner benchmark, our first DeBERTa-V3 run, and how different segmenters actually fail: DKSplit Update: Cleaner Benchmark, First DeBERTa Run, Different Failure Modes.

We acknowledge the European High Performance Computing Joint Undertaking (EuroHPC JU) for awarding this project access to the Leonardo supercomputer, hosted by CINECA in Italy.

Co-funded by the European Union. Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European High Performance Computing Joint Undertaking.