Two weeks ago, we announced our access to the EuroHPC AI Factory Playground program. Today we want to share what we have accomplished and what we have learned.
First, we want to express our gratitude to the EuroHPC Joint Undertaking. The GPU hours on Leonardo allowed us to run experiments that would have been impossible on our own infrastructure. Beyond the compute resources, this experience introduced us to the EuroHPC ecosystem and showed us how European SMEs can participate in cutting-edge AI research. It also gave us the opportunity to contribute back to the open source community. The improvements made during this project are now available in DKSplit v0.3.1. This is exactly what the Playground program is designed for: letting SMEs explore ideas at scale while building capabilities that benefit the broader community.
What we were looking for
DKSplit serves as a foundational component in our brand threat detection pipeline. Before we can analyze whether a domain poses a risk, we first need to understand what it says. Accurate segmentation is the prerequisite for everything that follows. That is why improving DKSplit’s accuracy matters to us. Our goal was never to replace BiLSTM with a large language model. We wanted something more specific: to borrow the LLM’s world knowledge without borrowing its tendency to “think.”
What do we mean by “thinking”? When given “roguerennaisancefair,” an LLM might output “rogue renaissance fair,” correcting the spelling of “renaissance” even though the input was “rennaisance.” That is helpful in a chatbot. It is a defect in a segmentation tool. We need exact parsing, not helpful corrections.
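This requirement can be stated as a one-line invariant: stripping the separators from the output must reproduce the input exactly. A minimal sketch of that check (function name is ours, not part of DKSplit):

```python
# Sanity check: a segmenter must preserve the input exactly.
# Removing the separators from the output must reproduce the
# original string; any "helpful" spelling fix fails this test.
def preserves_input(raw: str, segmented: str) -> bool:
    return segmented.replace(" ", "") == raw

# The desired behavior keeps the misspelling intact:
assert preserves_input("roguerennaisancefair", "rogue rennaisance fair")

# The LLM's corrected output fails, because "renaissance" != "rennaisance":
assert not preserves_input("roguerennaisancefair", "rogue renaissance fair")
```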
The ideal model would combine BiLSTM precision with LLM knowledge, but without the LLM’s tendency to improvise. In other words, we were looking for a way to extract the LLM’s world knowledge and freeze it into stable outputs. We wanted to use this knowledge to push our BiLSTM beyond its current limits, not to replace the architecture, but to enrich the training data it learns from.
What we did in the past two weeks
Over the past two weeks, we ran extensive experiments and validated many of our earlier hypotheses.
BiLSTM upgrades
We started by retraining our BiLSTM-CRF architecture on Leonardo. This was not just about using faster hardware. We ran systematic experiments across multiple configurations: different layer depths, hard negative mining, confidence filtering, and multilingual data balancing.
The result is DKSplit v0.3.1, now available on PyPI and GitHub.
On our benchmark of 1,000 randomly sampled newly registered .com domains, the improvement looks modest: 85.0% versus 82.8% in the previous version. But the numbers hide qualitative gains. The new model recognizes more brand names correctly. It handles European languages better. It makes fewer embarrassing mistakes on common patterns.
| Input | v0.2.x | v0.3.1 |
| --- | --- | --- |
| bestbuy | Best buy | Bestbuy |
| databricks | data bricks | databricks |
| instacart | insta cart | instacart |
| mailchimp | mail chimp | mailchimp |
| robinhood | robin hood | robinhood |
These are exactly the cases where world knowledge matters. The BiLSTM learned these patterns from our training data, which was generated using state-of-the-art LLMs. In a sense, the LLM knowledge is already distilled into the small model.
While we saw clear improvements, obvious errors remained in certain scenarios: brand names, multilingual phrases, and emerging terms still caused problems. So we moved forward with our plan to experiment with LLM-based approaches directly.
LLM experiments
The more ambitious part of our work involved fine-tuning Qwen3.5 9B for domain segmentation using LoRA.
Our first experiment used a rank-64 LoRA adapter with 95,000 training samples. The results were instructive but disappointing. On our benchmark, the model achieved 85.2% accuracy, comparable to our BiLSTM. But on real-world newly registered domains, it dropped to 82.8%.
The gap between benchmark and real-world performance tells us the model had not truly learned domain segmentation. It had memorized patterns from the training set without generalizing. More importantly, it had not achieved what we actually wanted: reliable overfitting on world knowledge.
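Both numbers come from plain exact-match scoring: a prediction counts only if it equals the gold segmentation exactly. A minimal sketch of that metric (the sample data below is illustrative, not our benchmark):

```python
# Exact-match accuracy: a prediction counts only if it equals
# the gold segmentation character for character.
def exact_match_accuracy(pairs):
    """pairs: iterable of (predicted, gold) segmentation strings."""
    pairs = list(pairs)
    hits = sum(pred == gold for pred, gold in pairs)
    return hits / len(pairs)

# Illustrative scoring run with two misses and two hits:
sample = [
    ("data bricks", "databricks"),  # over-segmented: miss
    ("mailchimp", "mailchimp"),     # exact: hit
    ("insta cart", "instacart"),    # over-segmented: miss
    ("robinhood", "robinhood"),     # exact: hit
]
assert exact_match_accuracy(sample) == 0.5
```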
We scaled up to a rank-128 LoRA adapter with 5 million samples. The accuracy improved, but the fundamental problems remained.
Here is a summary of our experimental results:
| Model | Benchmark Accuracy | Real World Accuracy |
| --- | --- | --- |
| DKSplit v0.3.1 | 87.6% | 85.0% |
| Qwen3.5 9B LoRA r128 (5M data) | 90.1% | 85.0% |
| Qwen3.5 9B LoRA r64 (95K data) | 85.2% | 82.8% |
| Gemma 4 31B zero-shot | 72.8% | 72.8% |
| Qwen3.5 9B zero-shot | 58.1% | 58.2% |
The LoRA models matched or exceeded BiLSTM on benchmarks, but real-world accuracy tells a different story. More importantly, while the numbers look similar, the error patterns are completely different.
Over-segmentation. Both r64 and r128 tend to split strings into too many pieces: "carlitad" becomes "carl it ad," and "titascakes" becomes "tita s cakes." The models have learned that more splits are generally better but lack the confidence to keep unfamiliar strings intact. This contrasts with the BiLSTM, which defaults to preserving unknown patterns as whole tokens.
Input corruption. Both models occasionally alter the input rather than simply segmenting it. Given "roguerennaisancefair," the r128 model outputs "rogue rennaissance fair," changing "rennaisance" to a different misspelling, "rennaissance." The underlying LLM has strong priors about how words should be spelled, and LoRA fine-tuning does not fully suppress this behavior. For domain analysis, the input must be returned exactly as registered.
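The two failure modes are mechanically distinguishable, which is how we bucket errors during analysis. A minimal sketch (function and bucket names are ours, for illustration):

```python
def classify_error(raw: str, predicted: str, gold: str) -> str:
    """Bucket a segmentation result into the failure modes we observed."""
    if predicted == gold:
        return "correct"
    if predicted.replace(" ", "") != raw:
        # The model altered characters, not just boundaries.
        return "input corruption"
    if len(predicted.split()) > len(gold.split()):
        return "over-segmentation"
    return "other"

# The two examples from the text:
assert classify_error("carlitad", "carl it ad", "carlitad") == "over-segmentation"
assert classify_error("roguerennaisancefair",
                      "rogue rennaissance fair",
                      "rogue rennaisance fair") == "input corruption"
```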
These are not bugs we can fix with more data or better hyperparameters. They are architectural tendencies of how LLMs process text. The tokenizer sees characters and wants to form subwords. The language model sees misspellings and wants to correct them.
What we learned
The experiments confirmed what we suspected: for pure segmentation, a specialized BiLSTM is hard to beat.
But we also learned something valuable. LLMs struggle with exact string splitting, yet they excel at understanding meaning. “Where are the word boundaries?” and “What does this domain represent?” are different questions requiring different tools.
The path forward is not replacement but combination. We believe our current hybrid architecture with cascading validation is the right approach. What we can improve is the precision at each stage. With 200,000+ newly registered domains and more than 800,000 changes every day, every percentage point of accuracy improvement matters.
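The cascade idea can be sketched abstractly. The interfaces below are hypothetical stand-ins, not our production pipeline: a fast model answers when it is confident and its output survives the exact-parse check; anything else escalates.

```python
def segment_with_cascade(raw, fast_model, fallback_model, threshold=0.9):
    """Illustrative cascade: trust the fast model when it is confident
    and its output preserves the input; otherwise escalate."""
    tokens, confidence = fast_model(raw)
    if confidence >= threshold and "".join(tokens) == raw:
        return tokens
    return fallback_model(raw)

# Stub models standing in for the BiLSTM stage and a heavier validator:
def fast(s):
    known = {"fairfieldhomes": (["fairfield", "homes"], 0.95)}
    return known.get(s, ([s], 0.40))

def fallback(s):
    return [s]  # conservative default: keep the string whole

assert segment_with_cascade("fairfieldhomes", fast, fallback) == ["fairfield", "homes"]
assert segment_with_cascade("carlitad", fast, fallback) == ["carlitad"]
```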
What comes next
We will continue to follow our plan while adapting as the AI landscape evolves.
During our experiments, Google released Gemma 4 31B. Without any fine-tuning, it achieved 73% accuracy on our benchmark, compared to 58% for Qwen3.5 9B. The jump suggests that larger models carry meaningfully more world knowledge. A 31B model, properly fine-tuned, might achieve what we were looking for.
We are also exploring multi-dimensional labeling: segmentation plus language detection plus category classification. For example:
appleiphone → apple iphone | English | Brand + Product
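One way to picture such a label is as a small structured record. The field names and category values here are illustrative, not a fixed schema:

```python
from dataclasses import dataclass

# Hypothetical structured label combining the three dimensions.
@dataclass
class DomainLabel:
    segmentation: str
    language: str
    categories: list

label = DomainLabel(
    segmentation="apple iphone",
    language="English",
    categories=["Brand", "Product"],
)

# The segmentation dimension still obeys the exact-parse rule:
assert label.segmentation.replace(" ", "") == "appleiphone"
```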
This is where LLM capabilities truly shine. Not in splitting strings, but in understanding what those strings represent.
Acknowledgements
This work uses Qwen3.5 9B (Alibaba Cloud, Apache 2.0) and Gemma 4 31B (Google DeepMind, Apache 2.0). DKSplit is released under Apache 2.0.
DKSplit v0.3.1 is available now:
- PyPI: pypi.org/project/dksplit
- GitHub: github.com/ABTdomain/dksplit
- Hugging Face: huggingface.co/ABTdomain/dksplit
- Hugging Face Qwen3.5 9B LoRA: https://huggingface.co/ABTdomain/dksplit-qwen-lora
- Newly Registered Domains Database for real-world testing: https://domainkits.com/download/nrds


We acknowledge the European High Performance Computing Joint Undertaking (EuroHPC JU) for awarding this project access to the Leonardo supercomputer, hosted by CINECA in Italy.
Co-funded by the European Union. Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European High Performance Computing Joint Undertaking.