DKSplit on EuroHPC: Final Notes

Ten weeks have passed since we received our Playground allocation, and our GPU budget is almost spent: fewer than 10 hours remain. We are grateful to the EuroHPC AI Factory Playground for granting our open-source project access. Without that support, we could not have explored this many directions in parallel or arrived at the conclusions presented in this series. This is the closing post of our EuroHPC Leonardo series.

What We Brought

When we entered the Playground, we had a foundation to build on:

A trained model. Our in-house segmenter, a BiLSTM-CRF, serves as the foundation of our data pipeline, processing hundreds of thousands of domains daily.
A prepared dataset. Our training data had been tuned through multiple iterations on earlier model versions.
Early experiments. We had run small-scale trials with BiLSTM, BERT-based models, and Qwen models, enough to know which directions were worth pursuing but not enough to draw conclusions.
A clear question. How can we integrate the pretrained world knowledge of large language models into our domain analysis workflow, where BiLSTM struggles with brand names, multilingual compounds, and coined words?

What We Explored, and What We Learned

For segmentation, is there a better model than BiLSTM?

No. BiLSTM-CRF should remain the foundation of the pipeline. We trained and evaluated models across six architecture families (BiLSTM-CRF, DeBERTa, CharBERT, ByT5-CRF, Qwen, Gemma), analyzing each from both an engineering and an accuracy perspective.

Other models, notably ByT5-CRF, showed value as complementary components in a hybrid architecture, but none replaced BiLSTM as the core segmenter. We also found that splitting on digits and hyphens before the model sees the input produced better results than letting any model decide those boundaries itself. Not every problem needs a learned solution.
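The rule-based pre-split can be sketched in a few lines. This is a minimal illustration, not the DKSplit implementation: the names presplit, segment, and model_segment are hypothetical, and dropping hyphens from the output (rather than keeping them as tokens) is an assumption.

```python
import re
from typing import Callable, List

# Digit runs, or runs of characters that are neither digits nor hyphens.
# Hyphens match neither alternative, so they act as separators and are
# dropped from the output (an assumption for this sketch).
_CHUNK = re.compile(r"[0-9]+|[^0-9-]+")


def presplit(label: str) -> List[str]:
    """Cut a domain label at digit and hyphen boundaries using rules only."""
    return _CHUNK.findall(label)


def segment(label: str, model_segment: Callable[[str], List[str]]) -> List[str]:
    """Pass only the non-digit chunks to a learned segmenter.

    model_segment is a hypothetical stand-in for any learned segmenter.
    """
    words: List[str] = []
    for chunk in presplit(label):
        if chunk[0] in "0123456789":
            words.append(chunk)  # digit runs are kept whole
        else:
            words.extend(model_segment(chunk))
    return words
```

With an identity stand-in for the model, segment("web3-tools24", lambda s: [s]) returns ['web', '3', 'tools', '24']. The learned model never sees a digit or a hyphen, so it cannot place those boundaries incorrectly.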

Can we access the pretrained knowledge inside LLMs?

We fine-tuned Qwen3.5-9B with LoRA at multiple scales, began a full fine-tune of the same model, and trained Gemma 4 31B. Asking LLMs to segment directly did not work. The models showed some of the world knowledge we were looking for, but generation still corrupted the output. Fine-tuning reduced the problem, but in our tests, the same error categories remained. The accuracy was not high enough for LLMs to serve as a teacher for DKSplit. However, when we presented LLMs with a high-coverage candidate set and asked them to reason over the candidates and select, their pretrained knowledge became useful for the task.
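One way to picture the selection setup is below. It is a sketch under stated assumptions: the prompt wording, the "Answer: <number>" reply format, and the function names are illustrative and not taken from our training pipeline. The point it shows is structural: the model returns an index, so the final segmentation is always copied from the candidate set and generation cannot alter its characters.

```python
import re
from typing import List, Optional

Segmentation = List[str]


def build_selection_prompt(domain: str, candidates: List[Segmentation]) -> str:
    """Render a numbered candidate list for the model to choose from."""
    lines = [f"Domain: {domain}", "Candidates:"]
    lines += [f"{i}. {' '.join(c)}" for i, c in enumerate(candidates, start=1)]
    lines.append("Reason briefly, then finish with a line of the form 'Answer: <number>'.")
    return "\n".join(lines)


def parse_choice(reply: str, candidates: List[Segmentation]) -> Optional[Segmentation]:
    """Map the model's reply back to a candidate; None if the reply is unusable."""
    match = re.search(r"Answer:\s*(\d+)", reply)
    if match is None:
        return None
    index = int(match.group(1))
    if 1 <= index <= len(candidates):
        return candidates[index - 1]
    return None
```

A reply that names no valid candidate yields None, which lets the caller fall back to the base segmenter instead of trusting free-form output.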

Which model, and at what size?

We tested Qwen3.5 and Qwen3.6 models from 0.8B to 122B parameters on the selection task. Comparing how they responded to structured prompting, we found that on this simple task smaller models benefit from chain-of-thought rules, while larger models are hurt by them.

Through CoT distillation, we trained both a 2B and a 4B model to follow the reasoning path of large models. After training, both reached accuracy comparable to models several times their size in our tests. Because the 2B model is cheaper to run, it is the more practical deployment candidate for this setup. The methodology held across different candidate sets and prompting strategies.

The optimal model size shifts with task complexity. When we ran a preliminary scoring task, asking models to rate each candidate 0–100 instead of simply picking one, the 2B, 4B, and 9B models showed a different pattern. For this kind of complex evaluation, we estimate the sweet spot moves to the 14B–35B range.

How do we validate the results?

Domain segmentation is a narrow task, but its ground truth is not obvious. Even splitting every character individually is not necessarily wrong. What we aim for is convergence: narrowing a divergent set of possible segmentations to the one that best reflects the registrant’s intent.

We expanded our benchmark from 1,000 to 5,000 domains, scored by multiple large models, cross-checked against each other, and audited by hand where they disagreed. The accuracy numbers in this series represent how closely a model’s output matches this converged ground truth. With a more lenient definition of correctness, the numbers would be higher.
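The convergence and scoring steps can be sketched as follows. The data layout (one dictionary per scorer, mapping a domain to a space-joined segmentation), the unanimity rule, and the strict exact-match metric are assumptions made for illustration; the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def converge(labels_by_scorer: Dict[str, Dict[str, str]]) -> Tuple[Dict[str, str], List[str]]:
    """Split domains into those all scorers agree on and those needing a manual audit.

    Assumes every scorer labeled the same set of domains.
    """
    scorers = list(labels_by_scorer.values())
    agreed: Dict[str, str] = {}
    disputed: List[str] = []
    for domain in scorers[0]:
        votes = {labels.get(domain) for labels in scorers}
        if len(votes) == 1:
            agreed[domain] = votes.pop()
        else:
            disputed.append(domain)  # resolved by hand in this sketch
    return agreed, disputed


def exact_match_accuracy(predictions: Dict[str, str], ground_truth: Dict[str, str]) -> float:
    """Share of benchmark domains whose predicted segmentation equals the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth is empty")
    hits = sum(predictions.get(d) == gold for d, gold in ground_truth.items())
    return hits / len(ground_truth)
```

Under exact match, a prediction that differs from the ground truth by a single boundary counts as wrong, which is one reason a more lenient definition would produce higher numbers.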

What We Produced

DKSplit v0.3.1 — production segmenter, upgraded on Leonardo (PyPI, GitHub, Hugging Face)
benchmark_5000 — public benchmark with ground truth
Qwen3.5-9B LoRA weights — fine-tuned segmentation model (Hugging Face)
Seven technical posts documenting the full experimental path, linked below

Along the way, we trained and evaluated models across six architecture families and a wide range of sizes; the full path is documented in the posts below.

Beyond the Experiments

As a small company, our central challenge is how to integrate AI efficiently into both our customers’ and our own workflows.

Before entering the HPC program, we had already explored how large models can improve daily productivity. Our approach differs from most AI companies: if a user already has access to a large model, we should turn it into a specialized domain analysis agent through data streams and structured harnesses, rather than routing queries through low-tier APIs. This is what DomainKits MCP does.

For our own daily domain analysis, integrating AI to improve efficiency has been an ongoing effort. We routinely analyze registration patterns across 38 million .com domains, tracking trends like the recent surge in AI- and agent-related registrations since early 2026. It is this kind of scale that makes cost control a persistent challenge.

LLM API prices have dropped considerably, but from a data security and processing speed perspective, running task-specific models locally is the better path. Through this EuroHPC journey, we have come to believe that on a sufficiently simple task, providing well-crafted candidate answers and letting a trained small model select and reason over them is both highly efficient and low-cost.

What Comes Next

While generating CoT training data, we noticed that the teacher models did more than just select the best candidate. They explained why. For digitalpflegezentrum, the reasoning stated: “Pflegezentrum is a common, meaningful German compound noun; splitting it is less natural.” The trained 2B model reproduced this kind of reasoning for words it had never seen in training.

There is more useful information inside these models than segmentation alone can extract. If we adjust the training angle, we may be able to identify not just how to split a domain, but why it was registered: the language, the brand, the industry.

Two directions follow:

Optimize the current selector: refine the training data, test at fairer data-to-parameter ratios across model sizes, and integrate the gating model that routes only uncertain predictions to the selector.
Explore approaches for more complex tasks: extracting intent signals from domain strings for use in our production pipeline. With more dimensions and less clear-cut answers, the sweet-spot model size will likely shift upward, and the training data will need to grow accordingly.
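The gating idea in the first direction can be sketched as a simple router. Everything here is hypothetical: the four callables stand in for the base segmenter, the gating model, the candidate generator, and the small selector, and their signatures are assumptions made for the sketch.

```python
from typing import Callable, List, Optional

Segmentation = List[str]


def route(
    domain: str,
    base_segment: Callable[[str], Segmentation],
    is_uncertain: Callable[[str, Segmentation], bool],
    propose: Callable[[str], List[Segmentation]],
    select: Callable[[str, List[Segmentation]], Optional[Segmentation]],
) -> Segmentation:
    """Use the fast segmenter by default; call the selector only when the gate fires."""
    first_pass = base_segment(domain)
    if not is_uncertain(domain, first_pass):
        return first_pass  # cheap path, no selector call
    choice = select(domain, propose(domain))
    # Fall back to the first pass if the selector gives no usable answer.
    return choice if choice is not None else first_pass
```

The design goal is cost control: if the gate fires on only a small share of domains, the selector's per-domain cost applies to that share alone, while every other domain keeps the speed of the base segmenter.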

DKSplit on EuroHPC Series

This work uses models from the Qwen 3.5 and Qwen 3.6 families (Qwen Team, Alibaba Cloud, Apache 2.0), Gemma 4 (Google, Apache 2.0), DeBERTa-V3 (Microsoft, MIT), CharBERT (El Boukkouri et al., Apache 2.0), CANINE (Google, Apache 2.0), and ByT5 (Google Research, Apache 2.0). Training data scoring used DeepSeek V4 Flash (DeepSeek), Gemini 3.1 Flash Lite (Google), and Claude Sonnet 4.6 and Claude Opus 4.6 (Anthropic).

We acknowledge the European High Performance Computing Joint Undertaking (EuroHPC JU) for awarding this project (EHPC-AIF-2026PG01-281) access to the Leonardo supercomputer, hosted by CINECA in Italy.

Co-funded by the European Union. Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European High Performance Computing Joint Undertaking.