We are excited to announce that Lyalpha GmbH has been granted access to the EuroHPC Joint Undertaking’s AI Factory Playground program. This opportunity allows us to explore large-scale AI training on Leonardo, one of Europe’s most powerful supercomputers, hosted by CINECA in Italy.
Our allocation includes 5,000 GPU hours (1,250 node hours) on Leonardo Booster, which is equipped with NVIDIA A100 64 GB GPUs. This three-month access period gives us the chance to validate our technical approach and gain hands-on experience with distributed AI training at scale.
About Us and DKSplit
Lyalpha GmbH operates ABTdomain.com, a comprehensive domain intelligence platform that tracks domain lifecycle changes across more than 1,000 gTLDs. A core component of our infrastructure is DKSplit, an open-source word segmentation model designed specifically for domain names.
DKSplit addresses a fundamental challenge in domain analysis: splitting concatenated strings like “chatgptlogin” into meaningful words (“chatgpt” and “login”). This capability is essential for understanding domain semantics, detecting brand impersonation, and analyzing registration patterns.
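To make the task concrete, the following toy segmenter uses dynamic programming over a tiny hypothetical wordlist. It illustrates the problem, not DKSplit’s actual algorithm; the wordlist and scoring are invented for this sketch.

```python
# Minimal dictionary-based segmentation sketch (hypothetical wordlist,
# not DKSplit's algorithm): dynamic programming over split points,
# preferring splits that cover the string with fewer known words.
WORDS = {"chat", "chatgpt", "gpt", "login", "log", "in", "apple", "phone"}

def segment(s: str) -> list[str]:
    # best[i] = (cost, split) for the prefix s[:i]; cost counts segments,
    # with a heavy penalty for characters not covered by any known word.
    best = [(0, [])] + [(float("inf"), [])] * len(s)
    for i in range(1, len(s) + 1):
        for j in range(i):
            piece = s[j:i]
            cost = best[j][0] + (1 if piece in WORDS else 10 * len(piece))
            if cost < best[i][0]:
                best[i] = (cost, best[j][1] + [piece])
    return best[-1][1]

print(segment("chatgptlogin"))  # -> ['chatgpt', 'login']
```

The real difficulty, of course, is that production inputs are ambiguous and open-vocabulary, which is exactly why DKSplit moved beyond dictionary matching.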
Our current DKSplit model has already evolved through two major versions. The first version used basic dictionary matching and statistical methods. The second version, which we released as open source, introduced a BiLSTM-CRF architecture trained on over 17 million labeled samples. The training data was generated using state-of-the-art large language models, including GPT 5.1, GPT 5.2, and Gemini 3.1, with multi-stage verification to ensure high-quality annotations across diverse domain patterns. The resulting model is compact (9 MB with INT8 quantization) and achieves approximately 80% agreement with GPT-level segmentation and over 90% accuracy on real-world domain samples, while processing thousands of queries per second.
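The compact footprint owes much to INT8 quantization. As a rough sketch of the general idea (symmetric per-tensor quantization, not necessarily DKSplit’s exact scheme), each float weight is stored as one signed byte plus a shared scale factor:

```python
# Symmetric per-tensor INT8 quantization sketch (illustrative only, not
# DKSplit's exact scheme): floats are mapped to [-127, 127] via a single
# scale factor, cutting storage to one byte per weight.
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

w = [0.5, -1.27, 0.03, 1.27]
q, s = quantize_int8(w)
print(q)  # -> [50, -127, 3, 127]
restored = dequantize(q, s)
# Rounding error is bounded by half the scale step.
assert all(abs(a - b) <= s / 2 + 1e-12 for a, b in zip(w, restored))
```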
Now we are exploring what could become the third generation: leveraging large language models not just for data labeling, but as the model architecture itself.
The Research Question
While DKSplit performs well, we want to explore a fundamental question: Can large language models, when properly trained on domain-specific data, outperform our specialized BiLSTM architecture in segmentation accuracy?
This is not about speed. Our BiLSTM model will always be faster for inference. The question is purely about accuracy: Can a billion-parameter model, with its vast world knowledge, make better segmentation decisions on ambiguous cases?
During our preliminary experiments, we observed an interesting phenomenon. When testing open-source LLMs on domain segmentation, they sometimes “over-correct” based on world knowledge. For instance, given “applephone”, a model might produce “apple iphone” because it understands the brand association. For exact segmentation this is a defect, but it reveals that these models possess semantic reasoning capabilities that our BiLSTM architecture cannot replicate.
Our Experimental Plan
With 5,000 GPU hours, we need to be strategic about resource allocation. Full-parameter fine-tuning of a 9-billion-parameter model would consume most of our budget on hyperparameter search alone, leaving little room for the complete experimental pipeline. Therefore, we have designed a four-stage approach that maximizes scientific value within our constraints.
Stage 1: DKSplit Baseline on HPC
First, we will retrain our BiLSTM-CRF model on the EuroHPC infrastructure. This serves multiple purposes: establishing a clean baseline under controlled conditions, validating our data pipeline on the HPC environment, and familiarizing ourselves with the Leonardo system.
Estimated resource usage: 50 to 100 GPU hours.
Stage 2: Large Language Model Training with LoRA
The core of our experiment involves training Qwen 3.5 9B on domain segmentation using Low-Rank Adaptation (LoRA). Given our resource constraints, full fine-tuning is not feasible. LoRA allows us to train a small number of additional parameters while keeping the base model frozen, dramatically reducing computational requirements.
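The arithmetic behind that saving is simple: for a frozen d × d weight matrix, LoRA trains a low-rank update B·A with B of shape d × r and A of shape r × d, so the trainable fraction is 2r/d. A back-of-the-envelope sketch (the layer size is illustrative, not Qwen 3.5 9B’s actual shape):

```python
# LoRA parameter-count sketch: a frozen d x d weight gets a trainable
# low-rank update B @ A with B: (d, r) and A: (r, d). The dimensions
# below are illustrative, not Qwen 3.5 9B's actual layer shapes.
def lora_trainable_fraction(d: int, r: int) -> float:
    frozen = d * d          # original weight, kept fixed
    trainable = 2 * d * r   # B and A
    return trainable / frozen

d, r = 4096, 16
print(f"trainable fraction at d={d}, r={r}: {lora_trainable_fraction(d, r):.4%}")
# 2 * 4096 * 16 = 131,072 trainable vs 16,777,216 frozen parameters per layer
```

At rank 16 on a 4096-wide layer, under 1% of the layer’s parameters receive gradients, which is what makes the approach fit our budget.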
We will use DeepSpeed ZeRO Stage 3 for memory-efficient distributed training across multiple A100 GPUs. This technology partitions optimizer states, gradients, and parameters across devices, enabling us to train larger models than would otherwise fit in memory.
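A minimal DeepSpeed configuration in this spirit might look like the following; the values are placeholders for illustration, not our tuned settings.

```json
{
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 4,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```

Stage 3 is the setting that shards parameters themselves (in addition to optimizer states and gradients), which is what allows a 9B model plus optimizer state to span several GPUs.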
The training process includes:
- Hyperparameter search on a subset of data (1 million samples)
- Progressive training runs on larger data portions (5 million samples)
- Final training on the complete 17 million sample dataset
Estimated resource usage: 2,000 to 2,500 GPU hours.
Stage 3: Benchmark Evaluation
We will conduct comprehensive accuracy benchmarks comparing:
- DKSplit BiLSTM (our current production model)
- Qwen 3.5 9B with LoRA fine-tuning
- Qwen 3.5 9B zero-shot (no fine-tuning, as a reference point)
The evaluation will focus on:
- Overall segmentation accuracy (F1 score, exact-match rate)
- Performance on edge cases (brand names, technical terms, multilingual inputs)
- Error analysis to understand where each approach succeeds or fails
Estimated resource usage: 100 to 200 GPU hours.
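The two headline metrics can be sketched in a few lines (an illustration, not our benchmark harness). Tokens are compared by character span, so a word only counts as correct if it appears at the right position:

```python
# Evaluation-metric sketch (not our actual benchmark harness):
# exact match over whole segmentations, and span-level F1.
def spans(segments: list[str]) -> set[tuple[int, int]]:
    # Convert a segmentation into character-offset spans so tokens are
    # compared by position, not just surface form.
    out, pos = set(), 0
    for seg in segments:
        out.add((pos, pos + len(seg)))
        pos += len(seg)
    return out

def f1(pred: list[str], gold: list[str]) -> float:
    p, g = spans(pred), spans(gold)
    tp = len(p & g)  # spans both segmentations agree on
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

def exact_match(pred: list[str], gold: list[str]) -> bool:
    return pred == gold

print(f1(["chat", "gptlogin"], ["chatgpt", "login"]))      # -> 0.0, no span agrees
print(f1(["chatgpt", "log", "in"], ["chatgpt", "login"]))  # one shared span, ~0.4
```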
Stage 4: Knowledge Distillation (Conditional)
If the LoRA fine-tuned model demonstrates meaningfully better accuracy than DKSplit, we will proceed with knowledge distillation. The goal is to transfer the large model’s capabilities into a smaller, deployable model in the 1 to 1.5 billion parameter range.
The distillation process involves using the 9B model as a “teacher” to generate soft labels, then training a smaller “student” model to replicate the teacher’s outputs. This approach can capture much of the large model’s performance in a more efficient package suitable for local deployment.
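In its classic form (Hinton et al.), the student is trained to match temperature-softened teacher probabilities via a KL-divergence loss. A minimal sketch, with illustrative logits rather than real model outputs:

```python
import math

# Knowledge-distillation loss sketch (illustrative, not our training code):
# teacher logits are softened with a temperature T, and the student is
# penalized for diverging from the resulting soft labels.
def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_kl(teacher_logits, student_logits, temperature=2.0) -> float:
    t = softmax(teacher_logits, temperature)  # soft labels from the teacher
    s = softmax(student_logits, temperature)
    # KL(t || s), scaled by T^2 as in Hinton et al.'s formulation
    return temperature ** 2 * sum(ti * math.log(ti / si) for ti, si in zip(t, s))

# A student matching the teacher exactly incurs zero loss;
# any divergence yields a positive loss.
assert distill_kl([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]) == 0.0
assert distill_kl([2.0, 0.5, -1.0], [0.1, 0.2, 0.3]) > 0.0
```

In practice this soft-label term is usually combined with a standard cross-entropy loss on the gold segmentation labels.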
Estimated resource usage: 800 to 1,000 GPU hours (if proceeding).
Resource Budget Summary
| Stage | GPU Hours |
|---|---|
| DKSplit baseline training | 50 to 100 |
| LLM LoRA training with hyperparameter search | 2,000 to 2,500 |
| Benchmark evaluation | 100 to 200 |
| Knowledge distillation (conditional) | 800 to 1,000 |
| Buffer for iterations and unexpected issues | 1,200 to 2,000 |
| Total | 4,150 to 5,800 |
The budget is tight: the upper end of these estimates exceeds our 5,000-hour allocation, so the conditional stages will only proceed if earlier stages come in under budget. This is precisely why we chose LoRA over full fine-tuning. With full-parameter training, hyperparameter search alone could exhaust our entire allocation, leaving no room for the distillation experiments that could yield the most practical value.
What We Hope to Learn
This experiment addresses several questions important to our work:
Can LLMs outperform specialized architectures on domain-specific tasks? Our BiLSTM model was designed specifically for domain segmentation. A general-purpose LLM has broader knowledge but less task-specific optimization. Understanding this trade-off helps guide future development decisions.
Is knowledge distillation viable for our use case? If we can capture LLM-level accuracy in a 1-billion-parameter model, it opens new possibilities for deployment. Such a model could run on consumer hardware while potentially exceeding our current BiLSTM’s accuracy.
How do we effectively use HPC resources for AI training? This is our first experience with supercomputing-scale infrastructure. Learning to work with DeepSpeed ZeRO, distributed training, and HPC job scheduling systems is valuable regardless of experimental outcomes.
Expected Outcomes
By the end of this project, we expect to deliver:
- A comprehensive benchmark comparing BiLSTM and LLM approaches to domain segmentation
- Detailed documentation of our HPC training workflow using DeepSpeed ZeRO
- If successful, a distilled model suitable for local deployment
- Lessons learned and best practices for SMEs working with EuroHPC resources
We will share our findings in a follow-up post once the experiments are complete.
Looking Forward
Beyond segmentation accuracy, this work has broader implications. Accurate domain name understanding is the foundation for more complex semantic analysis tasks. If we can reliably parse domain strings into meaningful components, we unlock capabilities like automated brand threat detection, phishing identification, and large-scale domain intelligence.
The techniques we are exploring here (LoRA fine-tuning, DeepSpeed ZeRO distributed training, and knowledge distillation) represent a methodology that extends far beyond this single use case. What we learn on Leonardo will inform how we approach AI development at scale.
About EuroHPC AI Factory Playground
The EuroHPC AI Factory Playground program provides European SMEs and startups with access to world-class supercomputing resources for AI development. It offers quick, lightweight access designed for entry-level industrial users or those new to high-performance computing. Applications are processed on a first-come, first-served basis, with access typically granted within two working days.
We are grateful to the EuroHPC Joint Undertaking and CINECA for making this opportunity available. For SMEs like us, access to high-performance computing has traditionally been out of reach. Programs like Playground change that equation, enabling European companies to experiment with AI at scale.
If you are an SME interested in exploring AI training at scale, we encourage you to learn more at eurohpc-ju.europa.eu/ai-factories.
DKSplit is available as open source software at github.com/ABTdomain/dksplit.