DKSplit on EuroHPC: Searching for a Teacher Model Across Architectures

In our previous posts (first, second), we shared early results from EuroHPC Leonardo: BiLSTM upgrades, LLM experiments, and the discovery that different model architectures fail in fundamentally different ways on domain segmentation. We proposed a hybrid pipeline and identified DeBERTa’s subword tokenizer ceiling.

This is the midterm update. We scaled the benchmark to 5,000 samples, tested models across four architecture families, and systematically evaluated hybrid strategies. But the goal was never just to rank models on a leaderboard. DKSplit, our production BiLSTM-CRF, already handles segmentation at scale. The question driving this work is: can we find a model with enough world knowledge to handle the cases that a purely statistical character-level model cannot? Multilingual compounds, brand portmanteaus, domain names that require understanding what the words mean, not just where to split them.

The Problem: What BiLSTM Cannot Do

DKSplit is a 12-million-parameter BiLSTM-CRF that processes ~1,600 domains per second on a single CPU core. On our 5,000-domain benchmark, it reaches 4,451/5,000 (89.0% lenient exact match). For a production segmentation tool, this is strong.

But its 549 errors tell a specific story. DKSplit does not know any language. It has learned statistical patterns over character sequences, but it has no concept of Turkish morphology, Vietnamese syllable structure, German compound rules, or brand names. When it encounters aydindaasansorlutasimacilik (Turkish), batdongsankbang (Vietnamese), or digitalpflegezentrum (German), it guesses based on character n-gram patterns, and it guesses wrong. These are not edge cases; multilingual and brand-related domains make up a growing share of daily registrations.

We need a model that knows things about the world to complement BiLSTM’s statistical precision. Here, “world knowledge” encompasses linguistic regularities (morphology, compound rules), named entities (brands, geography), and domain-intent associations (finance, retail, technology). The EuroHPC experiments are a systematic search for that model.

Key Findings

1. Every Transformer variant we tested has a specific structural limitation for character-level segmentation. DeBERTa’s subword tokenizer creates a hard 93.1% ceiling. CANINE’s downsampling may lose character-boundary precision. A pre-EuroHPC from-scratch Transformer underperformed significantly, though with uncontrolled variables (no CRF, different data, less rigorous training) that prevent architectural conclusions. The right character-level Transformer architecture for this task has not yet been found.

2. LLMs have the world knowledge we are looking for, but cannot yet deliver it with the precision segmentation requires. A fine-tuned 9B LLM reaches 4,337/5,000, still 114 samples behind BiLSTM. More importantly, LLMs are generative and alter characters: applephone becomes apple iphone. This makes them unsuitable as direct segmentation tools in the generative format we tested, but the mutations themselves are a consistently observed phenomenon that warrants deeper investigation.

3. Simple voting captures the strongest practical ensemble gain, but leaves 74% of rescuable errors untouched. Four-model majority voting adds +108 samples over DKSplit alone (4,559 vs. 4,451). More complex strategies (confidence cascading, learned stacking) do not exceed the voting baseline. Of the 408 errors where at least one model has the correct answer, voting recovers 108 (26%). Input-level routing (e.g., by detected language) remains unexplored but requires per-language models that outperform BiLSTM on their respective subsets.

4. LLM character mutations fall into four distinct categories — language completion, spelling correction, high-frequency substitution, and semantic reinterpretation. Fine-tuning reduces mutations by 96% but does not eliminate any category. The surviving mutations cluster on inputs where the intended reading is genuinely ambiguous.

5. The immediate next experiment is a controlled comparison: a from-scratch character-level Transformer with CRF, same data, same methodology as BiLSTM. This must come first because it answers the prerequisite question: can the Transformer architecture itself compete? Only if it does will a pretrained model like ByT5 become a meaningful follow-up for testing whether pretraining adds world knowledge.

Experimental Setup

Benchmark

benchmark_5000: 5,000 domain name strings with ground-truth segmentation labels. The dataset builds on our earlier 1,000-sample benchmark by adding 4,000 new samples drawn from different batches of newly registered domains across multiple TLDs. Each sample has a truth field and an optional might_right field for cases where multiple segmentations are linguistically valid (e.g., autohaus can be auto haus or remain as a single German compound). We report both strict exact match (must match truth) and lenient exact match (may match either).
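
A minimal sketch of the two scoring modes, assuming might_right is a list of acceptable alternative segmentations as described above (illustrative only, not our production evaluator):

```python
# Strict/lenient scorer for one benchmark sample. Field names follow the
# dataset description above; this helper is an illustration, not our evaluator.
def score(prediction: str, truth: str, might_right: list[str] | None = None) -> dict:
    strict = prediction == truth
    lenient = strict or (might_right is not None and prediction in might_right)
    return {"strict": strict, "lenient": lenient}

# 'autohaus' may stay whole or split into 'auto haus':
print(score("auto haus", truth="autohaus", might_right=["auto haus"]))
# -> {'strict': False, 'lenient': True}
```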

We focus exclusively on Exact Match rather than partial metrics like Boundary F1-score. In our downstream production pipeline (e.g., brand impersonation detection), partial correctness has zero business value. A single missed character boundary yields the same matching failure as a completely wrong segmentation.

Models Evaluated

| Model | Parameters | Architecture | Method |
|---|---|---|---|
| DKSplit v0.3.1 (production) | 12M | BiLSTM-CRF | Character-level sequence labeling |
| DKSplit v1 (older production) | 12M | BiLSTM-CRF | Character-level sequence labeling |
| CANINE-C (pretrained) | 132M | Char Transformer + CRF | Character-level sequence labeling with downsampling |
| DeBERTa-V3-Base (pretrained) | 86M | Subword Transformer + CRF | Subword-to-character sequence labeling |
| Qwen3.5-9B (fine-tuned) | 9B | Decoder LLM | LoRA r=128, generative |
| Qwen3.5-9B (zero-shot) | 9B | Decoder LLM | Generative, no training |
| WordSegment | — | Statistical unigram model | Frequency-based word segmentation |
| WordNinja | — | Statistical unigram model | Frequency-based word segmentation |

For statistical baselines, we selected WordSegment and WordNinja, the de facto standard open-source tools engineers reach for when dealing with domain name segmentation.
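
Both expose one-line APIs; a quick illustration of how they are typically called (exact outputs depend on their bundled frequency dictionaries):

```python
import wordsegment
import wordninja

wordsegment.load()  # loads the unigram/bigram frequency tables once
print(wordsegment.segment("applephone"))  # e.g. ['apple', 'phone']
print(wordninja.split("applephone"))      # e.g. ['apple', 'phone']
```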

All sequence-labeling models (BiLSTM, CANINE, DeBERTa) output B/I tags on input characters. They cannot add, remove, or change characters, only decide where to place word boundaries. The LLM (Qwen) is generative: it produces output text, which means it can and does alter characters.
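
The tagging scheme is easy to make concrete. A minimal round-trip sketch (B = first character of a word, I = continuation; tags move boundaries but can never touch the characters themselves):

```python
# Convert a gold segmentation to per-character B/I tags and back.
def segmentation_to_tags(words: list[str]) -> list[str]:
    return [("B" if i == 0 else "I") for w in words for i in range(len(w))]

def tags_to_segmentation(text: str, tags: list[str]) -> list[str]:
    words, current = [], ""
    for ch, tag in zip(text, tags):
        if tag == "B" and current:
            words.append(current)
            current = ""
        current += ch
    if current:
        words.append(current)
    return words

tags = segmentation_to_tags(["digital", "pflege", "zentrum"])
print(tags_to_segmentation("digitalpflegezentrum", tags))
# -> ['digital', 'pflege', 'zentrum']
```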

Why use LLMs in generative mode? Our hypothesis was that BiLSTM’s remaining errors need language-specific and semantic knowledge to resolve. LLMs, pretrained on massive multilingual corpora, might “know” how to segment these domains. We chose generative evaluation over token classification for three reasons.

First, a practical one: LLM tokenizers operate on subword units, not characters. Mapping subword tokens back to character-level B/I tags requires an alignment layer that introduces its own errors and engineering complexity. Generative mode sidesteps this entirely by letting the model output segmented text directly.

Second, an observational one: generative output is unconstrained. The model can freely express what it “thinks” the input should be, including completing partial words, correcting perceived misspellings, or substituting higher-probability sequences. A classification head restricts the model to a predefined label space, masking exactly the kind of behavior we wanted to observe.

Third, the cost: generative mode means the model can and does alter characters, which makes it unsuitable as a drop-in segmentation tool in this evaluation format. This is not a free choice. It trades precision for observability. The character mutations that result from this trade-off turned out to be the most informative finding of the LLM experiments.
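
To make the first reason concrete, here is a sketch of the alignment layer a token-classification setup would need, using offset mappings from a Hugging Face fast tokenizer (the checkpoint name is the standard public one, and the per-token tags are hypothetical model predictions):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")  # assumed checkpoint
text = "applephone"
enc = tok(text, return_offsets_mapping=True, add_special_tokens=False)

token_tags = ["B"] * len(enc["offset_mapping"])  # hypothetical per-token predictions

# Project token-level tags down to one tag per character via the offset mapping.
char_tags = ["I"] * len(text)
for (start, end), tag in zip(enc["offset_mapping"], token_tags):
    if tag == "B":
        char_tags[start] = "B"  # a 'B' token can only mark its first character

print(list(zip(text, char_tags)))
# Note: if one token spans a word boundary, no projection can split inside it.
```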

A note on CANINE’s architecture: CANINE uses stride-based downsampling to handle long character sequences, then upsamples back to character resolution. This means some character-level boundary information may be lost during compression.

Results: benchmark_5000

| Model | Strict | Lenient | Errors (lenient) |
|---|---|---|---|
| DKSplit v0.3.1 (production) | 4,340/5,000 | 4,451/5,000 | 549 |
| CANINE-C (epoch 8) | 4,245/5,000 | 4,340/5,000 | 660 |
| Qwen3.5-9B fine-tuned (adv prompt) | 4,238/5,000 | 4,337/5,000 | 663 |
| DKSplit v1 (older production) | 4,211/5,000 | 4,328/5,000 | 672 |
| Qwen3.5-9B fine-tuned (std prompt) | 4,204/5,000 | 4,311/5,000 | 689 |
| DeBERTa-V3 (epoch 3, 50% data) | 4,049/5,000 | 4,170/5,000 | 830 |
| WordSegment | 3,180/5,000 | 3,269/5,000 | 1,731 |
| Qwen3.5-9B zero-shot | 2,901/5,000 | 2,984/5,000 | 2,016 |
| WordNinja | 2,501/5,000 | 2,568/5,000 | 2,432 |

What the Numbers Tell Us

None of the Transformer or LLM models we tested surpass BiLSTM on segmentation accuracy. But the purpose of these experiments was not to find a drop-in replacement. It was to understand why each architecture falls short and what capabilities each brings.

A note on statistical significance: The gaps between the top models (e.g., BiLSTM vs. CANINE: 111 samples, BiLSTM vs. Qwen fine-tuned: 114 samples) are reported as descriptive counts on a fixed benchmark. We have not run significance tests (e.g., McNemar’s test) on these differences. Readers should treat small differences with caution; the qualitative error analysis below is more informative than the raw ranking.

Fine-tuning vs. zero-shot: Qwen3.5-9B jumps from 2,984 to 4,337 after fine-tuning (+1,353 samples), and this generalizes to completely unseen domains. The zero-shot model scores below the stronger statistical baseline (WordSegment: 3,269), confirming that a 9B LLM without task-specific training has no inherent advantage over dictionary lookup for character-level segmentation.

Inference throughput (single-core CPU, commodity desktop):

| Model | Runtime | Throughput (domains/sec) |
|---|---|---|
| DKSplit v0.3.1 | ONNX Runtime | ~1,600 |
| BiLSTM 384 | PyTorch | ~770 |
| CANINE-C 132M | PyTorch | ~100 |
| DeBERTa-V3 86M | PyTorch | ~80 |

Note: DKSplit’s ONNX runtime provides an estimated 2-3x speedup over PyTorch for the same architecture. The BiLSTM 384 row (same architecture, PyTorch) gives a more direct comparison to CANINE and DeBERTa. LLMs require GPU and are not candidates for production segmentation throughput.
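
For reference, a minimal single-core throughput probe of this kind, assuming an exported model file dksplit.onnx with an int64 input named char_ids (both names are hypothetical):

```python
import time
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1  # pin inference to a single CPU core
opts.inter_op_num_threads = 1
sess = ort.InferenceSession("dksplit.onnx", opts, providers=["CPUExecutionProvider"])

batch = np.random.randint(1, 100, size=(1, 24), dtype=np.int64)  # dummy char ids
n = 1_000
start = time.perf_counter()
for _ in range(n):
    sess.run(None, {"char_ids": batch})
print(f"{n / (time.perf_counter() - start):,.0f} domains/sec")
```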

Why Each Transformer Falls Short

DeBERTa: Blocked by Its Tokenizer

DeBERTa’s subword tokenizer (SentencePiece) merges characters across word boundaries into single tokens. When this happens, the model cannot place a split there. We measured this: 6.9% of samples (345/5,000) have at least one token crossing a word boundary, imposing a hard ceiling of ~93.1%. This is not a training problem; it is architectural. Replacing the tokenizer would invalidate the pretrained weights, defeating the purpose of using a pretrained model.

Conclusion: Subword-tokenized pretrained models are architecturally mismatched for character-level boundary tasks.
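
The measurement itself is straightforward to sketch: flag a sample whenever any token’s character span crosses a gold word boundary (checkpoint name assumed; the gold segmentations here are toy input):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")  # assumed checkpoint

def has_blocked_boundary(words: list[str]) -> bool:
    """True if any subword token spans a gold word boundary."""
    text = "".join(words)
    boundaries, pos = set(), 0
    for w in words[:-1]:  # a boundary falls after every word but the last
        pos += len(w)
        boundaries.add(pos)
    enc = tok(text, return_offsets_mapping=True, add_special_tokens=False)
    # a token (start, end) crosses boundary b when start < b < end
    return any(start < b < end
               for start, end in enc["offset_mapping"]
               for b in boundaries)

samples = [["apple", "phone"], ["payday", "loans"]]  # toy gold segmentations
blocked = sum(has_blocked_boundary(s) for s in samples)
print(f"{blocked}/{len(samples)} samples have an unreachable boundary")
```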

CANINE: Character-Level Input, but Downsampling Creates Uncertainty

CANINE’s character-level tokenizer gives it a theoretical 100% ceiling. It reaches 4,340/5,000 (86.8%), trailing BiLSTM by 111 samples. However, CANINE’s stride-based downsampling compresses character sequences before the deep Transformer layers. This may lose fine-grained boundary information, though we have not isolated downsampling as the cause: other factors (model capacity, pretraining data, CRF integration) could also contribute.

What we can say: CANINE with its current architecture does not match BiLSTM-CRF.

What remains open: Whether downsampling is the bottleneck. A clean ablation (e.g., reducing CANINE’s stride) would isolate this variable, but CANINE’s architecture makes this non-trivial. ByT5, which avoids downsampling entirely, offers a different angle on the same question, though a positive result alone would not prove downsampling was CANINE’s limiting factor.

CharBert: Historical Baseline, Variables Not Controlled

In January 2026, before the EuroHPC project, we trained a from-scratch character-level Transformer encoder (8-layer, 512 hidden, 25.5M parameters, softmax classification). It scored 3,717/5,000 lenient on benchmark_5000. We exclude it from the main results table because the comparison is not controlled: it used no CRF layer (all other sequence-labeling models use CRF), an earlier and less diverse training dataset, and a less rigorous training process (no validation set, no learning rate scheduling, no early stopping). With this many confounding variables, the result tells us nothing about the Transformer architecture itself. Rerunning this experiment with controlled variables is one of the two immediate next steps on Leonardo.

Error Overlap Across the Top Four Models

We compare the error sets of the four best models on benchmark_5000: DKSplit v0.3.1, CANINE-C, Qwen3.5-9B fine-tuned (adv prompt), and DeBERTa-V3.

| Group | Count |
|---|---|
| All 4 models wrong | 141 |
| Only DKSplit wrong | 111 |
| Only CANINE wrong | 185 |
| Only Qwen wrong | 165 |
| Only DeBERTa wrong | 352 |

The 141 samples where all four models fail are predominantly multilingual domain names: Turkish, Vietnamese, German compounds, brand portmanteaus. These are the cases where world knowledge would help most, and no model we tested has enough of it to consistently get them right.

DKSplit has the fewest unique errors (111): when it fails, others usually fail too. DeBERTa has the most unique errors (352), largely due to its tokenizer ceiling. The Oracle ceiling (at least one of the four is correct) is 4,859/5,000.

Hybrid Pipelines: Can We Combine Models?

The error overlap suggests complementarity: an Oracle ensemble (always picking the correct model) reduces errors from 549 to 141. We tested four strategies using DKSplit v0.3.1 as the primary model:

| Strategy | Best Result | vs. DKSplit Solo |
|---|---|---|
| Majority vote (4 models, equal weight) | 4,559/5,000 | +108 |
| Weighted vote (4 models, DKSplit 1.1x) | 4,559/5,000 | +108 |
| CRF confidence cascade (CANINE + DeBERTa fallback) | 4,528/5,000 | +77 |
| Learned stacking (logistic regression, 4-model, 5-fold CV) | 4,542/5,000 | +91 |

The Oracle ceiling is 4,859/5,000 (408 rescuable errors out of DKSplit’s 549). The best practical gain: +108 samples from simple majority voting with four models.

Simple majority voting captures the largest practical gain (+108), meaning the raw agreement signal is stronger than the confidence signals we tested. The CRF confidence cascade (+77) and learned stacking (+91) add complexity without exceeding the voting baseline. This makes sense: when three out of four models agree on an answer different from DKSplit, that consensus is usually correct. The harder question is what to do when models disagree without clear consensus, and none of our strategies solve this reliably.
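
A sketch of the voting rule with toy predictions. The tie-break (falling back to DKSplit when no strict majority exists) is one plausible choice consistent with DKSplit being the primary model, not a confirmed implementation detail:

```python
from collections import Counter

def majority_vote(predictions: dict[str, str], primary: str = "dksplit") -> str:
    """Return the consensus segmentation, or the primary model's answer."""
    best, best_count = Counter(predictions.values()).most_common(1)[0]
    return best if best_count > len(predictions) / 2 else predictions[primary]

preds = {  # hypothetical per-model outputs for one domain
    "dksplit": "bat dongsan kbang",
    "canine":  "bat dong san kbang",
    "qwen":    "bat dong san kbang",
    "deberta": "bat dong san kbang",
}
print(majority_vote(preds))  # -> 'bat dong san kbang' (3-of-4 consensus wins)
```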

However, 108 of the 408 rescuable errors is still only 26%. The remaining 300 errors have correct answers among the models but no voting or confidence signal strong enough to identify them. Better per-sample routing features (domain length, character entropy, language detection) could narrow this gap, but we have not yet explored these directions.
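
As an illustration of what such per-sample features might look like (we have not trained a router on these):

```python
import math
from collections import Counter

def routing_features(domain: str) -> dict[str, float]:
    """Cheap per-sample features a future router could condition on."""
    counts = Counter(domain)
    n = len(domain)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"length": float(n), "char_entropy": entropy}

print(routing_features("aydindaasansorlutasimacilik"))
```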

All four strategies we tested are output-level ensembles: they select among predictions after each model has already committed to an answer. We have not explored input-level routing, such as a lightweight language classifier that dispatches domains to specialized models by detected language. This is a natural next step suggested by the error data, but it requires either per-language models that outperform BiLSTM on their respective subsets, or a reliable language detector that works on concatenated strings without spaces. Neither exists yet in our pipeline.

LLM Behavior: Semantic Knowledge Worth Exploring

LLMs fail at precise segmentation, but we have long suspected that their world knowledge could serve a different purpose. The mutation data from this round of experiments provides concrete evidence for that hypothesis.

Character Mutations

We ran a mutation test on 100,000 real-world domain names from newly registered domain streams (2026-05-07 to 2026-05-11), all unseen during training:

| Configuration | Mutations per 100K |
|---|---|
| Qwen3.5-9B zero-shot | 5,618 |
| Qwen3.5-9B fine-tuned, character-preservation prompt, epoch 1 | 2,361 |
| Qwen3.5-9B fine-tuned, standard prompt, epoch 3 | 389 |
| Qwen3.5-9B fine-tuned, character-preservation prompt, epoch 3 | 252 |

Fine-tuning is the dominant factor: even one epoch reduces mutations by 58%, and by epoch 3 they drop to 252 (96% reduction overall) on completely unseen domains. Prompt design has marginal impact once training converges: swapping prompts at inference changes almost nothing (389 vs. 387; 252 vs. 257). The behavior is baked into weights during training.
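
The mutation check itself is simple; a minimal version counts an output as a mutation when its characters, with the inserted spaces removed, no longer match the input:

```python
def is_mutation(domain: str, llm_output: str) -> bool:
    """A generative output mutates the input if it changes any character."""
    return llm_output.replace(" ", "") != domain

print(is_mutation("applephone", "apple phone"))   # False: segmentation only
print(is_mutation("applephone", "apple iphone"))  # True: a character was added
```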

What the Mutations Reveal

The mutations are not random. They follow a pattern: the model substitutes what it considers a more probable sequence:

  • orcalyze becomes oracle lyze (high-frequency substitution: the model replaces “orca” with the more common sequence “oracle”)
  • bluoil becomes blue oil (category association: energy/petroleum domain)
  • zirvetekno becomes zirve teknoloji (Turkish language completion: the model completes the Turkish word for “technology”)
  • paydayloansfortlauderdalefl becomes payday loans for lauderdale fl (the model attempts geographic parsing but merges “fort” into “for”, losing the city name Fort Lauderdale)
  • allfundinggoupllc becomes all funding group llc (reconstructs missing “r” in “group”: financial/corporate domain)
  • entertheesperience becomes enter the experience (corrects “esperience” to “experience”: marketing domain)
  • ethiccouch becomes ethnic couch (semantic probability: “ethnic couch” is more likely in the training distribution)
  • organictomatos becomes organic tomatoes (spelling correction: food/agriculture domain)

The examples above fall into a few recurring categories: language completion (zirvetekno becomes teknoloji), spelling correction (organictomatos becomes tomatoes), high-frequency substitution (orcalyze becomes oracle), and semantic reinterpretation (ethiccouch becomes ethnic couch). Fine-tuning reduces the total count by 96%, but does not eliminate any category entirely. Even at 252 mutations per 100K, all four types still appear. The mutations that survive training tend to involve inputs where the “correct” reading and the model’s preferred reading are genuinely ambiguous.
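
For triage, a small difflib-based helper can surface the raw character edits behind each mutation; assigning one of the four categories still takes human judgment:

```python
import difflib

def char_edits(domain: str, llm_output: str) -> list[tuple[str, str, str]]:
    """List (operation, original span, replacement span) per character edit."""
    stripped = llm_output.replace(" ", "")
    sm = difflib.SequenceMatcher(a=domain, b=stripped)
    return [(op, domain[i1:i2], stripped[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

print(char_edits("organictomatos", "organic tomatoes"))
# -> [('insert', '', 'e')]  (the spelling-correction example above)
```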

These mutations are a liability for segmentation. But they reveal something the character-level models do not have: the model possesses world knowledge about which character sequences are linguistically and commercially plausible. The right question is not how to suppress this knowledge, but how to use it through the right interface. That interface is unlikely to be segmentation itself.

Open Questions

  1. ByT5 and improved character-level Transformer. We tested CANINE (with downsampling) and DeBERTa (with subword tokenization), both structurally limited. ByT5 (Google’s byte-level T5) uses pure byte-level input without either limitation. We also need to rerun the from-scratch Transformer with CRF and full training data. Both experiments are being prepared for Leonardo, with the controlled Transformer-with-CRF run first and ByT5 contingent on its outcome.
  2. DeBERTa with optimized data pipeline. Our DeBERTa used data prepared for BiLSTM. A pipeline designed for subword models might improve results within the 93.1% ceiling, but the ceiling itself is the fundamental constraint.
  3. LLM knowledge beyond segmentation. The mutations demonstrate that LLMs encode information about character sequences that purely statistical models do not. The shape of that information, and the tasks where it could be useful, remain open questions worth pursuing on their own terms rather than as extensions of the segmentation problem.
  4. LLMs as token classifiers. Our current LLM results reflect the generative evaluation format, not necessarily the limits of LLM knowledge for segmentation. A token-classification evaluation (with a subword-to-character alignment layer) would establish a fairer comparison to the sequence-labeling models and separate format-induced errors from genuine knowledge gaps.

What Comes Next

The experiments so far have ruled out two specific structural paths and narrowed the search to a question we could not have formulated before this work:

Can a pure character/byte-level Transformer, without subword tokenization and without downsampling, match BiLSTM-CRF on character-boundary tasks?

DeBERTa is blocked by its tokenizer (a structural limitation we can measure). CANINE underperforms with its current architecture, though we have not isolated downsampling as the sole cause. These two results point toward character/byte-level input without downsampling, but other dimensions remain unexplored: training objectives (MLM vs. span corruption vs. CTC), model scale effects, and alternative structured prediction heads beyond CRF.

Two experiments will address this, in sequence:

First: Character-Level Transformer with CRF (controlled comparison). Same CRF layer, same training data, same methodology as BiLSTM. This is the prerequisite experiment: it isolates the pure architectural question (Transformer attention vs. BiLSTM recurrence) with all other variables controlled. If this fails, it weakens the case for pursuing large pretrained character/byte models as segmentation replacements, though it would not rule out gains from pretraining.
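
A structural sketch of that model, assuming a CRF head of the same kind as DKSplit’s (hyperparameters are placeholders, not the Leonardo configuration; uses the pytorch-crf package):

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class CharTransformerCRF(nn.Module):
    """Character-level Transformer encoder with a CRF head over B/I tags."""
    def __init__(self, vocab_size=128, d_model=512, n_layers=8,
                 n_tags=2, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_tags = nn.Linear(d_model, n_tags)  # per-character B/I emissions
        self.crf = CRF(n_tags, batch_first=True)

    def _emissions(self, char_ids):
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        x = self.embed(char_ids) + self.pos(positions)
        return self.to_tags(self.encoder(x))

    def loss(self, char_ids, tags, mask):
        return -self.crf(self._emissions(char_ids), tags, mask=mask)

    def decode(self, char_ids, mask):
        return self.crf.decode(self._emissions(char_ids), mask=mask)
```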

Second, contingent on the first: ByT5 (Byte-level T5). ByT5 removes both subword tokenization and downsampling simultaneously, so it cannot isolate which variable matters. It is an exploratory probe, not a hypothesis test: it answers the end-to-end question of whether a pretrained Transformer, free of both known structural limitations, can match BiLSTM precision while bringing world knowledge. If the controlled Transformer experiment succeeds, ByT5 tests whether pretraining adds value. If the controlled experiment fails, ByT5 becomes lower priority.

The search for a teacher model is not over. The next step is character/byte-level architectures, but other approaches (different training objectives, model scales, structured prediction alternatives) remain untested.

Benchmark Data

The benchmark dataset and per-model prediction results are available for download:

  • benchmark_5000.csv (5,000 samples with ground truth): Download

DKSplit on EuroHPC Series

  1. A Two-Week Journey on EuroHPC Leonardo
  2. Cleaner Benchmark, First DeBERTa Run, Different Failure Modes
  3. Searching for a Teacher Model Across Architectures (this post)

We acknowledge the European High Performance Computing Joint Undertaking (EuroHPC JU) for awarding this project access to the Leonardo supercomputer, hosted by CINECA in Italy.

Co-funded by the European Union. Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European High Performance Computing Joint Undertaking.