DKSplit on EuroHPC: Searching for a Teacher Model Across Architectures

In our previous posts (first, second), we shared early results from EuroHPC Leonardo: BiLSTM upgrades, LLM experiments, and the discovery that different model architectures fail in fundamentally different ways on domain segmentation. We proposed a hybrid pipeline and identified DeBERTa’s subword tokenizer ceiling.

This is the midterm update. We scaled the benchmark to 5,000 samples, tested models across four architecture families, and systematically evaluated hybrid strategies. But the goal was never just to rank models on a leaderboard. DKSplit, our production BiLSTM-CRF, already handles segmentation at scale. The question driving this work is: can we find a model with enough world knowledge to handle the cases that a purely statistical character-level model cannot? Multilingual compounds, brand portmanteaus, domain names that require understanding what the words mean, not just where to split them.

The Problem: What BiLSTM Cannot Do

DKSplit is a 12-million-parameter BiLSTM-CRF that processes ~1,600 domains per second on a single CPU core. On our 5,000-domain benchmark, it reaches 4,451/5,000 (89.0% lenient exact match). For a production segmentation tool, this is strong.

But its 549 errors tell a specific story. DKSplit does not know any language. It has learned statistical patterns over character sequences, but it has no concept of Turkish morphology, Vietnamese syllable structure, German compound rules, or brand names. When it encounters aydindaasansorlutasimacilik (Turkish), batdongsankbang (Vietnamese), or digitalpflegezentrum (German), it guesses based on character n-gram patterns, and it guesses wrong. These are not edge cases; multilingual and brand-related domains make up a growing share of daily registrations.

We need a model that knows things about the world to complement BiLSTM’s statistical precision. Here, “world knowledge” encompasses linguistic regularities (morphology, compound rules), named entities (brands, geography), and domain-intent associations (finance, retail, technology). The EuroHPC experiments are a systematic search for that model.

Key Findings

1. Every Transformer variant we tested has a specific structural limitation for character-level segmentation. DeBERTa’s subword tokenizer creates a hard 93.1% ceiling. CANINE’s downsampling may lose character-boundary precision. A pre-EuroHPC from-scratch Transformer underperformed significantly, though with uncontrolled variables (no CRF, different data, less rigorous training) that prevent architectural conclusions. The right character-level Transformer architecture for this task has not yet been found.

2. LLMs have the world knowledge we are looking for, but cannot yet deliver it with the precision segmentation requires. A fine-tuned 9B LLM reaches 4,337/5,000, still 114 samples behind BiLSTM. More importantly, LLMs are generative and alter characters: applephone becomes apple iphone. This makes them unsuitable as direct segmentation tools in the generative format we tested, but the mutations themselves are a consistently observed phenomenon that warrants deeper investigation.

3. Simple voting captures the strongest practical ensemble gain, but leaves 74% of rescuable errors untouched. Four-model majority voting adds +108 samples over DKSplit alone (4,559 vs. 4,451). More complex strategies (confidence cascading, learned stacking) do not exceed the voting baseline. Of the 408 errors where at least one model has the correct answer, voting recovers 108 (26%). Input-level routing (e.g., by detected language) remains unexplored but requires per-language models that outperform BiLSTM on their respective subsets.

4. LLM character mutations fall into four distinct categories — language completion, spelling correction, high-frequency substitution, and semantic reinterpretation. Fine-tuning reduces mutations by 96% but does not eliminate any category. The surviving mutations cluster on inputs where the intended reading is genuinely ambiguous.

5. The immediate next experiment is a controlled comparison: a from-scratch character-level Transformer with CRF, same data, same methodology as BiLSTM. This must come first because it answers the prerequisite question: can the Transformer architecture itself compete? Only if it does will a pretrained model like ByT5 become a meaningful follow-up for testing whether pretraining adds world knowledge.

Experimental Setup

Benchmark

benchmark_5000: 5,000 domain name strings with ground-truth segmentation labels. The dataset builds on our earlier 1,000-sample benchmark by adding 4,000 new samples drawn from different batches of newly registered domains across multiple TLDs. Each sample has a truth field and an optional might_right field for cases where multiple segmentations are linguistically valid (e.g., autohaus can be auto haus or remain as a single German compound). We report both strict exact match (must match truth) and lenient exact match (may match either).
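
A minimal sketch of the two scoring modes, assuming might_right is a list of acceptable alternative segmentations as described above (illustrative only, not our production evaluator):

```python
# Strict/lenient scorer for one benchmark sample. Field names follow the
# dataset description above; this helper is an illustration, not our evaluator.
def score(prediction: str, truth: str, might_right: list[str] | None = None) -> dict:
    strict = prediction == truth
    lenient = strict or (might_right is not None and prediction in might_right)
    return {"strict": strict, "lenient": lenient}

# 'autohaus' may stay whole or split into 'auto haus':
print(score("auto haus", truth="autohaus", might_right=["auto haus"]))
# -> {'strict': False, 'lenient': True}
```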

We focus exclusively on Exact Match rather than partial metrics like Boundary F1-score. In our downstream production pipeline (e.g., brand impersonation detection), partial correctness has zero business value. A single missed character boundary yields the same matching failure as a completely wrong segmentation.

Models Evaluated

| Model | Parameters | Architecture | Method |
|---|---|---|---|
| DKSplit v0.3.1 (production) | 12M | BiLSTM-CRF | Character-level sequence labeling |
| DKSplit v1 (older production) | 12M | BiLSTM-CRF | Character-level sequence labeling |
| CANINE-C (pretrained) | 132M | Char Transformer + CRF | Character-level sequence labeling with downsampling |
| DeBERTa-V3-Base (pretrained) | 86M | Subword Transformer + CRF | Subword-to-character sequence labeling |
| Qwen3.5-9B (fine-tuned) | 9B | Decoder LLM | LoRA r=128, generative |
| Qwen3.5-9B (zero-shot) | 9B | Decoder LLM | Generative, no training |
| WordSegment | — | Statistical unigram model | Frequency-based word segmentation |
| WordNinja | — | Statistical unigram model | Frequency-based word segmentation |

For statistical baselines, we selected WordSegment and WordNinja, the de facto standard open-source tools engineers reach for when dealing with domain name segmentation.
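
Both expose one-line APIs; a quick illustration of how they are typically called (exact outputs depend on their bundled frequency dictionaries):

```python
import wordsegment
import wordninja

wordsegment.load()  # loads the unigram/bigram frequency tables once
print(wordsegment.segment("applephone"))  # e.g. ['apple', 'phone']
print(wordninja.split("applephone"))      # e.g. ['apple', 'phone']
```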

All sequence-labeling models (BiLSTM, CANINE, DeBERTa) output B/I tags on input characters. They cannot add, remove, or change characters, only decide where to place word boundaries. The LLM (Qwen) is generative: it produces output text, which means it can and does alter characters.
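
The tagging scheme is easy to make concrete. A minimal round-trip sketch (B = first character of a word, I = continuation; tags move boundaries but can never touch the characters themselves):

```python
# Convert a gold segmentation to per-character B/I tags and back.
def segmentation_to_tags(words: list[str]) -> list[str]:
    return [("B" if i == 0 else "I") for w in words for i in range(len(w))]

def tags_to_segmentation(text: str, tags: list[str]) -> list[str]:
    words, current = [], ""
    for ch, tag in zip(text, tags):
        if tag == "B" and current:
            words.append(current)
            current = ""
        current += ch
    if current:
        words.append(current)
    return words

tags = segmentation_to_tags(["digital", "pflege", "zentrum"])
print(tags_to_segmentation("digitalpflegezentrum", tags))
# -> ['digital', 'pflege', 'zentrum']
```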

Why use LLMs in generative mode? Our hypothesis was that BiLSTM’s remaining errors need language-specific and semantic knowledge to resolve. LLMs, pretrained on massive multilingual corpora, might “know” how to segment these domains. We chose generative evaluation over token classification for three reasons.

First, a practical one: LLM tokenizers operate on subword units, not characters. Mapping subword tokens back to character-level B/I tags requires an alignment layer that introduces its own errors and engineering complexity. Generative mode sidesteps this entirely by letting the model output segmented text directly.

Second, an observational one: generative output is unconstrained. The model can freely express what it “thinks” the input should be, including completing partial words, correcting perceived misspellings, or substituting higher-probability sequences. A classification head restricts the model to a predefined label space, masking exactly the kind of behavior we wanted to observe.

Third, the cost: generative mode means the model can and does alter characters, which makes it unsuitable as a drop-in segmentation tool in this evaluation format. This is not a free choice. It trades precision for observability. The character mutations that result from this trade-off turned out to be the most informative finding of the LLM experiments.
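
To make the first reason concrete, here is a sketch of the alignment layer a token-classification setup would need, using offset mappings from a Hugging Face fast tokenizer (the checkpoint name is the standard public one, and the per-token tags are hypothetical model predictions):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")  # assumed checkpoint
text = "applephone"
enc = tok(text, return_offsets_mapping=True, add_special_tokens=False)

token_tags = ["B"] * len(enc["offset_mapping"])  # hypothetical per-token predictions

# Project token-level tags down to one tag per character via the offset mapping.
char_tags = ["I"] * len(text)
for (start, end), tag in zip(enc["offset_mapping"], token_tags):
    if tag == "B":
        char_tags[start] = "B"  # a 'B' token can only mark its first character

print(list(zip(text, char_tags)))
# Note: if one token spans a word boundary, no projection can split inside it.
```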

A note on CANINE’s architecture: CANINE uses stride-based downsampling to handle long character sequences, then upsamples back to character resolution. This means some character-level boundary information may be lost during compression.

Results: benchmark_5000

| Model | Strict | Lenient | Errors (lenient) |
|---|---|---|---|
| DKSplit v0.3.1 (production) | 4,340/5,000 | 4,451/5,000 | 549 |
| CANINE-C (epoch 8) | 4,245/5,000 | 4,340/5,000 | 660 |
| Qwen3.5-9B fine-tuned (adv prompt) | 4,238/5,000 | 4,337/5,000 | 663 |
| DKSplit v1 (older production) | 4,211/5,000 | 4,328/5,000 | 672 |
| Qwen3.5-9B fine-tuned (std prompt) | 4,204/5,000 | 4,311/5,000 | 689 |
| DeBERTa-V3 (epoch 3, 50% data) | 4,049/5,000 | 4,170/5,000 | 830 |
| WordSegment | 3,180/5,000 | 3,269/5,000 | 1,731 |
| Qwen3.5-9B zero-shot | 2,901/5,000 | 2,984/5,000 | 2,016 |
| WordNinja | 2,501/5,000 | 2,568/5,000 | 2,432 |

What the Numbers Tell Us

None of the Transformer or LLM models we tested surpass BiLSTM on segmentation accuracy. But the purpose of these experiments was not to find a drop-in replacement. It was to understand why each architecture falls short and what capabilities each brings.

A note on statistical significance: The gaps between the top models (e.g., BiLSTM vs. CANINE: 111 samples, BiLSTM vs. Qwen fine-tuned: 114 samples) are reported as descriptive counts on a fixed benchmark. We have not run significance tests (e.g., McNemar’s test) on these differences. Readers should treat small differences with caution; the qualitative error analysis below is more informative than the raw ranking.

Fine-tuning vs. zero-shot: Qwen3.5-9B jumps from 2,984 to 4,337 after fine-tuning (+1,353 samples), and this generalizes to completely unseen domains. The zero-shot model scores below the stronger statistical baseline (WordSegment: 3,269), confirming that a 9B LLM without task-specific training has no inherent advantage over dictionary lookup for character-level segmentation.

Inference throughput (single-core CPU, commodity desktop):

| Model | Runtime | Throughput (domains/sec) |
|---|---|---|
| DKSplit v0.3.1 | ONNX Runtime | ~1,600 |
| BiLSTM 384 | PyTorch | ~770 |
| CANINE-C 132M | PyTorch | ~100 |
| DeBERTa-V3 86M | PyTorch | ~80 |

Note: DKSplit’s ONNX runtime provides an estimated 2-3x speedup over PyTorch for the same architecture. The BiLSTM 384 row (same architecture, PyTorch) gives a more direct comparison to CANINE and DeBERTa. LLMs require GPU and are not candidates for production segmentation throughput.
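
For reference, a minimal single-core throughput probe of this kind, assuming an exported model file dksplit.onnx with an int64 input named char_ids (both names are hypothetical):

```python
import time
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1  # pin inference to a single CPU core
opts.inter_op_num_threads = 1
sess = ort.InferenceSession("dksplit.onnx", opts, providers=["CPUExecutionProvider"])

batch = np.random.randint(1, 100, size=(1, 24), dtype=np.int64)  # dummy char ids
n = 1_000
start = time.perf_counter()
for _ in range(n):
    sess.run(None, {"char_ids": batch})
print(f"{n / (time.perf_counter() - start):,.0f} domains/sec")
```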

Why Each Transformer Falls Short

DeBERTa: Blocked by Its Tokenizer

DeBERTa’s subword tokenizer (SentencePiece) merges characters across word boundaries into single tokens. When this happens, the model cannot place a split there. We measured this: 6.9% of samples (345/5,000) have at least one token crossing a word boundary, imposing a hard ceiling of ~93.1%. This is not a training problem; it is architectural. Replacing the tokenizer would invalidate the pretrained weights, defeating the purpose of using a pretrained model.

Conclusion: Subword-tokenized pretrained models are architecturally mismatched for character-level boundary tasks.
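
The measurement itself is straightforward to sketch: flag a sample whenever any token’s character span crosses a gold word boundary (checkpoint name assumed; the gold segmentations here are toy input):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")  # assumed checkpoint

def has_blocked_boundary(words: list[str]) -> bool:
    """True if any subword token spans a gold word boundary."""
    text = "".join(words)
    boundaries, pos = set(), 0
    for w in words[:-1]:  # a boundary falls after every word but the last
        pos += len(w)
        boundaries.add(pos)
    enc = tok(text, return_offsets_mapping=True, add_special_tokens=False)
    # a token (start, end) crosses boundary b when start < b < end
    return any(start < b < end
               for start, end in enc["offset_mapping"]
               for b in boundaries)

samples = [["apple", "phone"], ["payday", "loans"]]  # toy gold segmentations
blocked = sum(has_blocked_boundary(s) for s in samples)
print(f"{blocked}/{len(samples)} samples have an unreachable boundary")
```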

CANINE: Character-Level Input, but Downsampling Creates Uncertainty

CANINE’s character-level tokenizer gives it a theoretical 100% ceiling. It reaches 4,340/5,000 (86.8%), trailing BiLSTM by 111 samples. However, CANINE’s stride-based downsampling compresses character sequences before the deep Transformer layers. This may lose fine-grained boundary information, though we have not isolated downsampling as the cause: other factors (model capacity, pretraining data, CRF integration) could also contribute.

What we can say: CANINE with its current architecture does not match BiLSTM-CRF.

What remains open: Whether downsampling is the bottleneck. A clean ablation (e.g., reducing CANINE’s stride) would isolate this variable, but CANINE’s architecture makes this non-trivial. ByT5, which avoids downsampling entirely, offers a different angle on the same question, though a positive result alone would not prove downsampling was CANINE’s limiting factor.

CharBert: Historical Baseline, Variables Not Controlled

In January 2026, before the EuroHPC project, we trained a from-scratch character-level Transformer encoder (8-layer, 512 hidden, 25.5M parameters, softmax classification). It scored 3,717/5,000 lenient on benchmark_5000. We exclude it from the main results table because the comparison is not controlled: it used no CRF layer (all other sequence-labeling models use CRF), an earlier and less diverse training dataset, and a less rigorous training process (no validation set, no learning rate scheduling, no early stopping). With this many confounding variables, the result tells us nothing about the Transformer architecture itself. Rerunning this experiment with controlled variables is one of the two immediate next steps on Leonardo.

Error Overlap Across the Top Four Models

We compare the error sets of the four best models on benchmark_5000: DKSplit v0.3.1, CANINE-C, Qwen3.5-9B fine-tuned (adv prompt), and DeBERTa-V3.

| Group | Count |
|---|---|
| All 4 models wrong | 141 |
| Only DKSplit wrong | 111 |
| Only CANINE wrong | 185 |
| Only Qwen wrong | 165 |
| Only DeBERTa wrong | 352 |

The 141 samples where all four models fail are predominantly multilingual domain names: Turkish, Vietnamese, German compounds, brand portmanteaus. These are the cases where world knowledge would help most, and no model we tested has enough of it to consistently get them right.

DKSplit has the fewest unique errors (111): when it fails, others usually fail too. DeBERTa has the most unique errors (352), largely due to its tokenizer ceiling. The Oracle ceiling (at least one of the four is correct) is 4,859/5,000.

Hybrid Pipelines: Can We Combine Models?

The error overlap suggests complementarity: an Oracle ensemble (always picking the correct model) reduces errors from 549 to 141. We tested four strategies using DKSplit v0.3.1 as the primary model:

| Strategy | Best Result | vs. DKSplit Solo |
|---|---|---|
| Majority vote (4 models, equal weight) | 4,559/5,000 | +108 |
| Weighted vote (4 models, DKSplit 1.1x) | 4,559/5,000 | +108 |
| CRF confidence cascade (CANINE + DeBERTa fallback) | 4,528/5,000 | +77 |
| Learned stacking (logistic regression, 4-model, 5-fold CV) | 4,542/5,000 | +91 |

The Oracle ceiling is 4,859/5,000 (408 rescuable errors out of DKSplit’s 549). The best practical gain: +108 samples from simple majority voting with four models.

Simple majority voting captures the largest practical gain (+108), meaning the raw agreement signal is stronger than the confidence signals we tested. The CRF confidence cascade (+77) and learned stacking (+91) add complexity without exceeding the voting baseline. This makes sense: when three out of four models agree on an answer different from DKSplit, that consensus is usually correct. The harder question is what to do when models disagree without clear consensus, and none of our strategies solve this reliably.
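
A sketch of the voting rule with toy predictions. The tie-break (falling back to DKSplit when no strict majority exists) is one plausible choice consistent with DKSplit being the primary model, not a confirmed implementation detail:

```python
from collections import Counter

def majority_vote(predictions: dict[str, str], primary: str = "dksplit") -> str:
    """Return the consensus segmentation, or the primary model's answer."""
    best, best_count = Counter(predictions.values()).most_common(1)[0]
    return best if best_count > len(predictions) / 2 else predictions[primary]

preds = {  # hypothetical per-model outputs for one domain
    "dksplit": "bat dongsan kbang",
    "canine":  "bat dong san kbang",
    "qwen":    "bat dong san kbang",
    "deberta": "bat dong san kbang",
}
print(majority_vote(preds))  # -> 'bat dong san kbang' (3-of-4 consensus wins)
```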

However, 108 of the 408 rescuable errors is still only 26%. The remaining 300 errors have correct answers among the models but no voting or confidence signal strong enough to identify them. Better per-sample routing features (domain length, character entropy, language detection) could narrow this gap, but we have not yet explored these directions.
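
As an illustration of what such per-sample features might look like (we have not trained a router on these):

```python
import math
from collections import Counter

def routing_features(domain: str) -> dict[str, float]:
    """Cheap per-sample features a future router could condition on."""
    counts = Counter(domain)
    n = len(domain)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"length": float(n), "char_entropy": entropy}

print(routing_features("aydindaasansorlutasimacilik"))
```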

All four strategies we tested are output-level ensembles: they select among predictions after each model has already committed to an answer. We have not explored input-level routing, such as a lightweight language classifier that dispatches domains to specialized models by detected language. This is a natural next step suggested by the error data, but it requires either per-language models that outperform BiLSTM on their respective subsets, or a reliable language detector that works on concatenated strings without spaces. Neither exists yet in our pipeline.

LLM Behavior: Semantic Knowledge Worth Exploring

LLMs fail at precise segmentation, but we have long suspected that their world knowledge could serve a different purpose. The mutation data from this round of experiments provides concrete evidence for that hypothesis.

Character Mutations

We ran a mutation test on 100,000 real-world domain names from newly registered domain streams (2026-05-07 to 2026-05-11), all unseen during training:

| Configuration | Mutations per 100K |
|---|---|
| Qwen3.5-9B zero-shot | 5,618 |
| Qwen3.5-9B fine-tuned, character-preservation prompt, epoch 1 | 2,361 |
| Qwen3.5-9B fine-tuned, standard prompt, epoch 3 | 389 |
| Qwen3.5-9B fine-tuned, character-preservation prompt, epoch 3 | 252 |

Fine-tuning is the dominant factor: even one epoch reduces mutations by 58%, and by epoch 3 they drop to 252 (96% reduction overall) on completely unseen domains. Prompt design has marginal impact once training converges: swapping prompts at inference changes almost nothing (389 vs. 387; 252 vs. 257). The behavior is baked into weights during training.
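
The mutation check itself is simple; a minimal version counts an output as a mutation when its characters, with the inserted spaces removed, no longer match the input:

```python
def is_mutation(domain: str, llm_output: str) -> bool:
    """A generative output mutates the input if it changes any character."""
    return llm_output.replace(" ", "") != domain

print(is_mutation("applephone", "apple phone"))   # False: segmentation only
print(is_mutation("applephone", "apple iphone"))  # True: a character was added
```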

What the Mutations Reveal

The mutations are not random. They follow a pattern: the model substitutes what it considers a more probable sequence:

  • orcalyze becomes oracle lyze (high-frequency substitution: the model replaces “orca” with the more common sequence “oracle”)
  • bluoil becomes blue oil (category association: energy/petroleum domain)
  • zirvetekno becomes zirve teknoloji (Turkish language completion: the model completes the Turkish word for “technology”)
  • paydayloansfortlauderdalefl becomes payday loans for lauderdale fl (the model attempts geographic parsing but merges “fort” into “for”, losing the city name Fort Lauderdale)
  • allfundinggoupllc becomes all funding group llc (reconstructs missing “r” in “group”: financial/corporate domain)
  • entertheesperience becomes enter the experience (corrects “esperience” to “experience”: marketing domain)
  • ethiccouch becomes ethnic couch (semantic probability: “ethnic couch” is more likely in the training distribution)
  • organictomatos becomes organic tomatoes (spelling correction: food/agriculture domain)

The examples above fall into a few recurring categories: language completion (zirvetekno becomes teknoloji), spelling correction (organictomatos becomes tomatoes), high-frequency substitution (orcalyze becomes oracle), and semantic reinterpretation (ethiccouch becomes ethnic couch). Fine-tuning reduces the total count by 96%, but does not eliminate any category entirely. Even at 252 mutations per 100K, all four types still appear. The mutations that survive training tend to involve inputs where the “correct” reading and the model’s preferred reading are genuinely ambiguous.
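
For triage, a small difflib-based helper can surface the raw character edits behind each mutation; assigning one of the four categories still takes human judgment:

```python
import difflib

def char_edits(domain: str, llm_output: str) -> list[tuple[str, str, str]]:
    """List (operation, original span, replacement span) per character edit."""
    stripped = llm_output.replace(" ", "")
    sm = difflib.SequenceMatcher(a=domain, b=stripped)
    return [(op, domain[i1:i2], stripped[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

print(char_edits("organictomatos", "organic tomatoes"))
# -> [('insert', '', 'e')]  (the spelling-correction example above)
```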

These mutations are a liability for segmentation. But they reveal something the character-level models do not have: the model possesses world knowledge about which character sequences are linguistically and commercially plausible. The right question is not how to suppress this knowledge, but how to use it through the right interface. That interface is unlikely to be segmentation itself.

Open Questions

  1. ByT5 and improved character-level Transformer. We tested CANINE (with downsampling) and DeBERTa (with subword tokenization), both structurally limited. ByT5 (Google’s byte-level T5) uses pure byte-level input without either limitation. We also need to rerun the from-scratch Transformer with CRF and full training data. Both experiments are being prepared for Leonardo, with the controlled Transformer-with-CRF run first and ByT5 contingent on its outcome.
  2. DeBERTa with optimized data pipeline. Our DeBERTa used data prepared for BiLSTM. A pipeline designed for subword models might improve results within the 93.1% ceiling, but the ceiling itself is the fundamental constraint.
  3. LLM knowledge beyond segmentation. The mutations demonstrate that LLMs encode information about character sequences that purely statistical models do not. The shape of that information, and the tasks where it could be useful, remain open questions worth pursuing on their own terms rather than as extensions of the segmentation problem.
  4. LLMs as token classifiers. Our current LLM results reflect the generative evaluation format, not necessarily the limits of LLM knowledge for segmentation. A token-classification evaluation (with a subword-to-character alignment layer) would establish a fairer comparison to the sequence-labeling models and separate format-induced errors from genuine knowledge gaps.

What Comes Next

The experiments so far have ruled out two specific structural paths and narrowed the search to a question we could not have formulated before this work:

Can a pure character/byte-level Transformer, without subword tokenization and without downsampling, match BiLSTM-CRF on character-boundary tasks?

DeBERTa is blocked by its tokenizer (a structural limitation we can measure). CANINE underperforms with its current architecture, though we have not isolated downsampling as the sole cause. These two results point toward character/byte-level input without downsampling, but other dimensions remain unexplored: training objectives (MLM vs. span corruption vs. CTC), model scale effects, and alternative structured prediction heads beyond CRF.

Two experiments will address this, in sequence:

First: Character-Level Transformer with CRF (controlled comparison). Same CRF layer, same training data, same methodology as BiLSTM. This is the prerequisite experiment: it isolates the pure architectural question (Transformer attention vs. BiLSTM recurrence) with all other variables controlled. If this fails, it weakens the case for pursuing large pretrained character/byte models as segmentation replacements, though it would not rule out gains from pretraining.
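
A structural sketch of that model, assuming a CRF head of the same kind as DKSplit’s (hyperparameters are placeholders, not the Leonardo configuration; uses the pytorch-crf package):

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class CharTransformerCRF(nn.Module):
    """Character-level Transformer encoder with a CRF head over B/I tags."""
    def __init__(self, vocab_size=128, d_model=512, n_layers=8,
                 n_tags=2, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_tags = nn.Linear(d_model, n_tags)  # per-character B/I emissions
        self.crf = CRF(n_tags, batch_first=True)

    def _emissions(self, char_ids):
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        x = self.embed(char_ids) + self.pos(positions)
        return self.to_tags(self.encoder(x))

    def loss(self, char_ids, tags, mask):
        return -self.crf(self._emissions(char_ids), tags, mask=mask)

    def decode(self, char_ids, mask):
        return self.crf.decode(self._emissions(char_ids), mask=mask)
```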

Second, contingent on the first: ByT5 (Byte-level T5). ByT5 removes both subword tokenization and downsampling simultaneously, so it cannot isolate which variable matters. It is an exploratory probe, not a hypothesis test: it answers the end-to-end question of whether a pretrained Transformer, free of both known structural limitations, can match BiLSTM precision while bringing world knowledge. If the controlled Transformer experiment succeeds, ByT5 tests whether pretraining adds value. If the controlled experiment fails, ByT5 becomes lower priority.

The search for a teacher model is not over. The next step is character/byte-level architectures, but other approaches (different training objectives, model scales, structured prediction alternatives) remain untested.

Benchmark Data

The benchmark dataset and per-model prediction results are available for download:

  • benchmark_5000.csv (5,000 samples with ground truth): Download

DKSplit on EuroHPC Series

  1. A Two-Week Journey on EuroHPC Leonardo
  2. Cleaner Benchmark, First DeBERTa Run, Different Failure Modes
  3. Searching for a Teacher Model Across Architectures (this post)

We acknowledge the European High Performance Computing Joint Undertaking (EuroHPC JU) for awarding this project access to the Leonardo supercomputer, hosted by CINECA in Italy.

Co-funded by the European Union. Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European High Performance Computing Joint Undertaking.