Continuing the work we started in our previous post.
What We Did These Two Weeks
Leonardo was in a planned maintenance window for most of the past two weeks. With less compute available, we focused on the things we could do without it, and ran the experiments we could fit into the remaining time.
We did three things:
- Adjusted the benchmark. We removed two categories of samples that should not have been in there to begin with: pure-noise strings with no language signal, and strings whose segmentation is decided by digits rather than language. The second category is something we should also handle with rules in production rather than asking the model to do it. The new benchmark is available at github.com/ABTdomain/dksplit/tree/main/benchmark.
- Trained a first DeBERTa-V3 model (v0.1). This is the next architecture we wanted to try, mentioned in the previous post. The Qwen prompt-optimization run is also in flight, but its training did not finish before maintenance started.
- Compared four segmenters on the new benchmark and looked at what kinds of errors each one makes.
What follows is the comparison, the takeaway, and what comes next.
What We Changed in the Benchmark
For this round we made two changes to the benchmark. First, we removed two categories of samples that the previous test set was carrying:
- Digit-driven inputs. Inputs like 824fisher or 3999bethello are effectively pre-split by digit boundaries. Any model gets these “right” without doing any real linguistic work, which means they were inflating accuracy without testing the part of the task we actually care about.
- Pure-noise strings. Inputs like hbwhjhzx or wlmqsyz contain no vowels and no language signal at all. Whatever a model outputs on these is a guess, so they added noise to every model’s score in either direction.
Both kinds of inputs are better handled by deterministic rules in production than by asking the model. For example, 247buyacar should be pre-split on the digit boundary into 247 and buyacar before segmentation runs, and pure-noise consonant clusters can be filtered out by a simple vowel-check rule. We are moving this kind of pre-processing into the production pipeline.
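As a rough illustration, here is a minimal sketch of both rules. The function names are ours and this is not the actual production code:

```python
import re

def presplit_on_digits(s: str) -> list[str]:
    """Split a raw prefix on digit/letter boundaries before segmentation runs.

    Hypothetical rule sketch: runs of digits and runs of non-digits become separate
    chunks, so only the letter chunks ever reach the segmenter.
    """
    return re.findall(r"\d+|[^\d]+", s)

def looks_like_noise(s: str) -> bool:
    """Crude vowel-check filter: flag strings with no vowels at all as pure noise."""
    return not re.search(r"[aeiouy]", s.lower())

if __name__ == "__main__":
    print(presplit_on_digits("247buyacar"))  # ['247', 'buyacar']
    print(looks_like_noise("hbwhjhzx"))      # True
    print(looks_like_noise("traderooapp"))   # False
```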
Second, we added more samples and re-audited the whole set, ending up with 1,000 hand-audited prefixes that focus on what the segmenter actually has to do: deciding word boundaries inside concatenated language.
A few things worth being explicit about, since they affect how to read the results below:
- Lenient EM is defined as: the prediction matches truth exactly, or matches might_right exactly. Strict EM only counts matches against truth. Both are computed at the full-string level after lowercasing (a small sketch of the computation follows this list).
- The labeling style favors brand-preserving segmentations. Where a string can plausibly be a brand kept whole or a phrase split apart, we tend to keep the brand whole and put the alternative in might_right. This is a deliberate design choice for our use cases (domain analysis, brand monitoring), not a universal truth. A team running an SEO or aggressive-recall workflow would reasonably re-label the same set differently.
- 627 of the 1,000 samples are .com domains registered in April 2026, after the training cutoff of every model we evaluate. They are not guaranteed unseen by all models, but the freshness reduces the risk that we are mostly measuring memorization on this slice of the test set.
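For reference, a minimal sketch of the two metrics as described above; truth and might_right are the benchmark fields, and the helper names are ours:

```python
def normalize(s: str) -> str:
    # Exact match is computed at the full-string level after lowercasing.
    return s.strip().lower()

def strict_em(prediction: str, truth: str) -> bool:
    # Strict EM: only an exact match against `truth` counts.
    return normalize(prediction) == normalize(truth)

def lenient_em(prediction: str, truth: str, might_right: str | None) -> bool:
    # Lenient EM: an exact match against either `truth` or `might_right` counts.
    if strict_em(prediction, truth):
        return True
    return might_right is not None and normalize(prediction) == normalize(might_right)
```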
Four Segmenters on the New Benchmark
We ran four families of segmenters on this benchmark.
| Model | Type | Lenient EM |
|---|---|---|
| DKSplit (current production) | Character-level BiLSTM-CRF | 91.5% |
| DeBERTa-V3 v0.1 | Subword transformer + CRF | 89.4% |
| WordSegment | Dictionary + dynamic programming | 69.4% |
| WordNinja | Word-frequency greedy split | 53.9% |
A note on the dictionary-based libraries. WordSegment and WordNinja work well on standard English text. They were not designed for newly registered domains, many of which are brand coinages, multilingual compounds, or intentional misspellings. The 69.4% / 53.9% numbers are not a fair indictment of those libraries. They show that the task has shifted away from what those libraries were built for.
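For transparency, the dictionary baselines were run roughly like this, joining each library's token list back into a space-separated string before scoring (the exact call sites in our harness may differ slightly):

```python
import wordninja
from wordsegment import load, segment

load()  # wordsegment needs its frequency tables loaded once per process

def run_wordsegment(prefix: str) -> str:
    # Dictionary + dynamic programming over word frequencies.
    return " ".join(segment(prefix))

def run_wordninja(prefix: str) -> str:
    # Word-frequency-driven split, as characterized in the table above.
    return " ".join(wordninja.split(prefix))

print(run_wordsegment("choosespain"))  # output depends on the library's frequency tables
print(run_wordninja("choosespain"))
```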
The interesting comparison is between DKSplit and DeBERTa.
How Each Model Fails
Segmenting a concatenated string only has two error types. The model never adds characters, only inserts spaces. So either:
- It splits where it shouldn’t have (over-segmentation), or
- It fails to split where it should have (under-segmentation).
Boundary shifts are just both at once, in different positions.
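A minimal sketch of how a miss can be classified this way, comparing boundary-position sets between prediction and truth; the helper names are ours, not part of the benchmark tooling:

```python
def boundary_positions(segmented: str) -> set[int]:
    """Character offsets (in the unsegmented string) after which a space was inserted.

    e.g. 'trade roo app' -> {5, 8}: a space after the 5th and 8th character of 'traderooapp'.
    """
    positions, offset = set(), 0
    for word in segmented.split()[:-1]:
        offset += len(word)
        positions.add(offset)
    return positions

def classify_miss(predicted: str, truth: str) -> dict[str, int]:
    pred, gold = boundary_positions(predicted), boundary_positions(truth)
    return {
        "over_segmentation": len(pred - gold),   # splits inserted where truth has none
        "under_segmentation": len(gold - pred),  # splits missing where truth has one
    }

# 'trade roo app' vs truth 'traderoo app': one spurious boundary.
print(classify_miss("trade roo app", "traderoo app"))
# {'over_segmentation': 1, 'under_segmentation': 0}
```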
When we classify every miss this way, the picture changes:
| Model | Over-segmentation | Under-segmentation | Error pattern |
|---|---|---|---|
| DKSplit | 1.8% | 1.2% | Balanced |
| DeBERTa v0.1 | 1.4% | 8.4% | Conservative |
| WordSegment | 23.9% | 0.9% | Aggressive |
| WordNinja | 39.3% | 0.3% | Very aggressive |
DKSplit errs in either direction with about equal probability. There is no systematic bias.
DeBERTa v0.1 produces the fewest over-segmentation errors of any model we tested, fewer even than DKSplit. But it under-segments at close to 8%. When DeBERTa is uncertain, it tends to leave the string alone rather than insert a split.
WordSegment and WordNinja both lean strongly toward over-segmentation. When their dictionary cannot cover an input, they break it apart greedily.
A note on sample size. With 1,000 prefixes, we are looking at low-double-digit error counts for DKSplit and DeBERTa each. Small differences in those percentages are not statistically reliable on their own. What we found more informative was looking at the actual cases where each model fails. The pattern there is much clearer than any single number.
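As a back-of-the-envelope check on that, here is a 95% Wilson interval for an error rate measured on 1,000 samples, using the 1.8% over-segmentation figure as an example. This is our sanity check, not part of the published benchmark:

```python
import math

def wilson_ci(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an error rate estimated from n samples."""
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# 18 over-segmentation errors out of 1,000 samples (the 1.8% figure) gives
# roughly (1.1%, 2.8%): wide relative to the point estimate.
print(wilson_ci(18, 1000))
```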
This is the key point of this whole comparison: the gap in headline accuracy between DKSplit and DeBERTa is small, but the kinds of mistakes they make are not the same kind at all. Two models can sit within a couple of percentage points of each other and still behave very differently in production, depending on which type of error your application can tolerate.
Some examples where DKSplit over-segments and DeBERTa gets it right. The “Might right” column shows the alternative segmentation our benchmark accepts (the might_right field), if any:
| Input | Truth | Might right | DKSplit | DeBERTa v0.1 |
|---|---|---|---|---|
| traderooapp | traderoo app | n/a | trade roo app | traderoo app |
| tokonameshaken | tokoname shaken | n/a | toko name shaken | tokoname shaken |
| airopsaxis | airops axis | n/a | air ops axis | airops axis |
| nowetas | nowetas | n/a | now etas | nowetas |
| implementaonline | implementa online | n/a | implement a online | implementa online |
The truth here is the brand-preserving form. DKSplit broke the brand or the unfamiliar root into smaller pieces; DeBERTa kept it whole.
Some examples where DeBERTa under-segments and DKSplit gets it right:
| Input | Truth | Might right | DKSplit | DeBERTa v0.1 |
|---|---|---|---|---|
| ninjagaidentv | ninja gaiden tv | n/a | ninja gaiden tv | ninja gaidentv |
| feedsentry | feed sentry | n/a | feed sentry | feedsentry |
| flykestral | fly kestral | n/a | fly kestral | flykestral |
| greentechcostarica | green tech costa rica | n/a | green tech costa rica | green tech costarica |
| missioncriticalgovernance | mission critical governance | n/a | mission critical governance | missioncritical governance |
It is worth pointing out that these are not “brand words DKSplit happened to get lucky on.” They are everyday phrases like feed sentry, fly kestral, green tech costa rica, mission critical governance. The right answer is to split, DKSplit splits, and DeBERTa does not. DeBERTa’s conservatism shows up on plain words too, not just on inputs that happen to look like brand names.
In short: DKSplit is more willing to insert a boundary when it isn’t sure; DeBERTa is more willing to leave the string alone.
Our reading is straightforward: DeBERTa v0.1 is undertrained. This is a single run with default hyperparameters and no ablations, so the model has only had one shot at this task. The unchanged-output rate is consistent with that: roughly an eighth of its outputs are the input string returned unchanged, which is what we would expect from a model that has not yet learned to commit to boundary decisions on unfamiliar inputs. On the remaining samples, where it does make a real call, it is much closer to DKSplit. So the gap is not that the architecture cannot do this task; it is that the model has not been given enough of the right training signal yet.

Two things to fix in v0.2 are proper hyperparameter and ablation work, and a better data preparation step, in particular how character-level labels are mapped onto subword tokens. The encouraging part is that v0.1 already shows useful behavior on more than half of the test set; the potential is there for the next versions to realize it.
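On that data-preparation point, here is a minimal sketch of one way to project character-level boundary labels onto subword tokens with a fast Hugging Face tokenizer's offset mapping. The checkpoint name and the label scheme (1 = a word boundary ends exactly at this token) are illustrative assumptions, not necessarily what v0.1 used:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

def char_boundaries(truth: str) -> set[int]:
    """Character offsets (in the unsegmented string) after which a space sits in `truth`."""
    positions, offset = set(), 0
    for word in truth.split()[:-1]:
        offset += len(word)
        positions.add(offset)
    return positions

def subword_labels(raw: str, truth: str) -> list[int]:
    """Label each subword token 1 if a gold word boundary ends exactly at its last character."""
    boundaries = char_boundaries(truth)
    enc = tokenizer(raw, return_offsets_mapping=True, add_special_tokens=False)
    # Note: if no token ends exactly at a gold boundary, that label is silently dropped,
    # which is exactly the alignment failure mode we want to handle better in v0.2.
    return [1 if end in boundaries else 0 for _, end in enc["offset_mapping"]]

print(subword_labels("feedsentry", "feed sentry"))
```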
Different models have different failure modes. A useful pipeline can use each of them where their failure mode is acceptable: a conservative model in places where false splits are costly, a balanced model as a general default, a fast aggressive segmenter for recall-oriented tasks with downstream filtering.
This is the direction we are heading: a hybrid pipeline where DKSplit handles the main traffic, DeBERTa is consulted as a second opinion on high-stakes inputs, and disagreements between the two are escalated to a larger model or a human reviewer. Every escalation feeds back into the dataset for the next training round.
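A sketch of what that routing could look like; the callables and the escalate() hook stand in for our internal interfaces, which are not published:

```python
def hybrid_segment(prefix: str, dksplit, deberta, escalate, high_stakes: bool = False) -> str:
    """Route a single prefix through the hybrid pipeline sketched above.

    dksplit / deberta are callables str -> str; escalate sends a disagreement to a
    larger model or a human reviewer and logs the case for the next training round.
    """
    primary = dksplit(prefix)
    if not high_stakes:
        return primary  # DKSplit handles the main traffic on its own

    second_opinion = deberta(prefix)
    if second_opinion == primary:
        return primary  # both models agree, no escalation needed

    # Disagreement on a high-stakes input: escalate, and feed the case back
    # into the dataset for the next training round.
    return escalate(prefix, candidates=[primary, second_opinion])
```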
What’s Next
A few specific things we are working on:
- Continue DeBERTa training when Leonardo comes back. v0.1 is a starting point. There is more to do on the training data and on hyperparameters before we know what this architecture is actually capable of on this task.
- Finish the queued Qwen run. The Qwen 9B prompt-optimization experiment is still in the queue. We expect to share results in the next update.
- Keep growing the audited benchmark. Especially around ambiguous brand-versus-compound cases, where reasonable people disagree on the right split.
A single accuracy number compresses a lot of information. Looking at how each model is wrong gave us a different picture of what each one is good for. The hybrid prototype mentioned earlier in this post is the most direct way we are putting that into practice.
Side note: the Qwen 9B LoRA checkpoint from our previous post is available on Hugging Face at ABTdomain/dksplit-qwen-lora. We use it for research and offline labeling, not as a runtime segmenter. Production still runs on the BiLSTM in the dksplit pip package.
Models tested locally on a 1,000-sample audited benchmark. DKSplit corresponds to the current public pip release. DeBERTa-V3 v0.1 was trained on EuroHPC Leonardo as part of the AI Factory Playground program. Detailed per-sample results and the benchmark file are available on request. This is an engineering evaluation, not an academic one. We have not measured inter-annotator agreement or used probabilistic labels. The benchmark is open so others can apply stricter methodology if they need to.


We acknowledge the European High Performance Computing Joint Undertaking (EuroHPC JU) for awarding this project access to the Leonardo supercomputer, hosted by CINECA in Italy.
Co-funded by the European Union. Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European High Performance Computing Joint Undertaking.