In our previous update, we outlined a systematic search for a model with enough world knowledge to handle the cases that DKSplit cannot: multilingual compounds, brand portmanteaus, domain names that require understanding meaning, not just character patterns. We concluded with two planned experiments: a controlled character-level Transformer comparison, and a pretrained byte-level model.
In a separate post, we explored Gemma 4 31B on domain segmentation and found that LLMs possess relevant world knowledge but cannot deliver it with the character-level precision that segmentation requires.
This update covers the CharBERT controlled experiment and the status of ByT5-CRF training.
CharBERT: The Controlled Experiment
Before exploring further, we needed to close an open question from the midterm report. Our earlier from-scratch Transformer had too many uncontrolled variables (no CRF, different data, less rigorous training) to draw architectural conclusions. We designed a more controlled experiment, keeping the remaining differences explicit.
Setup
CharBERT is a from-scratch character-level Transformer encoder with a CRF head — no pretrained weights, randomly initialized. We ran two training groups on different data splits: Group A uses the same data as our previous DeBERTa experiment, and Group B uses the same data as DKSplit’s production model. This lets us compare against both a pretrained subword model and DKSplit itself.
The comparison is not perfectly controlled — CharBERT and DKSplit differ in architecture, parameter count, training epochs, optimizer, and learning rate schedule — but it is substantially more controlled than our earlier attempt.
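To make the "character-level encoder + CRF head" framing concrete, here is a minimal sketch of how a segmentation can be expressed as one label per character, which is the kind of sequence a CRF head would score. The B/I tagging scheme, the function names, and the example domain are assumptions for illustration; the post does not state which label set CharBERT or DKSplit actually use.

```python
def to_labels(segments):
    """Encode a segmentation as one label per character.

    Assumed scheme: 'B' marks the first character of a word, 'I' any other.
    """
    labels = []
    for seg in segments:
        if not seg:  # ignore empty segments
            continue
        labels.extend(["B"] + ["I"] * (len(seg) - 1))
    return labels


def from_labels(text, labels):
    """Decode per-character labels back into a list of words."""
    words = []
    for ch, lab in zip(text, labels):
        if lab == "B" or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # extend the current word
    return words


# Illustrative example (not taken from the benchmark).
segments = ["open", "source", "hub"]
text = "".join(segments)
labels = to_labels(segments)
print(list(zip(text, labels)))
print(from_labels(text, labels))
```

Because every character gets its own label, any boundary position can be represented, which is the property the subword comparison in the results turns on.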
Results (benchmark_5000[^1])
| Model | Strict EM | Lenient EM |
|---|---|---|
| DeBERTa v2 (pretrained, Group A data) | 4037 | 4255 |
| CharBERT Group A (no pretrain, Group A data) | 4284 | 4464 |
| DKSplit v0.3.1 (production)[^2] | 4343 | 4521 |
| CharBERT Group B (no pretrain, Group B data) | 4402 | 4563 |
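For readers who want to reproduce the two columns, the following is a small sketch of the scoring rule described in the footnote: Strict EM counts exact matches against `truth`, and Lenient EM also accepts a match against `might_right`. The dictionary layout, the space-separated segmentation strings, and the toy domains are assumptions; only the two column names come from the benchmark description.

```python
def score(predictions, truth, might_right):
    """Return (strict, lenient) exact-match counts.

    predictions, truth: dict mapping domain -> segmentation string.
    might_right: dict with an optional alternative segmentation per domain.
    """
    strict = lenient = 0
    for domain, pred in predictions.items():
        if pred == truth[domain]:
            strict += 1
            lenient += 1
        elif domain in might_right and pred == might_right[domain]:
            lenient += 1
    return strict, lenient


def pct(count, total=5000):
    """Convert a count out of `total` domains to a percentage."""
    return round(100 * count / total, 2)


# Toy data, not taken from benchmark_5000.csv.
truth = {
    "newyorkpizza": "new york pizza",
    "expertsexchange": "experts exchange",
    "nowhere": "nowhere",
}
might_right = {"nowhere": "now here"}
predictions = {
    "newyorkpizza": "new york pizza",        # matches truth
    "expertsexchange": "expert sex change",  # matches neither
    "nowhere": "now here",                   # matches might_right only
}

strict, lenient = score(predictions, truth, might_right)
print(strict, lenient)

# Counts from the table, expressed as percentages of 5,000 domains.
print(pct(4402), pct(4563))  # CharBERT Group B: strict, lenient
```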
Two observations stand out:
Group A vs. DeBERTa v2: CharBERT Group A (4464 Lenient EM) outperforms DeBERTa v2 (4255) despite using the same training data and no pretrained weights. This suggests that subword-pretrained models are not automatically better for this task. In this setup, DeBERTa’s tokenizer appears poorly matched to character-level boundary prediction: it can merge characters across word boundaries, making some splits difficult or impossible to recover cleanly.
Group B vs. DKSplit: CharBERT Group B (4563 Lenient EM) slightly exceeds DKSplit v0.3.1 (4521), but the difference is modest: 42 domains out of 5,000, or 0.84 percentage points.
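The tokenizer mismatch in the first observation can be illustrated with a small sketch. It checks whether a true word boundary falls strictly inside a subword token, in which case a model that labels whole tokens cannot place the split there. The token list below is a hypothetical tokenization chosen for illustration, not actual DeBERTa tokenizer output, and the function names are assumptions.

```python
def token_spans(tokens):
    """Return (start, end) character offsets for consecutive tokens."""
    spans, start = [], 0
    for tok in tokens:
        spans.append((start, start + len(tok)))
        start += len(tok)
    return spans


def unrecoverable_boundaries(tokens, segments):
    """List true boundary offsets that fall strictly inside a token."""
    boundaries, pos = [], 0
    for seg in segments[:-1]:
        pos += len(seg)
        boundaries.append(pos)
    spans = token_spans(tokens)
    return [b for b in boundaries if any(s < b < e for s, e in spans)]


truth = ["experts", "exchange"]

# Hypothetical subword tokenization that straddles the true boundary.
subword_tokens = ["expert", "sex", "change"]
print(unrecoverable_boundaries(subword_tokens, truth))

# Character-level tokens never straddle a boundary.
char_tokens = list("expertsexchange")
print(unrecoverable_boundaries(char_tokens, truth))
```

In the first case the boundary after "experts" sits inside the token "sex", so no token-level labeling can express the correct split; with one token per character the problem cannot arise.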
Conclusion
CharBERT Group B slightly outperforms DKSplit’s production model in total accuracy. In this setup, a randomly initialized Transformer + CRF reaches comparable accuracy but does not show a clear enough advantage to justify replacing the production model. We do not plan to release CharBERT.
Because an architecture change alone does not produce a meaningful accuracy improvement, we infer that many of the remaining errors require knowledge that neither architecture can reliably learn from character patterns alone.
ByT5-CRF: Early Results
ByT5-Small (Google’s byte-level T5) + CRF is currently training on Leonardo to test whether byte-level pretraining helps this task. Training is still in progress, but we can share preliminary results from the best checkpoint so far (epoch 19 of 20).
Preliminary Results (benchmark_5000[^1])
| Model | Strict EM | Lenient EM |
|---|---|---|
| DKSplit v0.3.1 (production)[^2] | 4343 | 4521 |
| CharBERT Group B (no pretrain) | 4402 | 4563 |
| ByT5-CRF (epoch 19, preliminary) | 4453 | 4610 |
ByT5-CRF reduces errors across all three categories compared to DKSplit v0.3.1:
| Error type | ByT5-CRF | DKSplit v0.3.1 |
|---|---|---|
| Total errors | 390 | 479 |
| Over-split | 176 | 198 |
| Under-split | 142 | 184 |
| Wrong boundary | 72 | 97 |
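The post does not spell out how the three error categories are assigned. A minimal sketch of one plausible rule, comparing the sets of boundary positions, is shown below: a strict superset of the true boundaries counts as over-split, a strict subset as under-split, and anything else as a wrong boundary. These definitions, the function names, and the example domains are assumptions, and the actual benchmark tooling may differ.

```python
def boundaries(segments):
    """Return the set of character offsets where a split occurs."""
    out, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        out.add(pos)
    return out


def classify(pred, truth):
    """Classify a prediction against the reference segmentation.

    Assumes both lists concatenate to the same domain string.
    """
    p, t = boundaries(pred), boundaries(truth)
    if p == t:
        return "correct"
    if p > t:   # every true boundary found, plus extra ones
        return "over-split"
    if p < t:   # no spurious boundary, but some are missing
        return "under-split"
    return "wrong boundary"


# Illustrative cases (not taken from the benchmark).
print(classify(["new", "york", "pizza"], ["newyork", "pizza"]))
print(classify(["newyorkpizza"], ["new", "york", "pizza"]))
print(classify(["newy", "orkpizza"], ["new", "yorkpizza"]))

# Sanity check on the table: the categories sum to the totals.
assert 176 + 142 + 72 == 390
assert 198 + 184 + 97 == 479
```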
These are preliminary numbers from an incomplete training run. Training is ongoing and further experiments are planned — full results and analysis will be covered in a future update.
Looking Ahead
Across CharBERT and ByT5-CRF, a pattern emerges: accuracy gains over DKSplit’s production model are real but modest. ByT5-CRF’s gain of 89 correct domains (Lenient EM, 4610 vs. 4521) comes from a model with roughly 12× the parameters of DKSplit’s BiLSTM-CRF. Whether this trade-off justifies the increased computational cost in production remains an open question. Unlike generative LLMs, however, ByT5-CRF is a sequence labeling model that can follow the same ONNX optimization and quantization pipeline as the current production model. We will continue to explore this direction.
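To put "real but modest" in numbers, the short sketch below derives the absolute gain in percentage points and the relative error reduction from the preliminary counts reported above. Only the counts come from this post; the variable names are illustrative, and the figures will change once training completes.

```python
TOTAL = 5000  # domains in benchmark_5000

# Lenient-EM error counts from the preliminary error table.
errors = {
    "Total": {"byt5_crf": 390, "dksplit": 479},
    "Over-split": {"byt5_crf": 176, "dksplit": 198},
    "Under-split": {"byt5_crf": 142, "dksplit": 184},
    "Wrong boundary": {"byt5_crf": 72, "dksplit": 97},
}

# Relative error reduction per category, in percent.
relative = {
    name: 100 * (row["dksplit"] - row["byt5_crf"]) / row["dksplit"]
    for name, row in errors.items()
}

gained = errors["Total"]["dksplit"] - errors["Total"]["byt5_crf"]
points = 100 * gained / TOTAL  # absolute gain in percentage points

print(f"Additional correct domains: {gained} ({points:.2f} points)")
for name, value in relative.items():
    print(f"{name}: {value:.1f}% fewer errors")
```

The absolute gain is under two percentage points, while the relative reduction in remaining errors is considerably larger, which is why both views are worth reporting alongside the parameter count.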
DKSplit on EuroHPC Series
- A Two-Week Journey on EuroHPC Leonardo
- Cleaner Benchmark, First DeBERTa Run, Different Failure Modes
- Searching for a Teacher Model Across Architectures
- From Domain Segmentation to Reading Domain Signals
- CharBERT and ByT5-CRF (this post)
[^1]: Strict EM requires matching truth exactly; Lenient EM accepts a match on either truth or might_right. All counts are out of 5,000 domains. See our midterm report for full benchmark methodology. Dataset: benchmark_5000.csv.


We acknowledge the European High Performance Computing Joint Undertaking (EuroHPC JU) for awarding this project access to the Leonardo supercomputer, hosted by CINECA in Italy.
Co-funded by the European Union. Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European High Performance Computing Joint Undertaking.