DKSplit on EuroHPC: CharBERT and ByT5

In our previous update, we outlined a systematic search for a model with enough world knowledge to handle the cases that DKSplit cannot: multilingual compounds, brand portmanteaus, and domain names whose segmentation depends on meaning rather than character patterns alone. We concluded with two planned experiments: a controlled character-level Transformer comparison, and a pretrained byte-level model.

In a separate post, we explored Gemma 4 31B on domain segmentation and found that LLMs possess relevant world knowledge but cannot deliver it with the character-level precision that segmentation requires.

This update covers the CharBERT controlled experiment and the status of ByT5-CRF training.

CharBERT: The Controlled Experiment

Before exploring further, we needed to close an open question from the midterm report. Our earlier from-scratch Transformer had too many uncontrolled variables (no CRF, different data, less rigorous training) to draw architectural conclusions. We designed a more controlled experiment, keeping the remaining differences explicit.

Setup

CharBERT is a from-scratch character-level Transformer encoder with a CRF head — no pretrained weights, randomly initialized. We ran two training groups on different data splits: Group A uses the same data as our previous DeBERTa experiment, and Group B uses the same data as DKSplit’s production model. This lets us compare against both a pretrained subword model and DKSplit itself.

The comparison is not perfectly controlled — CharBERT and DKSplit differ in architecture, parameter count, training epochs, optimizer, and learning rate schedule — but it is substantially more controlled than our earlier attempt.
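To make the sequence-labeling framing concrete, here is a minimal sketch of how a segmentation can be expressed as one label per character, which is the kind of target an encoder with a CRF head predicts. The two-tag B/I scheme, the function names, and the example domain are assumptions for illustration; the post does not specify the label set CharBERT actually uses.

```python
def encode_labels(words):
    """Map a gold segmentation to one label per character.

    Assumed scheme: "B" marks the first character of a word, "I" any other.
    Words are assumed to be non-empty.
    """
    labels = []
    for word in words:
        labels.extend(["B"] + ["I"] * (len(word) - 1))
    return labels


def decode_labels(text, labels):
    """Rebuild the word list from per-character labels."""
    words = []
    for ch, label in zip(text, labels):
        if label == "B" or not words:
            words.append(ch)      # start a new word
        else:
            words[-1] += ch       # extend the current word
    return words


domain = "greenapple"              # hypothetical example input
gold = ["green", "apple"]
labels = encode_labels(gold)
print("".join(labels))             # BIIIIBIIII
print(decode_labels(domain, labels))
```

Because every character gets its own label, any split position is representable, which is not guaranteed when the model sees multi-character tokens.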

Results (benchmark_5000[^1])

Model	Strict EM (correct of 5,000)	Lenient EM (correct of 5,000)
DeBERTa v2 (pretrained, Group A data)	4037	4255
CharBERT Group A (no pretrain, Group A data)	4284	4464
DKSplit v0.3.1 (production)[^2]	4343	4521
CharBERT Group B (no pretrain, Group B data)	4402	4563
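For readers unfamiliar with the two columns, the following sketch shows one way the counts could be computed. The field names `truth` and `might_right` come from the benchmark notes at the end of this post; treating `might_right` as a list of alternative segmentations, counting Strict EM against `truth` only, and the record layout itself are assumptions, not the benchmark's actual code.

```python
def exact_match_counts(records, predictions):
    """Return (strict, lenient) exact-match counts.

    Assumption: strict accepts only `truth`; lenient also accepts any
    alternative listed under `might_right`.
    """
    strict = lenient = 0
    for rec, pred in zip(records, predictions):
        if pred == rec["truth"]:
            strict += 1
            lenient += 1
        elif pred in rec.get("might_right", []):
            lenient += 1
    return strict, lenient


# Hypothetical toy records, not taken from benchmark_5000.
records = [
    {"truth": "green apple", "might_right": []},
    {"truth": "new york times", "might_right": ["newyork times"]},
    {"truth": "blue sky", "might_right": []},
]
predictions = ["green apple", "newyork times", "bluesky"]

strict, lenient = exact_match_counts(records, predictions)
print(strict, lenient)  # 1 2
```

Under these assumptions Lenient EM can never be lower than Strict EM, which matches every row of the table.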

Two observations stand out:

Group A vs. DeBERTa v2: CharBERT Group A (4464 Lenient EM) outperforms DeBERTa v2 (4255) by 209 domains, despite training on the same data and using no pretrained weights. This suggests that subword-pretrained models are not automatically better for this task. In this setup, DeBERTa’s tokenizer appears poorly matched to character-level boundary prediction: it can merge characters across word boundaries, making some splits difficult or impossible to recover cleanly.

Group B vs. DKSplit: CharBERT Group B (4563 Lenient EM) exceeds DKSplit v0.3.1 (4521) by 42 domains, or 0.84 percentage points on the 5,000-sample benchmark. The difference is modest.
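The tokenizer mismatch described in the first observation can be illustrated with a small check: a word boundary is only predictable at token level if it coincides with a token edge. The token split below is hypothetical and hand-written for illustration; it is not actual DeBERTa tokenizer output, and the function names are ours.

```python
def boundary_offsets(pieces):
    """Character offsets at which one piece ends and the next begins."""
    offsets, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        offsets.add(pos)
    return offsets


def unreachable_boundaries(gold_words, tokens):
    """Gold word boundaries that fall inside a token rather than on its edge."""
    if "".join(gold_words) != "".join(tokens):
        raise ValueError("gold words and tokens must spell the same string")
    return sorted(boundary_offsets(gold_words) - boundary_offsets(tokens))


gold = ["now", "here"]
subword_tokens = ["no", "where"]      # hypothetical subword split
char_tokens = list("nowhere")         # character-level input

print(unreachable_boundaries(gold, subword_tokens))  # [3]
print(unreachable_boundaries(gold, char_tokens))     # []
```

With character-level input every offset is a token edge, so no gold boundary is ruled out before the model makes a prediction.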

Conclusion

CharBERT Group B slightly outperforms DKSplit’s production model on both Strict and Lenient EM. In this setup, a randomly initialized Transformer + CRF reaches comparable accuracy but does not demonstrate a clear enough advantage to justify replacing the production model. We do not plan to release CharBERT.

Since an architecture change alone does not produce a meaningful accuracy improvement, many of the remaining errors appear to require knowledge that neither architecture can reliably learn from character patterns alone.

ByT5-CRF: Early Results

ByT5-Small (Google’s byte-level T5) + CRF is currently training on Leonardo to test whether byte-level pretraining helps this task. Training is still in progress, but we can share preliminary results from the best checkpoint so far (epoch 19 of 20).

Preliminary Results (benchmark_5000[^1])

Model	Strict EM (correct of 5,000)	Lenient EM (correct of 5,000)
DKSplit v0.3.1 (production)[^2]	4343	4521
CharBERT Group B (no pretrain)	4402	4563
ByT5-CRF (epoch 19, preliminary)	4453	4610
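Since the table reports raw counts, a short conversion to percentages and to gains over the production model may help when comparing rows. The counts are copied from the table above; the helper name and output layout are illustrative choices.

```python
BENCHMARK_SIZE = 5000
BASELINE = "DKSplit v0.3.1"

# (strict, lenient) correct counts from the table above.
counts = {
    "DKSplit v0.3.1": (4343, 4521),
    "CharBERT Group B": (4402, 4563),
    "ByT5-CRF (epoch 19)": (4453, 4610),
}


def summarize(counts, metric_index, baseline=BASELINE, total=BENCHMARK_SIZE):
    """Accuracy in percent and gain over the baseline for one metric column."""
    base = counts[baseline][metric_index]
    rows = {}
    for model, values in counts.items():
        correct = values[metric_index]
        rows[model] = {
            "accuracy_pct": round(100 * correct / total, 2),
            "gain": correct - base,
            "gain_pp": round(100 * (correct - base) / total, 2),
        }
    return rows


for name, index in (("Strict EM", 0), ("Lenient EM", 1)):
    print(name)
    for model, row in summarize(counts, index).items():
        print(f"  {model}: {row['accuracy_pct']}% "
              f"({row['gain']:+d} domains, {row['gain_pp']:+.2f} pp)")
```

On Lenient EM this gives 90.42%, 91.26%, and 92.2% for the three models, so the preliminary ByT5-CRF gain over DKSplit is 1.78 percentage points.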

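ByT5-CRF reduces errors in all three categories compared to DKSplit v0.3.1, with the largest relative reductions in wrong-boundary and under-split errors. Total errors equal 5,000 minus the Lenient EM count: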

Error type	ByT5-CRF (epoch 19)	DKSplit v0.3.1
Total errors	390	479
Over-split	176	198
Under-split	142	184
Wrong boundary	72	97
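The post does not define the three error categories precisely, so the following is only one plausible reading, sketched by comparing predicted and gold boundary positions: extra boundaries only is an over-split, missing boundaries only is an under-split, and a mix of both is a wrong boundary. These definitions and all names below are assumptions and may differ from the rules used to produce the table.

```python
def boundaries(segmentation):
    """Character offsets of the splits in a space-separated segmentation."""
    offsets, pos = set(), 0
    for word in segmentation.split()[:-1]:
        pos += len(word)
        offsets.add(pos)
    return offsets


def classify_error(gold, pred):
    """Label a prediction under the assumed category definitions."""
    gold_b, pred_b = boundaries(gold), boundaries(pred)
    if gold_b == pred_b:
        return "correct"
    extra = pred_b - gold_b      # splits the model added
    missing = gold_b - pred_b    # splits the model left out
    if extra and not missing:
        return "over-split"
    if missing and not extra:
        return "under-split"
    return "wrong boundary"


# Hypothetical examples, not taken from benchmark_5000.
print(classify_error("green apple", "green app le"))  # over-split
print(classify_error("green apple", "greenapple"))    # under-split
print(classify_error("now here", "no where"))         # wrong boundary
```

Under this reading the categories are mutually exclusive, which is consistent with the per-category counts in the table summing to the total for each model.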

These are preliminary numbers from an incomplete training run. Training is ongoing and further experiments are planned — full results and analysis will be covered in a future update.

Looking Ahead

Across CharBERT and ByT5-CRF, a pattern emerges: accuracy gains over DKSplit’s production model are real but modest. ByT5-CRF’s gain of 89 correct domains on Lenient EM (4610 vs. 4521, about 1.8 percentage points) comes from a model with roughly 12× the parameters of DKSplit’s BiLSTM-CRF. Whether this trade-off justifies the increased computational cost in production remains an open question. Unlike generative LLMs, however, ByT5-CRF is a sequence labeling model that can follow the same ONNX optimization and quantization pipeline as the current production model. We will continue to explore this direction.

DKSplit on EuroHPC Series

Models tested on a 5,000-sample multi-method audited benchmark (benchmark_5000), which does not fully cover all real-world scenarios. Lenient EM accepts matches against truth or might_right. NET = rescue – damage, where rescue means the selector corrects a BiLSTM error and damage means it introduces one. All counts are on the full 5,000-sample set unless noted. This is an engineering evaluation. See our midterm report for benchmark methodology.

This work uses models from the Qwen 3.5 and Qwen 3.6 families (Qwen Team, Alibaba Cloud, Apache 2.0), Gemma 4 (Google, Apache 2.0), DeBERTa-V3 (Microsoft, MIT), CharBERT (El Boukkouri et al., Apache 2.0), CANINE (Google, Apache 2.0), and ByT5 (Google Research, Apache 2.0). Training data scoring used DeepSeek V4 Flash (DeepSeek), Gemini 3.1 Flash Lite (Google), and Claude Sonnet 4.6 and Claude Opus 4.6 (Anthropic).

We acknowledge the European High Performance Computing Joint Undertaking (EuroHPC JU) for awarding this project (EHPC-AIF-2026PG01-281) access to the Leonardo supercomputer, hosted by CINECA in Italy.

Co-funded by the European Union. Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European High Performance Computing Joint Undertaking.
