DKSplit on EuroHPC: Unlocking a 4B Model’s Knowledge Through Chain-of-Thought

DKSplit on EuroHPC Series #6

In our previous posts, we ran experiments across multiple architectures on EuroHPC Leonardo: BiLSTM-CRF, DeBERTa-V3, CANINE, CharBERT, ByT5-CRF, and generative LLMs. Each architecture brought incremental improvements, but the remaining errors cluster in the same places across every model: brand names, multilingual compounds, domain-specific coinages. Cases that need world knowledge, not better pattern matching.

This post is about a change in approach. Instead of building a better segmenter, we asked: can we use the LLM’s world knowledge not to split domains, but to pick the best split from existing candidates?

K-best reranking: a standard idea, applied here

BiLSTM-CRF naturally supports k-best decoding. We switched our latest trained BiLSTM from top-1 to top-3 output and measured oracle coverage:

| Output mode | Correct answers | Coverage |
|---|---|---|
| Top-1 | 4,645 | 92.9% |
| Top-3 oracle | 4,899 | 98.0% |

On our benchmark, 98% oracle coverage means the correct answer is almost always in the top 3. If we shift perspective, the problem changes from “how to split” to “how to pick.” Picking the best from three candidates is a simple task. Simple enough that even a small LLM might be able to do it.
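Oracle coverage is easy to measure once the k-best lists are available. The sketch below is illustrative only: the function names, the normalization step, and the toy data are assumptions, not the benchmark code used in this series.

```python
from __future__ import annotations

from typing import Sequence


def normalize(segmentation: str) -> str:
    """Lowercase and collapse whitespace so spacing quirks do not affect matching."""
    return " ".join(segmentation.lower().split())


def oracle_coverage(
    candidates: Sequence[Sequence[str]],
    accepted: Sequence[set[str]],
    k: int,
) -> float:
    """Fraction of items whose first k candidates contain an accepted answer."""
    if len(candidates) != len(accepted):
        raise ValueError("candidates and accepted must have the same length")
    hits = 0
    for cands, ok in zip(candidates, accepted):
        ok_norm = {normalize(a) for a in ok}
        if any(normalize(c) in ok_norm for c in cands[:k]):
            hits += 1
    return hits / len(accepted) if accepted else 0.0


# Toy data (hypothetical): three domains with their top-3 candidates.
toy_candidates = [
    ["blue sky app", "bluesky app", "blues ky app"],
    ["red car", "redcar", "re dcar"],
    ["sun set", "suns et", "s unset"],
]
toy_accepted = [{"bluesky app"}, {"red car"}, {"sunset"}]

top1 = oracle_coverage(toy_candidates, toy_accepted, k=1)
top3 = oracle_coverage(toy_candidates, toy_accepted, k=3)
print(f"top-1: {top1:.1%}, top-3 oracle: {top3:.1%}")
```

The gap between the two numbers is the headroom a perfect selector could recover. On the benchmark that gap is 4,899 - 4,645 = 254 domains.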

What happens if the LLM is used to select instead of segment?

In our earlier experiments, LLMs were asked to generate segmentations directly, and they struggled. A fine-tuned 9B model reached 90.1% but introduced character mutations and over-segmentation. The world knowledge was there, but the generative format corrupted its delivery.

What happens when we switch their role from generation to selection?
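One structural reason selection can avoid character mutations: the LLM only has to return an index, and the final string is copied from a candidate that already preserves every character of the domain. A minimal sketch, assuming candidates are space-separated strings. `is_valid_split` and `pick` are hypothetical helper names, not part of our pipeline.

```python
from typing import Sequence


def is_valid_split(domain: str, segmentation: str) -> bool:
    """True if removing the spaces reproduces the domain character for character."""
    return segmentation.replace(" ", "") == domain


def pick(domain: str, candidates: Sequence[str], choice_index: int) -> str:
    """Return the chosen candidate, falling back to top-1 on an invalid index.

    Every candidate is checked first, so whatever index the selector returns,
    the output cannot contain mutated characters.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    for cand in candidates:
        if not is_valid_split(domain, cand):
            raise ValueError(f"candidate {cand!r} does not spell {domain!r}")
    if 0 <= choice_index < len(candidates):
        return candidates[choice_index]
    return candidates[0]  # out-of-range answer: keep the baseline prediction


print(is_valid_split("chatgptlogin", "chatgpt login"))   # valid split
print(is_valid_split("chatgptlogin", "chatgtp login"))   # mutated characters
print(pick("chatgptlogin", ["chat gpt login", "chatgpt login"], 1))
```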

The 122B result. We gave Qwen3.5-122B-A10B (MoE, 10B active parameters) BiLSTM’s top-3 candidates for each of the 5,000 benchmark domains. A simple few-shot prompt, no chain-of-thought: just pick the best one.

| Configuration | Lenient EM | Rescue | Damage | NET |
|---|---|---|---|---|
| BiLSTM top-1 alone | 4,645/5,000 | | | |
| BiLSTM top-3 + 122B selection | 4,835/5,000 | 201 | 11 | +190 |

4,835 out of 5,000. The same LLM that struggled with generation excels at selection. It rescues 201 BiLSTM errors while introducing only 11 new ones, a damage rate of 0.2%. The world knowledge that caused character mutations in generative mode is now working precisely as intended: recognizing which candidate contains real words, real brands, real names.
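Rescue, damage, and NET can be computed by comparing the baseline and the selector against the accepted answers for each domain. The sketch below follows the definitions in the footnote of this post. The data layout and names (`SelectionReport`, `score_selector`) are assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence


@dataclass
class SelectionReport:
    rescue: int  # baseline wrong, selector right
    damage: int  # baseline right, selector wrong

    @property
    def net(self) -> int:
        return self.rescue - self.damage


def score_selector(
    baseline: Sequence[str],
    selected: Sequence[str],
    accepted: Sequence[set[str]],
) -> SelectionReport:
    """Count rescues and damages; `accepted` holds every answer lenient EM allows."""
    if not (len(baseline) == len(selected) == len(accepted)):
        raise ValueError("all inputs must have the same length")
    rescue = damage = 0
    for base, sel, ok in zip(baseline, selected, accepted):
        base_ok, sel_ok = base in ok, sel in ok
        if sel_ok and not base_ok:
            rescue += 1
        elif base_ok and not sel_ok:
            damage += 1
    return SelectionReport(rescue, damage)


# Toy data (hypothetical): one rescue, one damage, two unchanged outcomes.
toy_baseline = ["a b", "c d", "ef", "g h"]
toy_selected = ["a b", "cd", "e f", "gh"]
toy_accepted = [{"a b"}, {"cd"}, {"ef"}, {"g h", "gh"}]
report = score_selector(toy_baseline, toy_selected, toy_accepted)
print(report, report.net)

# Consistency check on the reported numbers: top-1 correct + NET = final correct.
assert 4645 + (201 - 11) == 4835
```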

But 122B is not a production model. For a pipeline processing hundreds of thousands of domains daily, it is too slow and too expensive.

How small can we go? We tested 4 Qwen models with the same prompt and candidates (the 35B-A3B is from the Qwen3.6 release, included as an additional test):

| Model | Rescue | Damage | NET |
|---|---|---|---|
| Qwen3.5-2B | 145 | 1,582 | -1,437 |
| Qwen3.5-9B | 102 | 56 | +46 |
| Qwen3.6-35B-A3B | 181 | 110 | +71 |
| Qwen3.5-122B-A10B | 201 | 11 | +190 |

The rescue and damage columns reveal what is actually happening. The 2B model rescues 145 BiLSTM errors. It has world knowledge. But it also damages 1,582 correct answers, far more than it fixes. The 9B model rescues fewer (102) but damages only 56, enough restraint to turn NET positive. The 122B model does both: highest rescue (201) and lowest damage (11).

Between the 2B and 9B, Qwen also offers a 4B model. We decided to use it as our baseline training experiment: small enough to iterate quickly, large enough that it might carry useful world knowledge.

Teaching a 4B Model Through Comparative CoT

We trained Qwen3.5-4B with LoRA using Comparative Chain-of-Thought data. The core idea: given a domain like chatgptlogin and its candidates (chat gpt login, chatgpt login, chatg pt login), which one is the best segmentation, and why? We generated 200,000 such comparisons with reasoning using a combination of DeepSeek V4 Flash, Gemini 3.1 Flash Lite, Claude Sonnet 4.6, and Claude Opus 4.6 as teacher models, quality-filtered the results, and used them as training data for the 4B model.
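To make the data format concrete, here is one way a comparative CoT record could be laid out. This is a sketch under assumptions: the prompt wording, the `Answer: <letter>` convention, and the JSON fields are hypothetical and are not the exact schema we trained on.

```python
from __future__ import annotations

import json
import re
from typing import Sequence

LETTERS = "ABC"  # assumption: at most three candidates per domain


def build_prompt(domain: str, candidates: Sequence[str]) -> str:
    """Render the domain and its lettered candidates as a selection prompt."""
    if not 1 <= len(candidates) <= len(LETTERS):
        raise ValueError("expected between 1 and 3 candidates")
    lines = [f"Domain: {domain}", "Candidates:"]
    lines += [f"{letter}. {cand}" for letter, cand in zip(LETTERS, candidates)]
    lines.append("Compare the candidates, explain briefly, then end with 'Answer: <letter>'.")
    return "\n".join(lines)


def build_record(domain: str, candidates: Sequence[str], reasoning: str, best_index: int) -> dict:
    """Pair the prompt with teacher reasoning followed by the final choice."""
    if not 0 <= best_index < len(candidates):
        raise ValueError("best_index is out of range")
    completion = f"{reasoning}\nAnswer: {LETTERS[best_index]}"
    return {"prompt": build_prompt(domain, candidates), "completion": completion}


def parse_choice(completion: str, n_candidates: int) -> int | None:
    """Extract the chosen index from a completion, or None if it is unusable."""
    match = re.search(r"Answer:\s*([A-C])\s*$", completion.strip())
    if match is None:
        return None
    index = LETTERS.index(match.group(1))
    return index if index < n_candidates else None


example = build_record(
    "tesamorelinreview",
    ["tesa morelin review", "tesamorelin review"],
    "Tesamorelin is a drug name; candidate A breaks it into fragments.",
    best_index=1,
)
print(json.dumps(example, ensure_ascii=False))
print(parse_choice(example["completion"], n_candidates=2))
```

Returning `None` for an unparseable or out-of-range answer lets the caller fall back to the BiLSTM top-1 instead of guessing.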

We had already run smaller experiments at 10K and 50K scale. Each row below shows the best epoch for that dataset size:

| Training data | Rescue | Damage | NET |
|---|---|---|---|
| Zero-shot | 167 | 249 | -82 |
| 10K | 180 | 230 | -50 |
| 50K | 189 | 195 | -6 |
| 200K | 193 | 146 | +47 |

200K is the first dataset size where NET turns positive for a 4B model. But positive NET does not mean the model is ready to run on all predictions. At 200K, the model makes 193 correct changes and 146 wrong ones. That is a ratio of roughly 4 correct interventions for every 3 wrong ones, far too noisy for unsupervised deployment. NET is a useful signal that the direction is right, not a claim that the model is finished. Three observations are worth discussing.

Data volume is the strongest lever. NET improves steadily from -82 to +47 across four data points. The gain comes almost entirely from damage reduction; rescue is approaching a plateau. Whether more data continues to reduce damage at this rate is an open question with only four data points to extrapolate from.

Training primarily reduces damage rather than increasing rescue. Rescue grows slowly from 167 to 193 (+26). Damage drops sharply from 249 to 146 (-103). Training is teaching the model when not to change an answer. In fact, 62% of remaining damage cases are the model choosing to keep the domain unsplit, when BiLSTM had correctly split it. The model has learned caution, perhaps too much: when uncertain, it defaults to “do not split” rather than risk a wrong boundary.
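The arithmetic behind these observations can be reproduced directly from the table. The snippet below uses only the reported rescue and damage counts. "Intervention precision" is a label introduced here for rescue / (rescue + damage); it is not a metric defined elsewhere in this series, and it ignores changes that turn one wrong answer into another.

```python
# (rescue, damage) per training-set size, copied from the table above.
RESULTS = {
    "Zero-shot": (167, 249),
    "10K": (180, 230),
    "50K": (189, 195),
    "200K": (193, 146),
}


def net(rescue: int, damage: int) -> int:
    """NET = rescue - damage."""
    return rescue - damage


def intervention_precision(rescue: int, damage: int) -> float:
    """Share of correctness-changing interventions that were rescues."""
    return rescue / (rescue + damage)


def decompose_gain(first: tuple, last: tuple) -> tuple:
    """Split the NET gain between two rows into (rescue gain, damage cut, total)."""
    rescue_gain = last[0] - first[0]
    damage_cut = first[1] - last[1]
    return rescue_gain, damage_cut, rescue_gain + damage_cut


for name, (r, d) in RESULTS.items():
    print(f"{name:>9}: NET {net(r, d):+d}, precision {intervention_precision(r, d):.1%}")

rescue_gain, damage_cut, total = decompose_gain(RESULTS["Zero-shot"], RESULTS["200K"])
print(f"NET gain {total}: {rescue_gain} from rescue, {damage_cut} from damage reduction")
print(f"damage reduction share: {damage_cut / total:.1%}")
```

From zero-shot to 200K, the NET gain of 129 splits into 26 from additional rescues and 103 from avoided damage, so roughly 80% of the improvement is damage reduction. At 200K, about 57% of interventions are correct, which matches the "4 correct for every 3 wrong" ratio.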

Training unlocks pretrained knowledge for this task.

| Input | BiLSTM (wrong) | 4B selected (correct) | Knowledge used | In training data? |
|---|---|---|---|---|
| tesamorelinreview | tesa morelin review | tesamorelin review | Pharmaceutical drug name | No |
| digitalpflegezentrum | digital pflege zentrum | digital pflegezentrum | German compound noun | No |
| banglalinkennovators | bangla link ennovators | banglalink ennovators | Bangladeshi telecom brand | No |
| quefaireenperigordnoir | que faire en peri gord noir | que faire en perigord noir | French region name | No |
| fawaidalqulub | fawaidal qulub | fawaid al qulub | Arabic phrase | No |

The 4B model’s reasoning is explicit in its chain-of-thought output. For tesamorelinreview, it writes: “Tesamorelin is a well-known pharmaceutical brand name” and “Splits the brand name into ‘tesa’ and ‘morelin’, which are not the intended components of the drug name.” For digitalpflegezentrum: “Pflegezentrum is a common, meaningful German compound noun; splitting it is less natural.” We also found that the model correctly identifies proper names from various cultures, keeping them intact rather than splitting them into fragments.

None of these words appear in the 200K training set. The knowledge itself comes from pretraining. What the CoT training provided is the interface: how to apply that knowledge when comparing candidates, when to trust it, and when to hold back. The 200K examples did not teach the model what “Pflegezentrum” means. They taught it that when one candidate preserves a recognized compound and another breaks it apart, the first is usually better.

Are the damage cases really damage? We looked at the 146 cases our benchmark counts as damage. In 78% of them, the model keeps a compound intact, claiming it is a brand name. Spot-checking suggests many are plausible: currenlinksystem becomes currenlink system, stormbreakstudios becomes stormbreak studios. Neither word appears in the training data. The model is reasoning about which compounds look intentional, not retrieving memorized brands.

Our benchmark has been through multiple rounds of human and LLM-assisted review, but newly coined brand names remain a blind spot for any annotator. What matters is that these corrections come from the model’s pretrained knowledge, not from the training data. This is exactly what we set out to find in the first post of this series: a way to use the LLM’s world knowledge for segmentation. The selection format, combined with CoT training, finally delivers it.

What Comes Next

On a simple selection task like this, seeing a 4B model produce positive and improving results is very encouraging for us. Across four data points, NET moved from -82 to +47, and the model’s rescue count (193) approaches the 122B result (201). Whether this trend continues with more data or plateaus soon is something we cannot predict from four points alone. The reported damage of 146 may also overstate the true error rate, as we discussed above. This is our first training round on this task, and we are satisfied with the direction.

What we are building toward is an agentic workflow for domain registration analysis. The selector is one component: BiLSTM generates candidates, a trained LLM picks the best one. But running this selector on every domain is inefficient when the baseline is already correct 93% of the time. What if we add an even simpler gate before the selector, one that decides which predictions need a second look and which can pass through unchanged? 
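One possible shape for such a gate, sketched under loud assumptions: we have not built or evaluated this, and the use of a score margin between the top two candidates, the `min_margin` threshold, and all names below are hypothetical. The selector is passed in as a plain function so the sketch stays independent of any particular model.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Candidate:
    segmentation: str
    score: float  # hypothetical decoder score, higher is better


def needs_second_look(cands: Sequence[Candidate], min_margin: float = 1.0) -> bool:
    """Flag a prediction when the top two candidates score close together.

    Assumes `cands` is sorted by score, best first.
    """
    if len(cands) < 2:
        return False
    return cands[0].score - cands[1].score < min_margin


def segment(
    cands: Sequence[Candidate],
    selector: Callable[[Sequence[str]], int],
    min_margin: float = 1.0,
) -> str:
    """Pass confident predictions through; send uncertain ones to the selector."""
    if not cands:
        raise ValueError("at least one candidate is required")
    if needs_second_look(cands, min_margin):
        index = selector([c.segmentation for c in cands])
        if 0 <= index < len(cands):
            return cands[index].segmentation
    return cands[0].segmentation  # gate closed or unusable answer: keep top-1


# Toy usage with a stub selector that always prefers the second candidate.
close_call = [Candidate("tesa morelin review", -1.0), Candidate("tesamorelin review", -1.2)]
clear_call = [Candidate("red car", -1.0), Candidate("redcar", -5.0)]
print(segment(close_call, selector=lambda options: 1))
print(segment(clear_call, selector=lambda options: 1))
```

With a gate like this, the selector's damage is limited to the flagged subset, at the cost of missing any rescues among predictions the gate lets through.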

Beyond segmentation, many domain names carry intent signals on their own. During CoT data construction, we noticed the teacher models already identifying languages, flagging brand names, and categorizing niches as part of their reasoning. With the right training, a small model might extract these signals directly from the domain string, adding useful dimensions for domain intelligence beyond just where the word boundaries are. In our recent analysis of 38 million .com domains, keyword trend detection relied on segmentation as input. Deeper reasoning about each domain’s intent could make that kind of analysis significantly more precise.


DKSplit on EuroHPC Series

  1. A Two-Week Journey on EuroHPC Leonardo
  2. DKSplit Update: Cleaner Benchmark, First DeBERTa Run, Different Failure Modes
  3. Searching for a Teacher Model Across Architectures
  4. From Domain Segmentation to Reading Domain Signals
  5. CharBERT and ByT5-CRF
  6. From Splitting Domains to Picking the Best Split (this post)

Models tested on a 5,000-sample multi-method audited benchmark (benchmark_5000), which does not fully cover all real-world scenarios. Lenient EM accepts matches against truth or might_right. NET = rescue – damage, where rescue means the selector corrects a BiLSTM error and damage means it introduces one. All counts are on the full 5,000-sample set unless noted. This is an engineering evaluation. See our midterm report for benchmark methodology.

This work uses models from the Qwen family: Qwen3.5-2B, Qwen3.5-4B, Qwen3.5-9B, and Qwen3.5-122B-A10B (Qwen Team, Alibaba Cloud, Apache 2.0). Training data scoring used DeepSeek V4 Flash, Gemini 3.1 Flash Lite, Claude Sonnet 4.6 and Claude Opus 4.6.


We acknowledge the European High Performance Computing Joint Undertaking (EuroHPC JU) for awarding this project access to the Leonardo supercomputer, hosted by CINECA in Italy.

Co-funded by the European Union. Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European High Performance Computing Joint Undertaking.