FRIDAY · 01 MAY 2026

Michael English

Clonmel · Co. Tipperary · Ireland
Open source · Topic 04

The Chinese open-source AI stack — a practical guide

DeepSeek, Qwen, GLM, MiniMax, Kimi, Yi. The open-source side of the AI market in 2026 is overwhelmingly Chinese. Here is how to use it sensibly without giving up anything that matters.

Most Western teams in 2026 still treat the Chinese open-source model ecosystem as either an academic curiosity or a vague compliance risk. Both framings cost teams money. The reality on the ground is that the open-source frontier — the models you can pull, run on your own hardware, and put into production with no per-call fees — is overwhelmingly Chinese. If you are not running at least one of these models for at least one workload, you are paying frontier-API prices for work that does not need a frontier API.

This is a practical guide. We will not argue about geopolitics. The question is which model you should be reaching for next Wednesday morning when an engineer asks where the workload should run.

The lineage tree, briefly

The open-source side has consolidated around a handful of model families. The lineage matters because the architecture decisions and the training-data choices propagate down the tree, and the failure modes are inherited.

When each one is the right choice

Below is the cheat-sheet we use at IMPT.io. It is opinionated, and your mileage will vary, but it should accelerate your first three weeks.

Where they aren't the right choice

Three honest caveats. First, the absolute frontier of capability still belongs to the Western models — Claude 4.7 Opus, GPT-class flagship, and a handful of others — for the genuinely hard tasks where capability per token is the constraint. If the work is “reason about a novel architecture under uncertainty,” the frontier wins. Second, the safety profiles of the open-weight models are different. They will refuse different things and accept different things, and your evaluation harness needs to test for what your specific use case requires. Third, fine-tuning quality across the families is uneven. The mature ones (Qwen, DeepSeek) are easy to fine-tune well; some of the others are not, and you'll spend more time on the tuning loop than expected.

Where they run

The whole point is that you can run them anywhere. Three sensible deployment patterns:

1. On your own GPU box, on your own premises

The cheapest pattern at volume and the one we use most heavily at IMPT.io. A single mid-tier inference server can serve a 70B-class model at production latencies for a small team. The capex pays back inside a year if the workload is non-trivial.
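That payback claim is easy to sanity-check with a back-of-envelope model. Every number below is a placeholder you should replace with your own quotes; the point is the shape of the sum, not the specific figures.

```python
# Back-of-envelope payback model for an on-prem inference server.
# All figures are hypothetical assumptions, not vendor prices.
server_capex_eur = 40_000          # mid-tier inference server, one-off
api_price_per_mtok_eur = 3.0       # blended per-million-token API price
tokens_per_day = 50_000_000        # a non-trivial production workload
hosting_opex_per_month_eur = 500   # power, rack space, maintenance

# What the same volume would cost through a metered API each month.
api_cost_per_month = tokens_per_day * 30 / 1e6 * api_price_per_mtok_eur

# Monthly saving from self-hosting, and months until the capex is recovered.
saving_per_month = api_cost_per_month - hosting_opex_per_month_eur
payback_months = server_capex_eur / saving_per_month

print(f"API cost/month: EUR {api_cost_per_month:,.0f}")
print(f"Payback: {payback_months:.1f} months")
```

With these assumed numbers the server pays for itself in ten months; halve the token volume and it stretches to roughly two years, which is why "non-trivial workload" is doing real work in the sentence above.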

2. On a European GPU cloud

If you don't want to own hardware, several European providers (Scaleway, Hetzner with H100s, OVHcloud) host these models behind well-priced inference endpoints. You keep workloads in EU jurisdiction. We use this for spillover.

3. Behind a managed API in Europe or the US

Together AI, Fireworks, Groq, Cerebras and similar managed providers give you the cheap-per-token economics of the open models with no operational overhead. Your data leaves your network, but it goes to a US or EU jurisdiction of your choice — not the model's country of origin.

For most regulated EU workloads, pattern (1) or (2) is the answer. Pattern (3) is fine for non-sensitive work and faster to start with.

Data residency and the China question

The most common reason teams give for not using these models is data residency. The reason does not survive contact with how the models actually work. The model weights are open. You download them once. After that, every inference call happens on hardware you control or rent. None of your data goes to the country that produced the model. The models have no telemetry to phone home — they are static blobs of weights — and you can verify this by running them in a network-isolated environment, which is in fact how we run several of our most sensitive workloads.
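If you load the weights through the Hugging Face stack, you can enforce the no-network property at the library layer as well as at the firewall. The environment variables below are real knobs honoured by `huggingface_hub` and `transformers`; an egress-blocking firewall rule remains the actual guarantee, and this only covers the library layer.

```python
import os

# Offline flags for the Hugging Face stack: with these set, the libraries
# load from the local cache and raise rather than touch the network.
OFFLINE_ENV = {
    "HF_HUB_OFFLINE": "1",        # huggingface_hub: never contact the Hub
    "TRANSFORMERS_OFFLINE": "1",  # transformers: local cache only
}

def enforce_offline() -> dict:
    """Set the offline flags for this process and return what was set."""
    os.environ.update(OFFLINE_ENV)
    return {k: os.environ[k] for k in OFFLINE_ENV}
```

Call `enforce_offline()` before any model-loading code runs (or export the variables in the service's unit file), then confirm at the network layer that the host genuinely has no route out.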

The legitimate concerns are about training data and licence terms. Read the licences. Most of the major open-weights models from these labs ship under permissive commercial licences; a couple have non-commercial restrictions or revenue thresholds. None of this is exotic, but it does need a five-minute legal review per family. We will do that review in the workshop.

What we'll do in the workshop

Day 2 morning of the Clonmel workshop is hands-on. We will pull DeepSeek, Qwen, and Yi onto laptops and onto a shared GPU instance, run identical workloads through each, and look at the outputs side-by-side. By lunchtime everyone in the room will have a working open-source inference setup on their machine and a clear sense of which family fits which slot in their existing stack.
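The side-by-side exercise needs nothing fancier than this shape of harness. `run` below is a stand-in you would replace with your actual inference client; the family names and the prompt are illustrative.

```python
FAMILIES = ["deepseek", "qwen", "yi"]

def run(family: str, prompt: str) -> str:
    """Stand-in for a real inference call; replace with your own client."""
    return f"<{family} answer to {prompt!r}>"

def side_by_side(results: dict) -> str:
    """Format {family: answer} into an aligned comparison block."""
    width = max(len(name) for name in results)
    return "\n".join(f"{name.ljust(width)} | {answer}"
                     for name, answer in results.items())

prompt = "Extract the total from: 'Invoice 204, total EUR 1,250.00'"
print(side_by_side({f: run(f, prompt) for f in FAMILIES}))
```

Running the same prompt through each family and reading the answers in one block is most of the exercise; the rest is swapping prompts until the differences between families stop surprising you.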


Reserve a seat in Clonmel

This topic is one of seven covered in the AI Brain workshops. Two open weekends — 25–26 July & 29–30 August 2026. Free admission. All welcome.

Register