Chinese LLM API Evaluation Checklist

Evaluate Chinese LLM APIs on real product tasks, not on benchmark scores alone. Work through this checklist before sending production traffic.

Quality

Test correctness, tone, formatting, reasoning, hallucination risk, and refusal behavior.
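
One way to make these checks repeatable is a small pass-rate harness. The sketch below is illustrative only: `Case`, `pass_rate`, and the `call_model` callable are hypothetical names, and `call_model` stands in for whatever client wrapper you use to send a prompt and get text back.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Case:
    """One product task: a prompt plus a check on the reply (hypothetical type)."""
    prompt: str
    check: Callable[[str], bool]  # returns True when the reply is acceptable


def pass_rate(cases: Sequence[Case], call_model: Callable[[str], str]) -> float:
    """Return the fraction of cases whose reply passes its check."""
    if not cases:
        raise ValueError("no test cases")
    passed = sum(1 for case in cases if case.check(call_model(case.prompt)))
    return passed / len(cases)
```

Each check can target one item from the list: an exact-match check for correctness, a format check for formatting, or a keyword check for refusal behavior. Running the same cases against every vendor keeps the comparison fair.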

Compatibility

Verify OpenAI-compatible SDK usage, streaming, structured output, tool calling, and error formats.
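
Structured output is the easiest of these to check offline. As a rough sketch, assuming the model was asked to reply with a single JSON object, the hypothetical helper `schema_problems` below reports why a reply would fail to parse downstream. It uses only the standard library and is not a full JSON Schema validator.

```python
import json
from typing import Dict, List


def schema_problems(raw: str, required: Dict[str, type]) -> List[str]:
    """Return a list of problems with a JSON reply; an empty list means it passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for key, expected in required.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], expected):
            problems.append(f"wrong type for {key}: {type(data[key]).__name__}")
    return problems
```

A reply such as `Sure! {"city": "Hangzhou"}` fails the first check because of the leading prose, which is a common failure worth counting per vendor.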

Cost

Measure input tokens, output tokens, retries, long-context usage, and cost per successful task.
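
Cost per successful task divides total spend, including failed attempts and retries, by the number of tasks that succeeded. A minimal sketch follows, assuming prices are quoted per million tokens and that each `Attempt` is one API call. The type, the function name, and any prices you pass in are placeholders, not vendor figures.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Attempt:
    """One API call; retries of the same task are separate attempts."""
    input_tokens: int
    output_tokens: int
    succeeded: bool  # True only for the call that completed a task


def cost_per_successful_task(
    attempts: Sequence[Attempt],
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Total spend across all attempts divided by the number of successes."""
    total = sum(
        a.input_tokens * input_price_per_m + a.output_tokens * output_price_per_m
        for a in attempts
    ) / 1_000_000
    successes = sum(1 for a in attempts if a.succeeded)
    if successes == 0:
        return float("inf")  # money was spent but nothing succeeded
    return total / successes
```

Because failed calls still count toward the total, a cheaper model that needs frequent retries can come out more expensive than a pricier one that succeeds on the first try.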

Latency

Test from your production region and measure time to first token.
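
Time to first token can be measured by starting a clock before the request is issued and stopping it at the first non-empty chunk. The sketch below assumes a hypothetical `start_stream` callable that sends the request and returns an iterable of text chunks. How you adapt a particular SDK's streaming response into plain text chunks is left to the caller.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    start_stream: Callable[[], Iterable[str]],
) -> Tuple[Optional[float], float, str]:
    """Return (time to first token, total time, full text), times in seconds."""
    start = time.perf_counter()  # clock starts before the request is sent
    ttft = None
    parts = []
    for chunk in start_stream():
        if ttft is None and chunk:
            # First non-empty chunk marks the time to first token.
            ttft = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    return ttft, total, "".join(parts)
```

Run it from a host in your production region, repeat it enough times to look at percentiles rather than a single sample, and record the total time alongside the time to first token.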

Compliance

Document vendors, data flows, retention policies, and customer controls.

The best Chinese LLM API depends on the workload. Evaluate DeepSeek, Qwen, Kimi, MiniMax, GLM, and Doubao against the same test set, then route each task type to the model that performs best on it.
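
Routing by workload can be as simple as a lookup table built from your own evaluation results. The sketch below assumes you have already recorded a score, such as a pass rate, for each model in each task category. `build_routes` is a hypothetical helper, and the scores you feed it must come from your own measurements.

```python
from typing import Dict


def build_routes(results: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each task category to its highest-scoring model.

    results[category][model] is a score from your own test set.
    Categories with no scores are skipped; ties go to the first model listed.
    """
    return {
        category: max(scores, key=scores.get)
        for category, scores in results.items()
        if scores
    }
```

In practice the score would likely combine quality with cost and latency rather than use pass rate alone, and the table should be rebuilt whenever a vendor ships a new model version.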