尝试计算厂商做 inference 的成本比开源方案的成本便宜多少

为了简单，先限定框架和模型：现在大规模 inference 框架主要就 SGLang 和 vLLM，我们以 SGLang 为例；模型我们就用 DeepSeek-V3 为例。

从 SGLang 的 issue 区我们可以获得这些信息

部署 DeepSeek-V3 大概需要 H200 * 8
H200 * 8 可以在 concurrency = 32 的情况下达到大概 1k token/s

下面是具体计算

https://github.com/sgl-project/sglang/issues/3196#issuecomment-2645916906 给出了一个 H200 * 8 随着 concurrency 变化的 throughput 测试结果

表中的 token/s 应该是每个用户的，如果去计算整体的 token/s 可以得到

可以看到非常的 scale，线性增长趋势甚至没有减弱的倾向。

这说明，在 concurrency 足够的情况下，throughput 可超过 1k token/s. 但是注意，这是在 input prompt 很短的情况下，实际上如果用较长的 prompt，SGLang 的 throughput 肯定会大幅度降低。

https://github.com/sgl-project/sglang/issues/3196#issuecomment-2644717500

As above, I focused on H200 because MI300X was just too slow. I’ve moved on. H2008 with R1 or V3 get about 30-35 tokens/sec, so not bad. Definitely looking forward to sglang pushing frontier of these models getting closer to first party API of 60 tokens/sec. On H2008 it can reach 50 token/s with torch compile

我们再考虑到这里说的 torch compile 加速，以及 concurrency 并没有拉满，实际整体吞吐量会更高。

总之，我们先非常粗糙的基于 1k token/s 来算，这样开源方案的成本就是 3.6M token/h，参考 https://www.apetops.com/ H200 * 8 机器报价为 120000 元/月，也就是 166 元 / 小时，也就是每 M tokens 要 166 / 3.6 = 46 元

而 DeepSeek API 的报价是 8 元 / M tokens。

如果我们认为这就是成本价了，则他们内部做 inference 的成本比上面计算的开源方案 (一个 nobody 去租服务器 + 使用开源方案 + concurrency 能保持在 32 以上)，还要便宜 5 倍左右。

那这个差距我感觉已经是一个比较能看得清摸得着的距离了。厂商有太多 card 可以 play 了，性价比拉个 5 倍很轻松。这也说明现在的开源方案在 MLSys 这一 research community 的推动下已经很强了。