{"exhaustive":{"nbHits":true,"typo":true},"exhaustiveNbHits":true,"exhaustiveTypo":true,"hits":[{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"wskwon"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"<em>vLLM</em>: Easy, Fast, and Cheap LLM Serving with PagedAttention"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"https://<em>vllm</em>.ai/"}},"_tags":["story","author_wskwon","story_36409082"],"author":"wskwon","children":[36409083,36409422,36410971,36410974,36411090,36411111,36411257,36411375,36411560,36411717,36411778,36412091,36412199,36413305,36413909],"created_at":"2023-06-20T19:17:32Z","created_at_i":1687288652,"num_comments":42,"objectID":"36409082","points":295,"story_id":36409082,"title":"vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention","updated_at":"2026-02-11T02:46:10Z","url":"https://vllm.ai/"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"yz-yu"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"Nano-<em>vLLM</em>: How a <em>vLLM</em>-style inference engine works"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"https://neutree.ai/blog/nano-<em>vllm</em>-part-1"}},"_tags":["story","author_yz-yu","story_46855447"],"author":"yz-yu","children":[46856317,46857253,46859173,46860227,46869493,46869877],"created_at":"2026-02-02T12:52:35Z","created_at_i":1770036755,"num_comments":27,"objectID":"46855447","points":271,"story_id":46855447,"title":"Nano-vLLM: How a vLLM-style inference engine works","updated_at":"2026-03-06T03:02:42Z","url":"https://neutree.ai/blog/nano-vllm-part-1"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"yu3zhou4"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"Show HN: Tiny-<em>vLLM</em> \u2013 high performance LLM inference engine in C++ and CUDA"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"https://github.com/jmaczan/tiny-<em>vllm</em>"}},"_tags":["story","author_yu3zhou4","story_48328184","show_hn"],"author":"yu3zhou4","children":[48328913,48328953,48329707,48329980,48329982,48330003,48330136,48331721,48332029,48332174,48333392,48333867,48334472,48337813,48342660,48343580,48348971,48363579],"created_at":"2026-05-29T19:38:27Z","created_at_i":1780083507,"num_comments":18,"objectID":"48328184","points":205,"story_id":48328184,"title":"Show HN: Tiny-vLLM \u2013 high performance LLM inference engine in C++ and CUDA","updated_at":"2026-06-10T17:50:27Z","url":"https://github.com/jmaczan/tiny-vllm"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"samaysharma"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"Life of an inference request (<em>vLLM</em> V1): How LLMs are served efficiently at scale"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"https://www.ubicloud.com/blog/life-of-an-inference-request-<em>vllm</em>-v1"}},"_tags":["story","author_samaysharma","story_44407058"],"author":"samaysharma","children":[44408397,44409281,44409637,44410063,44410575],"created_at":"2025-06-28T18:42:05Z","created_at_i":1751136125,"num_comments":21,"objectID":"44407058","points":175,"story_id":44407058,"title":"Life of an inference request (vLLM V1): How LLMs are served efficiently at scale","updated_at":"2025-11-05T05:23:46Z","url":"https://www.ubicloud.com/blog/life-of-an-inference-request-vllm-v1"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"robertnishihara"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"<em>vLLM</em> large scale serving: DeepSeek 2.2k tok/s/h200 with wide-ep"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"https://blog.<em>vllm</em>.ai/2025/12/17/large-scale-serving.html"}},"_tags":["story","author_robertnishihara","story_46602737"],"author":"robertnishihara","children":[46611750,46611762,46612042,46612045,46612532,46613438,46613887,46614079,46614173,46615087],"created_at":"2026-01-13T15:59:59Z","created_at_i":1768319999,"num_comments":54,"objectID":"46602737","points":147,"story_id":46602737,"title":"vLLM large scale serving: DeepSeek 2.2k tok/s/h200 with wide-ep","updated_at":"2026-03-05T23:22:30Z","url":"https://blog.vllm.ai/2025/12/17/large-scale-serving.html"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"theanonymousone"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"KVarN: Native <em>vLLM</em> backend for KV-cache quantization by Huawei"},"url":{"matchLevel":"none","matchedWords":[],"value":"https://github.com/huawei-csl/KVarN"}},"_tags":["story","author_theanonymousone","story_48399974"],"author":"theanonymousone","children":[48400484,48400498,48401659,48405168,48409738,48411026,48413673],"created_at":"2026-06-04T15:18:00Z","created_at_i":1780586280,"num_comments":16,"objectID":"48399974","points":143,"story_id":48399974,"title":"KVarN: Native vLLM backend for KV-cache quantization by Huawei","updated_at":"2026-06-07T02:03:25Z","url":"https://github.com/huawei-csl/KVarN"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"simonpure"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"Nano-<em>Vllm</em>: Lightweight <em>vLLM</em> implementation built from scratch"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"https://github.com/GeeeekExplorer/nano-<em>vllm</em>"}},"_tags":["story","author_simonpure","story_44352615"],"author":"simonpure","children":[44353614,44354322,44354480,44354627,44354666,44354802,44354872,44354978,44355487,44355956,44357159],"created_at":"2025-06-23T05:10:44Z","created_at_i":1750655444,"num_comments":16,"objectID":"44352615","points":125,"story_id":44352615,"title":"Nano-Vllm: Lightweight vLLM implementation built from scratch","updated_at":"2025-11-05T05:43:16Z","url":"https://github.com/GeeeekExplorer/nano-vllm"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"berlianta"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"Eagle 3.1: Collaboration Between the EAGLE Team, <em>vLLM</em> Team, and TorchSpec Team"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"https://<em>vllm</em>.ai/blog/2026-05-26-eagle-3-1"}},"_tags":["story","author_berlianta","story_48278407"],"author":"berlianta","children":[48278814,48279042,48280011,48280355,48282914],"created_at":"2026-05-26T11:46:10Z","created_at_i":1779795970,"num_comments":24,"objectID":"48278407","points":69,"story_id":48278407,"title":"Eagle 3.1: Collaboration Between the EAGLE Team, vLLM Team, and TorchSpec Team","updated_at":"2026-06-02T04:53:06Z","url":"https://vllm.ai/blog/2026-05-26-eagle-3-1"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"lukebechtel"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"Surpassing <em>vLLM</em> with a Generated Inference Stack"},"url":{"matchLevel":"none","matchedWords":[],"value":"https://infinity.inc/case-studies/qwen3-optimization"}},"_tags":["story","author_lukebechtel","story_47324364"],"author":"lukebechtel","children":[47327386,47327411,47328308,47329186,47330332,47332192,47333926,47334028],"created_at":"2026-03-10T15:12:52Z","created_at_i":1773155572,"num_comments":22,"objectID":"47324364","points":62,"story_id":47324364,"title":"Surpassing vLLM with a Generated Inference Stack","updated_at":"2026-03-15T00:29:45Z","url":"https://infinity.inc/case-studies/qwen3-optimization"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"waybarrios"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"<em>vLLM</em>-MLX \u2013 Run LLMs on Mac at 464 tok/s"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"https://github.com/waybarrios/<em>vllm</em>-mlx"}},"_tags":["story","author_waybarrios","story_46642846"],"author":"waybarrios","children":[46642847],"created_at":"2026-01-16T03:58:21Z","created_at_i":1768535901,"num_comments":3,"objectID":"46642846","points":33,"story_id":46642846,"title":"vLLM-MLX \u2013 Run LLMs on Mac at 464 tok/s","updated_at":"2026-05-12T10:58:16Z","url":"https://github.com/waybarrios/vllm-mlx"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"jxmorris12"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"<em>VLLM</em>: Easy, Fast, and Cheap LLM Serving with PagedAttention"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"https://blog.<em>vllm</em>.ai/2023/06/20/<em>vllm</em>.html"}},"_tags":["story","author_jxmorris12","story_44446280"],"author":"jxmorris12","children":[44468079,44468106,44469560],"created_at":"2025-07-02T17:16:20Z","created_at_i":1751476580,"num_comments":5,"objectID":"44446280","points":20,"story_id":44446280,"title":"VLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention","updated_at":"2025-07-05T10:29:47Z","url":"https://blog.vllm.ai/2023/06/20/vllm.html"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"helloericsf"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"Is LMDeploy the Ultimate Solution? Why It Outshines <em>VLLM</em>, TRT-LLM, TGI, and MLC"},"url":{"matchLevel":"none","matchedWords":[],"value":"https://bentoml.com/blog/benchmarking-llm-inference-backends"}},"_tags":["story","author_helloericsf","story_40740065"],"author":"helloericsf","children":[40740092,40740252,40740260,40740508],"created_at":"2024-06-20T15:48:34Z","created_at_i":1718898514,"num_comments":8,"objectID":"40740065","points":16,"story_id":40740065,"title":"Is LMDeploy the Ultimate Solution? Why It Outshines VLLM, TRT-LLM, TGI, and MLC","updated_at":"2024-09-20T17:15:45Z","url":"https://bentoml.com/blog/benchmarking-llm-inference-backends"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"chaoyu"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"Benchmarking LLM Inference Back Ends: <em>VLLM</em>, LMDeploy, MLC-LLM, TensorRT-LLM, TGI"},"url":{"matchLevel":"none","matchedWords":[],"value":"https://www.bentoml.com/blog/benchmarking-llm-inference-backends"}},"_tags":["story","author_chaoyu","story_40886218"],"author":"chaoyu","children":[40892750,40894829],"created_at":"2024-07-05T21:32:55Z","created_at_i":1720215175,"num_comments":1,"objectID":"40886218","points":15,"story_id":40886218,"title":"Benchmarking LLM Inference Back Ends: VLLM, LMDeploy, MLC-LLM, TensorRT-LLM, TGI","updated_at":"2025-05-01T04:07:04Z","url":"https://www.bentoml.com/blog/benchmarking-llm-inference-backends"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"skeptrune"},"story_text":{"matchLevel":"none","matchedWords":[],"value":"If you just want to use it, try here - <a href=\"https://pdf2md.trieve.ai\">https://pdf2md.trieve.ai</a> . I think the LLM's are astoundingly good at converting complex powerpoint style infographics.<p>I wouldn't normally think folks on HN would find this interesting as the general concept has been posted about already in the past few months. We were heavily inspired by Zerox[1].<p>However, the stack we went with was fun and over-engineered which is more likely to create interesting discussion. We use all the same tools at Trieve (our main product), but wanted to see if they would be a good fit for something that needed to get built in a tighter timeline and we think they were!<p>Took us 2 weeks to get this setup end-to-end and it's by no means complete (see roadmap in linked README). However, it's cool that a relatively cookie cutter web service like this can be created with pure open-source dependencies and non-standard Rust tooling so quickly. Rust won't kill your startup!<p>- Minijinja templates for the UI[2]<p>- PDFObject for doc display in-browser[3]<p>- actix/actix-web HTTP server framework[4]<p>- Redis queue macro for worker async processing[5]<p>- Clickhouse for task storage[6]<p>- chm CLI to handle Clickhouse migrations[7]<p>- MinIO S3 for object storage[8]<p>[1]: <a href=\"https://news.ycombinator.com/item?id=41048194\">https://news.ycombinator.com/item?id=41048194</a><p>[2]: <a href=\"https://github.com/mitsuhiko/minijinja\">https://github.com/mitsuhiko/minijinja</a><p>[3]: <a href=\"https://github.com/pipwerks/pdfobject\">https://github.com/pipwerks/pdfobject</a><p>[4]: <a href=\"https://github.com/actix/actix-web\">https://github.com/actix/actix-web</a><p>[5]: <a href=\"https://github.com/devflowinc/trieve/blob/main/pdf2md/server/src/operators/redis.rs#L7-L62\">https://github.com/devflowinc/trieve/blob/main/pdf2md/server...</a><p>[6]: <a href=\"https://github.com/ClickHouse/ClickHouse\">https://github.com/ClickHouse/ClickHouse</a><p>[7]: <a href=\"https://docs.rs/chm/latest/chm/index.html\" rel=\"nofollow\">https://docs.rs/chm/latest/chm/index.html</a><p>[8]: <a href=\"https://github.com/minio/minio\">https://github.com/minio/minio</a>"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"Show HN: PDF2MD \u2013 Rust+Redis+ClickHouse+<em>VLLM</em> conversion pipeline for PDFs"},"url":{"matchLevel":"none","matchedWords":[],"value":"https://github.com/devflowinc/trieve/tree/main/pdf2md"}},"_tags":["story","author_skeptrune","story_42208648","show_hn"],"author":"skeptrune","created_at":"2024-11-21T21:05:41Z","created_at_i":1732223141,"num_comments":0,"objectID":"42208648","points":13,"story_id":42208648,"story_text":"If you just want to use it, try here - <a href=\"https:&#x2F;&#x2F;pdf2md.trieve.ai\">https:&#x2F;&#x2F;pdf2md.trieve.ai</a> . I think the LLM&#x27;s are astoundingly good at converting complex powerpoint style infographics.<p>I wouldn&#x27;t normally think folks on HN would find this interesting as the general concept has been posted about already in the past few months. We were heavily inspired by Zerox[1].<p>However, the stack we went with was fun and over-engineered which is more likely to create interesting discussion. We use all the same tools at Trieve (our main product), but wanted to see if they would be a good fit for something that needed to get built in a tighter timeline and we think they were!<p>Took us 2 weeks to get this setup end-to-end and it&#x27;s by no means complete (see roadmap in linked README). However, it&#x27;s cool that a relatively cookie cutter web service like this can be created with pure open-source dependencies and non-standard Rust tooling so quickly. Rust won&#x27;t kill your startup!<p>- Minijinja templates for the UI[2]<p>- PDFObject for doc display in-browser[3]<p>- actix&#x2F;actix-web HTTP server framework[4]<p>- Redis queue macro for worker async processing[5]<p>- Clickhouse for task storage[6]<p>- chm CLI to handle Clickhouse migrations[7]<p>- MinIO S3 for object storage[8]<p>[1]: <a href=\"https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=41048194\">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=41048194</a><p>[2]: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;mitsuhiko&#x2F;minijinja\">https:&#x2F;&#x2F;github.com&#x2F;mitsuhiko&#x2F;minijinja</a><p>[3]: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;pipwerks&#x2F;pdfobject\">https:&#x2F;&#x2F;github.com&#x2F;pipwerks&#x2F;pdfobject</a><p>[4]: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;actix&#x2F;actix-web\">https:&#x2F;&#x2F;github.com&#x2F;actix&#x2F;actix-web</a><p>[5]: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;devflowinc&#x2F;trieve&#x2F;blob&#x2F;main&#x2F;pdf2md&#x2F;server&#x2F;src&#x2F;operators&#x2F;redis.rs#L7-L62\">https:&#x2F;&#x2F;github.com&#x2F;devflowinc&#x2F;trieve&#x2F;blob&#x2F;main&#x2F;pdf2md&#x2F;server...</a><p>[6]: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;ClickHouse&#x2F;ClickHouse\">https:&#x2F;&#x2F;github.com&#x2F;ClickHouse&#x2F;ClickHouse</a><p>[7]: <a href=\"https:&#x2F;&#x2F;docs.rs&#x2F;chm&#x2F;latest&#x2F;chm&#x2F;index.html\" rel=\"nofollow\">https:&#x2F;&#x2F;docs.rs&#x2F;chm&#x2F;latest&#x2F;chm&#x2F;index.html</a><p>[8]: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;minio&#x2F;minio\">https:&#x2F;&#x2F;github.com&#x2F;minio&#x2F;minio</a>","title":"Show HN: PDF2MD \u2013 Rust+Redis+ClickHouse+VLLM conversion pipeline for PDFs","updated_at":"2024-12-30T23:52:44Z","url":"https://github.com/devflowinc/trieve/tree/main/pdf2md"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"sherlockxu"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"Benchmarking LLM Inference Back Ends: <em>VLLM</em>, LMDeploy, MLC-LLM, TRT-LLM, and TGI"},"url":{"matchLevel":"none","matchedWords":[],"value":"https://bentoml.com/blog/benchmarking-llm-inference-backends"}},"_tags":["story","author_sherlockxu","story_40601794"],"author":"sherlockxu","children":[40601960,40604730,40605032],"created_at":"2024-06-06T20:08:00Z","created_at_i":1717704480,"num_comments":2,"objectID":"40601794","points":12,"story_id":40601794,"title":"Benchmarking LLM Inference Back Ends: VLLM, LMDeploy, MLC-LLM, TRT-LLM, and TGI","updated_at":"2024-09-20T17:09:35Z","url":"https://bentoml.com/blog/benchmarking-llm-inference-backends"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"zhwu"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"Serving LLM 24x Faster on the Cloud with <em>VLLM</em> and SkyPilot"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-<em>vllm</em>-and-skypilot/"}},"_tags":["story","author_zhwu","story_36523357"],"author":"zhwu","children":[36523632],"created_at":"2023-06-29T17:11:17Z","created_at_i":1688058677,"num_comments":1,"objectID":"36523357","points":12,"story_id":36523357,"title":"Serving LLM 24x Faster on the Cloud with VLLM and SkyPilot","updated_at":"2024-09-20T14:23:54Z","url":"https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"btwillard"},"story_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"Our project, Outlines, now offers guided/constrained generation (e.g. according to a JSON schema) via the <em>VLLM</em> library.<p>My colleague, R\u00e9mi, created some patches that allow one to pass <em>vLLM</em> a JSON schema along with the prompt, which dramatically simplifies deployment of JSON-guided generation.  He also added a new `serve` interface that puts it all together and makes serving such models a 2-3 line process.<p>Check it out and tell us what you think!"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"Show HN: <em>VLLM</em> with JSON Guided Generation"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"https://outlines-dev.github.io/outlines/reference/<em>vllm</em>/"}},"_tags":["story","author_btwillard","story_38735503","show_hn"],"author":"btwillard","children":[38735564],"created_at":"2023-12-22T16:12:11Z","created_at_i":1703261531,"num_comments":3,"objectID":"38735503","points":11,"story_id":38735503,"story_text":"Our project, Outlines, now offers guided&#x2F;constrained generation (e.g. according to a JSON schema) via the VLLM library.<p>My colleague, R\u00e9mi, created some patches that allow one to pass vLLM a JSON schema along with the prompt, which dramatically simplifies deployment of JSON-guided generation.  He also added a new `serve` interface that puts it all together and makes serving such models a 2-3 line process.<p>Check it out and tell us what you think!","title":"Show HN: VLLM with JSON Guided Generation","updated_at":"2024-09-20T15:58:59Z","url":"https://outlines-dev.github.io/outlines/reference/vllm/"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"twelvenmonkeys"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"Free <em>vLLM</em> Course: Inference, Compression, Benchmarks"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"https://www.deeplearning.ai/courses/fast-and-efficient-llm-inference-with-<em>vllm</em>"}},"_tags":["story","author_twelvenmonkeys","story_48386932"],"author":"twelvenmonkeys","children":[48386933],"created_at":"2026-06-03T17:27:14Z","created_at_i":1780507634,"num_comments":0,"objectID":"48386932","points":8,"story_id":48386932,"title":"Free vLLM Course: Inference, Compression, Benchmarks","updated_at":"2026-06-05T14:16:48Z","url":"https://www.deeplearning.ai/courses/fast-and-efficient-llm-inference-with-vllm"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"pember"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"Heaps do lie: debugging a memory leak in <em>vLLM</em>"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"https://mistral.ai/news/debugging-memory-leak-in-<em>vllm</em>"}},"_tags":["story","author_pember","story_46707015"],"author":"pember","children":[46824814],"created_at":"2026-01-21T15:27:39Z","created_at_i":1769009259,"num_comments":1,"objectID":"46707015","points":7,"story_id":46707015,"title":"Heaps do lie: debugging a memory leak in vLLM","updated_at":"2026-03-05T23:22:30Z","url":"https://mistral.ai/news/debugging-memory-leak-in-vllm"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"ericcurtin"},"story_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"Hi HN, I\u2019m one of the authors of this post.<p>We\u2019ve updated Docker Model Runner to support <em>vLLM</em> alongside the existing llama.cpp backend. The goal is to bridge the gap between local prototyping (often done with GGUF/llama.cpp) and high-throughput production (often done with Safetensors/<em>vLLM</em>) using a consistent Docker workflow.<p>Key technical details:<p>Auto-routing: The tool detects the model format. If you pull a GGUF model, it routes to llama.cpp. If you pull a Safetensors model, it routes to <em>vLLM</em>.<p>API: It exposes an OpenAI-compatible API (/v1/chat/completions), so the client code doesn't need to change based on the backend.<p>Usage: It\u2019s just docker model run ai/smollm2-<em>vllm</em>.<p>Current Limitations:<p>Right now, the <em>vLLM</em> backend is optimized for x86_64 with Nvidia GPUs.<p>We are actively working on WSL2 support for Windows users and DGX Spark compatibility.<p>Happy to answer any questions about the integration or the roadmap!<p><a href=\"https://www.docker.com/blog/docker-model-runner-integrates-vllm/\" rel=\"nofollow\">https://www.docker.com/blog/docker-model-runner-integrates-v...</a>"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["vllm"],"value":"Show HN: Docker Model Runner Integrates <em>vLLM</em> for High-Throughput Inference"},"url":{"matchLevel":"none","matchedWords":[],"value":"https://github.com/docker/model-runner"}},"_tags":["story","author_ericcurtin","story_45996081","show_hn"],"author":"ericcurtin","children":[45996088],"created_at":"2025-11-20T18:41:51Z","created_at_i":1763664111,"num_comments":1,"objectID":"45996081","points":7,"story_id":45996081,"story_text":"Hi HN, I\u2019m one of the authors of this post.<p>We\u2019ve updated Docker Model Runner to support vLLM alongside the existing llama.cpp backend. The goal is to bridge the gap between local prototyping (often done with GGUF&#x2F;llama.cpp) and high-throughput production (often done with Safetensors&#x2F;vLLM) using a consistent Docker workflow.<p>Key technical details:<p>Auto-routing: The tool detects the model format. If you pull a GGUF model, it routes to llama.cpp. If you pull a Safetensors model, it routes to vLLM.<p>API: It exposes an OpenAI-compatible API (&#x2F;v1&#x2F;chat&#x2F;completions), so the client code doesn&#x27;t need to change based on the backend.<p>Usage: It\u2019s just docker model run ai&#x2F;smollm2-vllm.<p>Current Limitations:<p>Right now, the vLLM backend is optimized for x86_64 with Nvidia GPUs.<p>We are actively working on WSL2 support for Windows users and DGX Spark compatibility.<p>Happy to answer any questions about the integration or the roadmap!<p><a href=\"https:&#x2F;&#x2F;www.docker.com&#x2F;blog&#x2F;docker-model-runner-integrates-vllm&#x2F;\" rel=\"nofollow\">https:&#x2F;&#x2F;www.docker.com&#x2F;blog&#x2F;docker-model-runner-integrates-v...</a>","title":"Show HN: Docker Model Runner Integrates vLLM for High-Throughput Inference","updated_at":"2026-04-01T17:11:45Z","url":"https://github.com/docker/model-runner"}],"hitsPerPage":20,"nbHits":307689,"nbPages":50,"page":0,"params":"query=vllm&advancedSyntax=true&analyticsTags=backend","processingTimeMS":25,"processingTimingsMS":{"_request":{"roundTrip":13},"fetch":{"query":2,"scanning":21,"total":24},"total":25},"query":"vllm","serverTimeMS":26}