{"exhaustive":{"nbHits":false,"typo":false},"exhaustiveNbHits":false,"exhaustiveTypo":false,"hits":[{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"ThouYS"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"ironically, <em>coding</em> is the area that small models excel most at. <em>qwen</em> 3.6 27b is an insane <em>agent</em>ic model. but Opus can be much more than a programmer: language tutor, knowledgeable &quot;friend&quot;, therapist"},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Gemma 4 12B: A unified, encoder-free multimodal model"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/"}},"_tags":["comment","author_ThouYS","story_48385906"],"author":"ThouYS","comment_text":"ironically, coding is the area that small models excel most at. qwen 3.6 27b is an insane agentic model. but Opus can be much more than a programmer: language tutor, knowledgeable &quot;friend&quot;, therapist","created_at":"2026-06-04T16:36:08Z","created_at_i":1780590968,"objectID":"48401092","parent_id":48397306,"story_id":48385906,"story_title":"Gemma 4 12B: A unified, encoder-free multimodal model","story_url":"https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/","updated_at":"2026-06-04T16:39:48Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"dofm"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"It does appear to have training for javascript and PHP, from what I can see, and pretty solid knowledge of wordpress and woocommerce. 
I would guess it has beginner-friendly knowledge of Python, too?<p>(Though it is gaslighting me about PHP anonymous functions.)<p>I would not use it to write code (the MoE 26B writes really good PHP), but it appears to have absolutely good enough knowledge to write implementation plans, and I think that could be useful in a sort of <em>agent</em>ic <em>coding</em> tutorial environment.<p>I test these models with simple things. My favourite mini test is asking an AI to write a &quot;last login&quot; tracker facility for wordpress with a sortable admin column, which is trivial code \u2014 only a few lines -- but touches on a reasonably deep bit of the WP API. If you ask it to prompt you with clarifying questions, those questions are quite revealing.<p>It can write the code. Not tested it but I am sure it works. It's not as elegant.<p>It is not as good at understanding nuanced instructions as either the 26B or the sparse <em>Qwen</em> 3.6. There are concise things you can say in a prompt to <em>Qwen</em> 3.6 that have it draw logical conclusions that fully impress me.<p>I am more impressed by it than I expected. I reckon this would be quite useful in a tutorial tool.<p>(I say this as a sort of qualified cynic; I think much of the AI circus is a farce. But if these things are to ever be useful for teaching without making people dependent on some cloud &quot;intelligence tap&quot;, this is progress)"},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Gemma 4 12B: A unified, encoder-free multimodal model"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/"}},"_tags":["comment","author_dofm","story_48385906"],"author":"dofm","comment_text":"It does appear to have training for javascript and PHP, from what I can see, and pretty solid knowledge of wordpress and woocommerce. 
I would guess it has beginner-friendly knowledge of Python, too?<p>(Though it is gaslighting me about PHP anonymous functions.)<p>I would not use it to write code (the MoE 26B writes really good PHP), but it appears to have absolutely good enough knowledge to write implementation plans, and I think that could be useful in a sort of agentic coding tutorial environment.<p>I test these models with simple things. My favourite mini test is asking an AI to write a &quot;last login&quot; tracker facility for wordpress with a sortable admin column, which is trivial code \u2014 only a few lines -- but touches on a reasonably deep bit of the WP API. If you ask it to prompt you with clarifying questions, those questions are quite revealing.<p>It can write the code. Not tested it but I am sure it works. It&#x27;s not as elegant.<p>It is not as good at understanding nuanced instructions as either the 26B or the sparse Qwen 3.6. There are concise things you can say in a prompt to Qwen 3.6 that have it draw logical conclusions that fully impress me.<p>I am more impressed by it than I expected. I reckon this would be quite useful in a tutorial tool.<p>(I say this as a sort of qualified cynic; I think much of the AI circus is a farce. 
But if these things are to ever be useful for teaching without making people dependent on some cloud &quot;intelligence tap&quot;, this is progress)","created_at":"2026-06-04T14:27:41Z","created_at_i":1780583261,"objectID":"48399183","parent_id":48390110,"story_id":48385906,"story_title":"Gemma 4 12B: A unified, encoder-free multimodal model","story_url":"https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/","updated_at":"2026-06-04T14:42:47Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"dirkg"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"&gt; For 16GB laptops, <em>Qwen</em> 3.5 9B is the undisputed champ.<p>you can run <em>qwen</em> 3.6 35BA3B on a 12-16GB vram gpu and ot works pretty well.<p><a href=\"https://www.youtube.com/watch?v=8F_5pdcD3HY&amp;t=1s\" rel=\"nofollow\">https://www.youtube.com/watch?v=8F_5pdcD3HY&amp;t=1s</a><p>even the 27B in some quants can fit.<p><a href=\"https://www.reddit.com/r/LocalLLaMA/comments/1tkmgwj/qwen27biq4_ks_for_ik_llamacpp_especially_for/\" rel=\"nofollow\">https://www.reddit.com/r/LocalLLaMA/comments/1tkmgwj/qwen27b...</a><p><em>qwen</em> IMO is far better for <em>coding</em>, esp <em>agent</em>ic <em>coding</em> when combined with something like Pi, it comes probably close enough to Sonnet for a lot of use cases.<p>Gemma family is better for almost all other tasks you'd use a local llm for."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Gemma 4 12B: A unified, encoder-free multimodal model"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/"}},"_tags":["comment","author_dirkg","story_48385906"],"author":"dirkg","children":[48398354,48400980,48399213],"comment_text":"&gt; For 16GB laptops, Qwen 3.5 9B is the undisputed champ.<p>you can run qwen 3.6 35BA3B on a 12-16GB vram 
gpu and it works pretty well.<p><a href=\"https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=8F_5pdcD3HY&amp;t=1s\" rel=\"nofollow\">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=8F_5pdcD3HY&amp;t=1s</a><p>even the 27B in some quants can fit.<p><a href=\"https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;LocalLLaMA&#x2F;comments&#x2F;1tkmgwj&#x2F;qwen27biq4_ks_for_ik_llamacpp_especially_for&#x2F;\" rel=\"nofollow\">https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;LocalLLaMA&#x2F;comments&#x2F;1tkmgwj&#x2F;qwen27b...</a><p>qwen IMO is far better for coding, esp agentic coding when combined with something like Pi, it comes probably close enough to Sonnet for a lot of use cases.<p>Gemma family is better for almost all other tasks you&#x27;d use a local llm for.","created_at":"2026-06-04T05:25:15Z","created_at_i":1780550715,"objectID":"48394339","parent_id":48390110,"story_id":48385906,"story_title":"Gemma 4 12B: A unified, encoder-free multimodal model","story_url":"https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/","updated_at":"2026-06-04T21:43:19Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"SwellJoe"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"Yep, I have a Strix Halo and while it <i>can</i> run models bigger than <em>Qwen</em> 3.6 27b, it's not usable interactively when you do. ds4 patched for ROCm works, but at such a slow speed, it's not usable for <em>coding</em> <em>agents</em>.<p>The Nvidia boxes have only slightly more memory bandwidth, so I wouldn't expect them to be notably faster. 
At least not enough to make it useful interactively at that scale."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Gemma 4 12B: A unified, encoder-free multimodal model"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/"}},"_tags":["comment","author_SwellJoe","story_48385906"],"author":"SwellJoe","children":[48392001],"comment_text":"Yep, I have a Strix Halo and while it <i>can</i> run models bigger than Qwen 3.6 27b, it&#x27;s not usable interactively when you do. ds4 patched for ROCm works, but at such a slow speed, it&#x27;s not usable for coding agents.<p>The Nvidia boxes have only slightly more memory bandwidth, so I wouldn&#x27;t expect them to be notably faster. At least not enough to make it useful interactively at that scale.","created_at":"2026-06-04T00:04:34Z","created_at_i":1780531474,"objectID":"48391881","parent_id":48391775,"story_id":48385906,"story_title":"Gemma 4 12B: A unified, encoder-free multimodal model","story_url":"https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/","updated_at":"2026-06-04T22:34:34Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"fny"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"&gt; A.I. is built on our collective intelligence: our books, songs, artwork, journalism, computer code, scientific research, videos, conversations, images and ideas spanning generations<p>I know many here would scoff at nationalizing a private company, but AI is a usurpation of human knowledge and quite literally at times. (Every AI company was embroiled in copyright lawsuits and lord knows what <em>Qwen</em> et al are up to.)<p>In turn, everyone knows labor displacement is <em>coming</em>. My bet is the next recession will end up being brutal for this reason. 
To me, labor displacement and the social consequences are a potentially *<i>catastrophic*</i> negative externality. Should not there be a tax to offset the &quot;frictional&quot; unemployment? What happens when people lose a high skill job and will no longer be able to afford their mortgage?<p>Also, why are people always talking about AI as if it's an angel or satan? The degree to which we're doomed is an open question, much like a tornado... so why <em>aren't</em> we thinking about taxes on AI like a tornado insurance fund?"},"story_title":{"matchLevel":"none","matchedWords":[],"value":"The Public Should Own Half of the Big A.I. Companies"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://www.sanders.senate.gov/op-eds/the-public-should-own-half-of-the-big-a-i-companies/"}},"_tags":["comment","author_fny","story_48386551"],"author":"fny","children":[48387297,48387324,48387082,48387170,48389023,48388412],"comment_text":"&gt; A.I. is built on our collective intelligence: our books, songs, artwork, journalism, computer code, scientific research, videos, conversations, images and ideas spanning generations<p>I know many here would scoff at nationalizing a private company, but AI is a usurpation of human knowledge and quite literally at times. (Every AI company was embroiled in copyright lawsuits and lord knows what Qwen et al are up to.)<p>In turn, everyone knows labor displacement is coming. My bet is the next recession will end up being brutal for this reason. To me, labor displacement and the social consequences are a potentially *<i>catastrophic*</i> negative externality. Should not there be a tax to offset the &quot;frictional&quot; unemployment? What happens when people lose a high skill job and will no longer be able to afford their mortgage?<p>Also, why are people always talking about AI as if it&#x27;s an angel or satan? The degree to which we&#x27;re doomed is an open question, much like a tornado... 
so why aren&#x27;t we thinking about taxes on AI like a tornado insurance fund?","created_at":"2026-06-03T17:33:49Z","created_at_i":1780508029,"objectID":"48387013","parent_id":48386551,"story_id":48386551,"story_title":"The Public Should Own Half of the Big A.I. Companies","story_url":"https://www.sanders.senate.gov/op-eds/the-public-should-own-half-of-the-big-a-i-companies/","updated_at":"2026-06-04T19:10:48Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"evilduck"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"I think we are approaching that now, with correct expectations. With frontier large models you can often one-shot tasks with vague prompts for stuff like creating CRUD APIs and dashboards around a simple data model since it's such a solved-problem now. With something like <em>Qwen3</em>.6 27B or 35B-A3B and a Strix Halo level computer or a MBP with 32GB or more or RAM, you may need to be more explicit and stay involved and be a little more patient, but you can absolutely get work done with it or delegate tasks to it successfully.<p>My Framework Desktop does a lot of similar work as my Claude  subscription at work (Cowork, chats) for 100W of power draw and a little patience waiting for a slow GPU with limited memory bandwidth to crunch the numbers. <em>Agent</em>ic <em>coding</em> is obviously weaker but CRUD development and visualization dashboards are within reach, and I'm usually pleasantly surprised at its ability to self-manage devops."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Claude Opus 4.8"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://www.anthropic.com/news/claude-opus-4-8"}},"_tags":["comment","author_evilduck","story_48311647"],"author":"evilduck","comment_text":"I think we are approaching that now, with correct expectations. 
With frontier large models you can often one-shot tasks with vague prompts for stuff like creating CRUD APIs and dashboards around a simple data model since it&#x27;s such a solved-problem now. With something like Qwen3.6 27B or 35B-A3B and a Strix Halo level computer or a MBP with 32GB or more or RAM, you may need to be more explicit and stay involved and be a little more patient, but you can absolutely get work done with it or delegate tasks to it successfully.<p>My Framework Desktop does a lot of similar work as my Claude  subscription at work (Cowork, chats) for 100W of power draw and a little patience waiting for a slow GPU with limited memory bandwidth to crunch the numbers. Agentic coding is obviously weaker but CRUD development and visualization dashboards are within reach, and I&#x27;m usually pleasantly surprised at its ability to self-manage devops.","created_at":"2026-05-29T06:00:53Z","created_at_i":1780034453,"objectID":"48319582","parent_id":48312899,"story_id":48311647,"story_title":"Claude Opus 4.8","story_url":"https://www.anthropic.com/news/claude-opus-4-8","updated_at":"2026-05-31T12:25:14Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"ramonga"},"story_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"Hey HN,<p>We believe we have the easiest onboarding from signup to being able to spin up <em>coding</em> <em>agents</em> in slack like Stripe, Ramp &amp; Coinbase.<p>Demo of the onboarding: <a href=\"https://www.tella.tv/video/connecting-cord-to-slack-1-19ep\" rel=\"nofollow\">https://www.tella.tv/video/connecting-cord-to-slack-1-19ep</a><p>Every signup gets free open source models (Kimi K2.6, <em>Qwen</em> 3.6, DeepSeek V4, MiniMax M2.7, Gemma 4, GLM 5.1) + $100 credit for Anthropic/OpenAI. 
You can also use your API keys &amp; Codex subscription."},"title":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["coding"],"value":"Show HN: Free open source <em>coding</em> models in Slack"},"url":{"matchLevel":"none","matchedWords":[],"value":"https://www.runcord.com/"}},"_tags":["story","author_ramonga","story_48310994","show_hn"],"author":"ramonga","created_at":"2026-05-28T16:11:13Z","created_at_i":1779984673,"num_comments":0,"objectID":"48310994","points":3,"story_id":48310994,"story_text":"Hey HN,<p>We believe we have the easiest onboarding from signup to being able to spin up coding agents in slack like Stripe, Ramp &amp; Coinbase.<p>Demo of the onboarding: <a href=\"https:&#x2F;&#x2F;www.tella.tv&#x2F;video&#x2F;connecting-cord-to-slack-1-19ep\" rel=\"nofollow\">https:&#x2F;&#x2F;www.tella.tv&#x2F;video&#x2F;connecting-cord-to-slack-1-19ep</a><p>Every signup gets free open source models (Kimi K2.6, Qwen 3.6, DeepSeek V4, MiniMax M2.7, Gemma 4, GLM 5.1) + $100 credit for Anthropic&#x2F;OpenAI. 
You can also use your API keys &amp; Codex subscription.","title":"Show HN: Free open source coding models in Slack","updated_at":"2026-05-28T17:28:18Z","url":"https://www.runcord.com/"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"simonw"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"The problem with models like <em>Qwen</em> 3.6 35B (which really is an excellent model) is that my expectations of what a model can do have gone SO high now.<p>Here's a prompt I just ran against Claude Opus 4.7:<p>&gt; Use python3 to experiment with whether the SQLite3 authorizer mechanism can be used to detect an INSERT OR REPLACE based just on running an explain query without examining the SQL string itself<p>Opus nailed it: <a href=\"https://claude.ai/share/c4212606-3fee-4b7c-bc97-505e0348ccac\" rel=\"nofollow\">https://claude.ai/share/c4212606-3fee-4b7c-bc97-505e0348ccac</a><p>I tried the same thing against <em>qwen</em>/qwen3.5-35b-a3b running locally in lmstudio, with the Pi <em>coding</em> <em>agent</em>. At first it looked like it was going to do great! 
And then it fell apart over the course of several tool calls: <a href=\"https://gisthost.github.io/?8ae2f842df619fb7fd8f1ccd82fe41c7\" rel=\"nofollow\">https://gisthost.github.io/?8ae2f842df619fb7fd8f1ccd82fe41c7</a><p>I'm used to GPT-5.5 and Opus 4.7 handling that kind of prompt without any problems at all."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"I think Anthropic and OpenAI have found product-market fit"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://simonwillison.net/2026/May/27/product-market-fit/"}},"_tags":["comment","author_simonw","story_48296794"],"author":"simonw","children":[48304582,48303721,48303107,48303676],"comment_text":"The problem with models like Qwen 3.6 35B (which really is an excellent model) is that my expectations of what a model can do have gone SO high now.<p>Here&#x27;s a prompt I just ran against Claude Opus 4.7:<p>&gt; Use python3 to experiment with whether the SQLite3 authorizer mechanism can be used to detect an INSERT OR REPLACE based just on running an explain query without examining the SQL string itself<p>Opus nailed it: <a href=\"https:&#x2F;&#x2F;claude.ai&#x2F;share&#x2F;c4212606-3fee-4b7c-bc97-505e0348ccac\" rel=\"nofollow\">https:&#x2F;&#x2F;claude.ai&#x2F;share&#x2F;c4212606-3fee-4b7c-bc97-505e0348ccac</a><p>I tried the same thing against qwen&#x2F;qwen3.5-35b-a3b running locally in lmstudio, with the Pi coding agent. At first it looked like it was going to do great! 
And then it fell apart over the course of several tool calls: <a href=\"https:&#x2F;&#x2F;gisthost.github.io&#x2F;?8ae2f842df619fb7fd8f1ccd82fe41c7\" rel=\"nofollow\">https:&#x2F;&#x2F;gisthost.github.io&#x2F;?8ae2f842df619fb7fd8f1ccd82fe41c7</a><p>I&#x27;m used to GPT-5.5 and Opus 4.7 handling that kind of prompt without any problems at all.","created_at":"2026-05-27T23:28:01Z","created_at_i":1779924481,"objectID":"48302188","parent_id":48301994,"story_id":48296794,"story_title":"I think Anthropic and OpenAI have found product-market fit","story_url":"https://simonwillison.net/2026/May/27/product-market-fit/","updated_at":"2026-05-31T03:01:42Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"canadiantim"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"So what's best low cost <em>coding</em> <em>agent</em> these days? Kimi 2.6? <em>Qwen</em>'s latest closed model? Composer 2.5? DeepSeek?"},"story_title":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["coding","agent"],"value":"DeepSeek reasonix, DeepSeek native <em>coding</em> <em>agent</em> with high caching and low cost"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://esengine.github.io/DeepSeek-Reasonix/"}},"_tags":["comment","author_canadiantim","story_48256953"],"author":"canadiantim","children":[48257558,48257629,48257589,48258381,48257603,48257723,48259655,48257578],"comment_text":"So what&#x27;s best low cost coding agent these days? Kimi 2.6? Qwen&#x27;s latest closed model? Composer 2.5? 
DeepSeek?","created_at":"2026-05-24T14:29:25Z","created_at_i":1779632965,"objectID":"48257541","parent_id":48256953,"story_id":48256953,"story_title":"DeepSeek reasonix, DeepSeek native coding agent with high caching and low cost","story_url":"https://esengine.github.io/DeepSeek-Reasonix/","updated_at":"2026-05-27T17:01:59Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"truetotosse"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"Hi HN, I built llmrequirements.com to answer &quot;what GPU should I buy for local models?&quot; for myself without leaving the site to google something.<p>It's a static site that maps every model in the open-weights ecosystem (Llama, <em>Qwen</em>, Mistral, DeepSeek, GLM, Kimi, Flux, Wan, ...) to the hardware that can actually run it, with three numbers per build/model pair sourced from llama.cpp / vLLM benchmarks rather than vendor marketing:<p><pre><code>  - tg/s (single-stream generation)\n  - pp (prefill / prompt-processing throughput)\n  - TTFT at 100k-token context, null when the KV cache won't fit\n</code></pre>\nHardware ranges from a Framework laptop up to an 8x H200 rack; software-stack maturity and extensibility get explicit 0-5 scores.<p>The data exported to a public repo, so anyone can PR a correction and the diff is reviewable.<p>Project started from the picker as a landing, but now it has state of the local AI page - SOLAI because all use cases now are somewhat unified under <em>coding</em>, <em>agent</em>, personal assistant. 
And models which can run such use cases are well defined as well."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Show HN: A picker that maps local LLMs to hardware, hardware to LLMs"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://llmrequirements.com/"}},"_tags":["comment","author_truetotosse","story_48256882"],"author":"truetotosse","children":[48262710],"comment_text":"Hi HN, I built llmrequirements.com to answer &quot;what GPU should I buy for local models?&quot; for myself without leaving the site to google something.<p>It&#x27;s a static site that maps every model in the open-weights ecosystem (Llama, Qwen, Mistral, DeepSeek, GLM, Kimi, Flux, Wan, ...) to the hardware that can actually run it, with three numbers per build&#x2F;model pair sourced from llama.cpp &#x2F; vLLM benchmarks rather than vendor marketing:<p><pre><code>  - tg&#x2F;s (single-stream generation)\n  - pp (prefill &#x2F; prompt-processing throughput)\n  - TTFT at 100k-token context, null when the KV cache won&#x27;t fit\n</code></pre>\nHardware ranges from a Framework laptop up to an 8x H200 rack; software-stack maturity and extensibility get explicit 0-5 scores.<p>The data exported to a public repo, so anyone can PR a correction and the diff is reviewable.<p>Project started from the picker as a landing, but now it has state of the local AI page - SOLAI because all use cases now are somewhat unified under coding, agent, personal assistant. 
And models which can run such use cases are well defined as well.","created_at":"2026-05-24T13:05:27Z","created_at_i":1779627927,"objectID":"48256977","parent_id":48256882,"story_id":48256882,"story_title":"Show HN: A picker that maps local LLMs to hardware, hardware to LLMs","story_url":"https://llmrequirements.com/","updated_at":"2026-05-25T01:42:49Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"throawayonthe"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"&gt; don't seem to outperform <em>Qwen3</em>.6 that much in <em>agent</em>ic <em>coding</em>/tasks<p>idk i imagine you'll hit less edges with a larger model just because.. more data<p>if you think of them as a kind of NN compression, it's ~obvious that the larger model can have more stuff encoded in it and hopefully accessible<p>i don't use LLMs much right now but using midrange models seems like an unnecessary compromise in most cases, especially since the big open models sound to be rivaling opus and not just sonnet :p"},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Was my $48K GPU server worth it?"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://rosmine.ai/2026/05/13/was-my-48k-gpu-worth-it/"}},"_tags":["comment","author_throawayonthe","story_48184402"],"author":"throawayonthe","comment_text":"&gt; don&#x27;t seem to outperform Qwen3.6 that much in agentic coding&#x2F;tasks<p>idk i imagine you&#x27;ll hit less edges with a larger model just because.. 
more data<p>if you think of them as a kind of NN compression, it&#x27;s ~obvious that the larger model can have more stuff encoded in it and hopefully accessible<p>i don&#x27;t use LLMs much right now but using midrange models seems like an unnecessary compromise in most cases, especially since the big open models sound to be rivaling opus and not just sonnet :p","created_at":"2026-05-22T11:12:56Z","created_at_i":1779448376,"objectID":"48234293","parent_id":48229447,"story_id":48184402,"story_title":"Was my $48K GPU server worth it?","story_url":"https://rosmine.ai/2026/05/13/was-my-48k-gpu-worth-it/","updated_at":"2026-05-22T11:18:10Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"gpt5"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"Sell the machine for $4K, use it to pay for Codex Pro for everyone for a year. Everyone will be significantly more productive and happy.<p>It's not even a real comparison if they are actually using them for <em>coding</em>.<p>If you are deploying always running <em>agents</em> (e.g. monitoring logs and services) then sure - a <em>QWEN</em> local server is a good choice. But for <em>coding</em> the cost in productivity of using a lower performing model is way too high."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Was my $48K GPU server worth it?"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://rosmine.ai/2026/05/13/was-my-48k-gpu-worth-it/"}},"_tags":["comment","author_gpt5","story_48184402"],"author":"gpt5","children":[48233404,48232975],"comment_text":"Sell the machine for $4K, use it to pay for Codex Pro for everyone for a year. Everyone will be significantly more productive and happy.<p>It&#x27;s not even a real comparison if they are actually using them for coding.<p>If you are deploying always running agents (e.g. monitoring logs and services) then sure - a QWEN local server is a good choice. 
But for coding the cost in productivity of using a lower performing model is way too high.","created_at":"2026-05-22T04:46:07Z","created_at_i":1779425167,"objectID":"48232048","parent_id":48229996,"story_id":48184402,"story_title":"Was my $48K GPU server worth it?","story_url":"https://rosmine.ai/2026/05/13/was-my-48k-gpu-worth-it/","updated_at":"2026-06-01T08:33:47Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"kgeist"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"I administer a simple AI server in the office, which just uses a single RTX 5090 but is able to serve ~80 people throughout the day. I'm impressed by <em>Qwen3</em>.6-27b's capabilities in <em>agent</em>ic <em>coding</em>/tasks so far. Devs say it's not much different from Sonnet 4.6 on many tasks (sometimes it even outperformed it), 40-60 tok/sec, up to 260k context. The server cost about $10k with all the bells and whistles.<p>I spent a lot of time researching/adding/benchmarking many custom modifications to the software stack and its settings to make the server optimally handle the load with just 1 RTX 5090 without losing quality, but it's still not enough, and the wait times in the queue are getting longer. We're at the limits of the hardware, and I'm out of tricks.<p>The experiment was kind of a success, and the CTO agrees we should scale it. With our own infra, we could run <em>agent</em>s 24/7 on everything. Currently, a lot of use cases for the cloud providers are completely blocked by PII/trade secret concerns (our infosec department doesn't buy the &quot;zero retention&quot; promise), plus you don't have to think about billing/budgets/etc. anymore.<p>Now I can't decide how to scale it. On one hand, I'd like to run larger models. And we have the budget to buy, say, 8xH200. 
But in many benchmarks, the larger models that do fit in 8xH200 comfortably and can serve many parallel requests with acceptable speed/quality don't seem to outperform <em>Qwen3</em>.6 that much in <em>agent</em>ic <em>coding</em>/tasks to justify the price.<p>So another option is just to buy a bunch of RTX 6000s and scale horizontally instead: run a copy of a midrange LLM like <em>Qwen3</em>.6 on each GPU. It's cheaper and easier to scale/replace, but then we'll run into problems running larger models in the future if we have to, because of no NVLink support (say, if Alibaba &amp; Co. stop releasing ~30b models and/or ~30b models start falling behind 400b+ models considerably)<p>Does anyone here have experience running large models in a multi-GPU setup with several RTX 6000s in a high-concurrency regime and with large context lengths? (something like Deepseek 4 Flash, Minimax 2.7 etc.)"},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Was my $48K GPU server worth it?"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://rosmine.ai/2026/05/13/was-my-48k-gpu-worth-it/"}},"_tags":["comment","author_kgeist","story_48184402"],"author":"kgeist","children":[48231400,48229996,48232353,48230057,48232800,48229771,48234293,48231198,48231450,48239245,48234137],"comment_text":"I administer a simple AI server in the office, which just uses a single RTX 5090 but is able to serve ~80 people throughout the day. I&#x27;m impressed by Qwen3.6-27b&#x27;s capabilities in agentic coding&#x2F;tasks so far. Devs say it&#x27;s not much different from Sonnet 4.6 on many tasks (sometimes it even outperformed it), 40-60 tok&#x2F;sec, up to 260k context. 
The server cost about $10k with all the bells and whistles.<p>I spent a lot of time researching&#x2F;adding&#x2F;benchmarking many custom modifications to the software stack and its settings to make the server optimally handle the load with just 1 RTX 5090 without losing quality, but it&#x27;s still not enough, and the wait times in the queue are getting longer. We&#x27;re at the limits of the hardware, and I&#x27;m out of tricks.<p>The experiment was kind of a success, and the CTO agrees we should scale it. With our own infra, we could run agents 24&#x2F;7 on everything. Currently, a lot of use cases for the cloud providers are completely blocked by PII&#x2F;trade secret concerns (our infosec department doesn&#x27;t buy the &quot;zero retention&quot; promise), plus you don&#x27;t have to think about billing&#x2F;budgets&#x2F;etc. anymore.<p>Now I can&#x27;t decide how to scale it. On one hand, I&#x27;d like to run larger models. And we have the budget to buy, say, 8xH200. But in many benchmarks, the larger models that do fit in 8xH200 comfortably and can serve many parallel requests with acceptable speed&#x2F;quality don&#x27;t seem to outperform Qwen3.6 that much in agentic coding&#x2F;tasks to justify the price.<p>So another option is just to buy a bunch of RTX 6000s and scale horizontally instead: run a copy of a midrange LLM like Qwen3.6 on each GPU. It&#x27;s cheaper and easier to scale&#x2F;replace, but then we&#x27;ll run into problems running larger models in the future if we have to, because of no NVLink support (say, if Alibaba &amp; Co. stop releasing ~30b models and&#x2F;or ~30b models start falling behind 400b+ models considerably)<p>Does anyone here have experience running large models in a multi-GPU setup with several RTX 6000s in a high-concurrency regime and with large context lengths? 
(something like Deepseek 4 Flash, Minimax 2.7 etc.)","created_at":"2026-05-21T22:10:41Z","created_at_i":1779401441,"objectID":"48229447","parent_id":48184402,"story_id":48184402,"story_title":"Was my $48K GPU server worth it?","story_url":"https://rosmine.ai/2026/05/13/was-my-48k-gpu-worth-it/","updated_at":"2026-06-01T08:33:17Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"girvo"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"Right and all of my own evals back this up for Gemma 4...<p>...except its notably worse at <em>coding</em> in an <em>agent</em> context even with a harness setup to do exactly what Google says it should do (wrt. to sending summarised thinking back and so on)<p>So despite it being far better token efficiency wise, it's just worse for what I need to use it for compared to DSv4 Flash or <em>Qwen</em> 3.6 27B<p>Such a shame, too."},"story_title":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["agent"],"value":"Qwen3.7-Max: The <em>Agent</em> Frontier"},"story_url":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["qwen"],"value":"https://<em>qwen</em>.ai/blog?id=qwen3.7"}},"_tags":["comment","author_girvo","story_48205626"],"author":"girvo","comment_text":"Right and all of my own evals back this up for Gemma 4...<p>...except its notably worse at coding in an agent context even with a harness setup to do exactly what Google says it should do (wrt. 
to sending summarised thinking back and so on)<p>So despite it being far better token efficiency wise, it&#x27;s just worse for what I need to use it for compared to DSv4 Flash or Qwen 3.6 27B<p>Such a shame, too.","created_at":"2026-05-21T21:47:35Z","created_at_i":1779400055,"objectID":"48229237","parent_id":48216325,"story_id":48205626,"story_title":"Qwen3.7-Max: The Agent Frontier","story_url":"https://qwen.ai/blog?id=qwen3.7","updated_at":"2026-05-21T21:49:52Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"baigy"},"story_text":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["qwen","coding"],"value":"There's extreme price escalation on part of Anthropic, with token spend now approaching levels that have made many-an-enterprise scratch their heads.<p>At the same time, judging by opensource advances (E.g. <em>Qwen</em> 3.6 27B), hosting a smart enough local LLM on 16GB VRAM (or equivalent) is increasingly becoming a reality. Lastly, I see most <em>coding</em> to be of intermediate difficulty, not beyond.<p>Seems to me it's a matter of time that people shift to free Claude Code type experiences, powered by local LLMs.<p>What do you think?"},"title":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["coding","agent"],"value":"Ask HN: Is the next big thing locally running <em>coding</em> <em>agents</em>?"}},"_tags":["story","author_baigy","story_48223375","ask_hn"],"author":"baigy","children":[48223731,48225567,48223564,48223628,48224924,48308736,48224846,48225563],"created_at":"2026-05-21T14:35:32Z","created_at_i":1779374132,"num_comments":13,"objectID":"48223375","points":2,"story_id":48223375,"story_text":"There&#x27;s extreme price escalation on part of Anthropic, with token spend now approaching levels that have made many-an-enterprise scratch their heads.<p>At the same time, judging by opensource advances (E.g. 
Qwen 3.6 27B), hosting a smart enough local LLM on 16GB VRAM (or equivalent) is increasingly becoming a reality. Lastly, I see most coding to be of intermediate difficulty, not beyond.<p>Seems to me it&#x27;s a matter of time that people shift to free Claude Code type experiences, powered by local LLMs.<p>What do you think?","title":"Ask HN: Is the next big thing locally running coding agents?","updated_at":"2026-05-29T07:49:53Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"nullbio"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"Taalas: <a href=\"https://taalas.com/products/\" rel=\"nofollow\">https://taalas.com/products/</a><p>They've made a hardware LLM that reaches over 14k TPS, and you can try it here: <a href=\"https://chatjimmy.ai/\" rel=\"nofollow\">https://chatjimmy.ai/</a><p>It seems most people are not aware of this, and so I think it's important for people to realize what is <em>coming</em>. It also kind of feels like the industry doesn't want people to know about this, because barely anyone talks about it. If they can make these at scale, then the cost of tokens should drop dramatically for the providers. The question is if they pass those savings on.<p>Imagine the latest <em>Qwen</em> 3.7 running at 14k TPS in <em>agent</em>ic loops... 
Even if the model doesn't get things right, being able to iterate that quickly on &quot;generate code -&gt; unit test&quot; will be absurd."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Hardware LLM Taalas Reaches >14,000 TPS on Llama 3.1 8B"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://taalas.com/products/"}},"_tags":["comment","author_nullbio","story_48219914"],"author":"nullbio","comment_text":"Taalas: <a href=\"https:&#x2F;&#x2F;taalas.com&#x2F;products&#x2F;\" rel=\"nofollow\">https:&#x2F;&#x2F;taalas.com&#x2F;products&#x2F;</a><p>They&#x27;ve made a hardware LLM that reaches over 14k TPS, and you can try it here: <a href=\"https:&#x2F;&#x2F;chatjimmy.ai&#x2F;\" rel=\"nofollow\">https:&#x2F;&#x2F;chatjimmy.ai&#x2F;</a><p>It seems most people are not aware of this, and so I think it&#x27;s important for people to realize what is coming. It also kind of feels like the industry doesn&#x27;t want people to know about this, because barely anyone talks about it. If they can make these at scale, then the cost of tokens should drop dramatically for the providers. The question is if they pass those savings on.<p>Imagine the latest Qwen 3.7 running at 14k TPS in agentic loops... Even if the model doesn&#x27;t get things right, being able to iterate that quickly on &quot;generate code -&gt; unit test&quot; will be absurd.","created_at":"2026-05-21T09:21:05Z","created_at_i":1779355265,"objectID":"48219915","parent_id":48219914,"story_id":48219914,"story_title":"Hardware LLM Taalas Reaches >14,000 TPS on Llama 3.1 8B","story_url":"https://taalas.com/products/","updated_at":"2026-05-21T09:25:50Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"gcr"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"here's a simple setup to get you started on an Apple M1 Max from 2021 with 32GB VRAM. 
it will download 20GB of models to `~/.cache/huggingface/hub`, which you can delete when you're done.<p><pre><code>  /Users/gcr/llama.cpp/build/bin/llama-server\n      -hf unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M\n      --no-mmproj-offload\n      --fit on\n      -c 65536 # edit to taste\n      --reasoning on --chat-template-kwargs '{&quot;preserve_thinking&quot;: true}'\n      --sleep-idle-seconds 90 # very aggressive: purge model from vram after this long\n      -ctk q8_0 -ctv q8_0 # Optional. Lower memory use, but lower speed. Omit if you can.\n</code></pre>\nI don't recommend ollama or lm-studio. Ollama's in the process of switching from their llama-cpp backend anyway, but their new go framework frequently OOMs and crashes on my hardware. I also don't recommend MLX-based inference backends on this hardware; I've found them to consistently reduce performance, contrary to what I've read online. I've tried all the llama-cpp metal forks, but right now, MTP, TurboQuant, MLX, etc etc etc are too new and just slow things down. It's all dust in the wind still.<p>For <em>agent</em> harnesses, opencode is okay, as is pi or even Zed's built in <em>agent</em> panel. Claude code &quot;works&quot; with ANTHROPIC_BASE_URL=http://localhost:8080/v1, but is very chatty (the default system prompt burns 20k tokens). Crush (from the charm-bracelet folks) is particularly nice when starting out. I've personally converged on pi-<em>agent</em> under an otherwise-mostly-default setup. You can ask <em>qwen</em> to customize pi or write you an extension which helps a little.<p>You'll need to add `http://localhost:8080/v1` as an OpenAI-compatible model provider in your <em>coding</em> harness with any API key (doesn't matter) and any model identifier (doesn't matter with llama-cpp).<p>Note that pi doesn't have permissions. Everything is permitted. The hundred hungry ghosts you've trapped in a jar WILL find a way to delete your home folder someday. 
That's what Man gets for summoning demons without casting a circle of protection first. Flying too close to the sun etc etc etc<p>Take backups and then go have fun. Hope this helps."},"story_title":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["agent"],"value":"Qwen3.7-Max: The <em>Agent</em> Frontier"},"story_url":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["qwen"],"value":"https://<em>qwen</em>.ai/blog?id=qwen3.7"}},"_tags":["comment","author_gcr","story_48205626"],"author":"gcr","children":[48217053,48215615,48216636],"comment_text":"here&#x27;s a simple setup to get you started on an Apple M1 Max from 2021 with 32GB VRAM. it will download 20GB of models to `~&#x2F;.cache&#x2F;huggingface&#x2F;hub`, which you can delete when you&#x27;re done.<p><pre><code>  &#x2F;Users&#x2F;gcr&#x2F;llama.cpp&#x2F;build&#x2F;bin&#x2F;llama-server\n      -hf unsloth&#x2F;Qwen3.6-35B-A3B-GGUF:Q4_K_M\n      --no-mmproj-offload\n      --fit on\n      -c 65536 # edit to taste\n      --reasoning on --chat-template-kwargs &#x27;{&quot;preserve_thinking&quot;: true}&#x27;\n      --sleep-idle-seconds 90 # very aggressive: purge model from vram after this long\n      -ctk q8_0 -ctv q8_0 # Optional. Lower memory use, but lower speed. Omit if you can.\n</code></pre>\nI don&#x27;t recommend ollama or lm-studio. Ollama&#x27;s in the process of switching from their llama-cpp backend anyway, but their new go framework frequently OOMs and crashes on my hardware. I also don&#x27;t recommend MLX-based inference backends on this hardware; I&#x27;ve found them to consistently reduce performance, contrary to what I&#x27;ve read online. I&#x27;ve tried all the llama-cpp metal forks, but right now, MTP, TurboQuant, MLX, etc etc etc are too new and just slow things down. It&#x27;s all dust in the wind still.<p>For agent harnesses, opencode is okay, as is pi or even Zed&#x27;s built in agent panel. 
Claude code &quot;works&quot; with ANTHROPIC_BASE_URL=http:&#x2F;&#x2F;localhost:8080&#x2F;v1, but is very chatty (the default system prompt burns 20k tokens). Crush (from the charm-bracelet folks) is particularly nice when starting out. I&#x27;ve personally converged on pi-agent under an otherwise-mostly-default setup. You can ask qwen to customize pi or write you an extension which helps a little.<p>You&#x27;ll need to add `http:&#x2F;&#x2F;localhost:8080&#x2F;v1` as an OpenAI-compatible model provider in your coding harness with any API key (doesn&#x27;t matter) and any model identifier (doesn&#x27;t matter with llama-cpp).<p>Note that pi doesn&#x27;t have permissions. Everything is permitted. The hundred hungry ghosts you&#x27;ve trapped in a jar WILL find a way to delete your home folder someday. That&#x27;s what Man gets for summoning demons without casting a circle of protection first. Flying too close to the sun etc etc etc<p>Take backups and then go have fun. Hope this helps.","created_at":"2026-05-20T17:19:31Z","created_at_i":1779297571,"objectID":"48211003","parent_id":48209432,"story_id":48205626,"story_title":"Qwen3.7-Max: The Agent Frontier","story_url":"https://qwen.ai/blog?id=qwen3.7","updated_at":"2026-05-23T10:51:12Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"ramses0"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"I'd held off from buying a new personal laptop for quite a few years and felt that the M5-128gb was justifiable once I started really seeing payoffs from using AI at work.<p>Running w/ Cursor and doing some &quot;nights and weekends&quot; type <em>coding</em> / conversations, I was hitting $100-200 of usage within a few weeks.  I know there's probably better ways to manage costs, but I was getting enough value out of it to keep bumping my spend limit from $20 =&gt; $40 =&gt; $80 =&gt; $120 (and then I stopped spending! 
:-)<p>Messing around with local-llm, I've settled on `omlx` and `gemma` for &quot;conversational&quot;, and I think it's `<em>qwen</em>-120b-a3b-6bit` or something for the &quot;heavy hitter&quot;.  Gemma &quot;gets it&quot; a lot more, whereas that particular `<em>qwen</em>` tends to fall into the &quot;MuSt WrItE CoOooDeee!&quot; behaviour in a lot of cases instead of holding a conversation, and does an awesome job of randomly spitting out ascii-art diagrams or including full-blown bash shell scripts to illustrate different cases.<p>My POV is: &quot;Local for slightly slower/casual usage&quot;, the ~1% of battery usage per minute of LLM is shockingly accurate (eg: 30 minutes == 30% drop!).  &quot;Gemma for discussion and emitting DESIGN-... docs&quot;, and &quot;<em>Qwen</em> for converting DESIGN-... to PLAN-...&quot;, (as well as implementation, but generally from a fresh context loading the relevant PLAN-... or supporting docs)<p>...then supplement that with direct Cursor usage in case I screw up some setting on being able to get the local LLM working, or if I need to include literal web-research or really having access to some SOTA model. Using the pi-coder harness locally, web pages are kindof a difficult conundrum as they can be kindof gigantic and are really worthy of special casing, some sort of sub-harness, etc... but the more &quot;stuff&quot; you put into the <em>agent</em>, the less context window (and memory!) you have available, so it's a real balancing act.<p>The other biggest problem is that you're limited (locally) to ~20-80tps and in some cases you have to chew on or &quot;swallow&quot; the whole prompt up to that point if you end up with some sort of cache miss (TTFT).  
The `omlx` server does a pretty good job (after you tweak some settings and stuff) of allowing MANY prompt continuations to nearly immediately start generated tokens, but sometimes if I have two agents going (eg: Gemma talking shit about <em>Qwen</em>'s output or vice versa) in a longer context window, then you'll take that hit.<p>&quot;Other people's compute&quot; is definitely more freeing, but even looking at $200/mo usage that's $2400 vs. the ~$6k for a maxed out MBP.  Call it $2500 vs. $7500 and you'd say that &quot;local AI gives you a 3-year amortization window for a slower, worse experience&quot; ... but if you're strategic about your usage, the ability to &quot;talk for free&quot; and occasionally &quot;burst&quot; to an online provider or having some hugging-face tokens to try out different models that you can't quite run locally is really nice.  Talking to the AI (locally) to even just do non-<em>coding</em> planning without worrying about data leakage or privacy issues is phenomenal, and you end up owning a really nice laptop!<p>In some ways, seeing the &quot;advantage&quot; of having the local 128gb capacity for LLM, I'm semi-wishing I'd have gotten a mac mini instead, but then I can't quite do the 100% offline stuff (eg: coffee-shop) that the maxed out laptop allows.<p>If it were a mini running locally, I'd feel more comfortable calling it the always-on &quot;AI brain&quot; to process my emails, run crontab summaries, whatever kindof &quot;open-claw-ish&quot; stuff that you could do w/o relying on having to &quot;keep the laptop lid open all the time&quot;.  I'm sure there's ways to repurpose things, but longer-term, call it even 3-5 years from now... 
any sort of 128gb machine will be more than capable where you'd want to have one &quot;doing stuff&quot; locally within your home network (IMHO)."},"story_title":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["agent"],"value":"Qwen3.7-Max: The <em>Agent</em> Frontier"},"story_url":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["qwen"],"value":"https://<em>qwen</em>.ai/blog?id=qwen3.7"}},"_tags":["comment","author_ramses0","story_48205626"],"author":"ramses0","children":[48210144],"comment_text":"I&#x27;d held off from buying a new personal laptop for quite a few years and felt that the M5-128gb was justifiable once I started really seeing payoffs from using AI at work.<p>Running w&#x2F; Cursor and doing some &quot;nights and weekends&quot; type coding &#x2F; conversations, I was hitting $100-200 of usage within a few weeks.  I know there&#x27;s probably better ways to manage costs, but I was getting enough value out of it to keep bumping my spend limit from $20 =&gt; $40 =&gt; $80 =&gt; $120 (and then I stopped spending! :-)<p>Messing around with local-llm, I&#x27;ve settled on `omlx` and `gemma` for &quot;conversational&quot;, and I think it&#x27;s `qwen-120b-a3b-6bit` or something for the &quot;heavy hitter&quot;.  Gemma &quot;gets it&quot; a lot more, whereas that particular `qwen` tends to fall into the &quot;MuSt WrItE CoOooDeee!&quot; behaviour in a lot of cases instead of holding a conversation, and does an awesome job of randomly spitting out ascii-art diagrams or including full-blown bash shell scripts to illustrate different cases.<p>My POV is: &quot;Local for slightly slower&#x2F;casual usage&quot;, the ~1% of battery usage per minute of LLM is shockingly accurate (eg: 30 minutes == 30% drop!).  &quot;Gemma for discussion and emitting DESIGN-... docs&quot;, and &quot;Qwen for converting DESIGN-... to PLAN-...&quot;, (as well as implementation, but generally from a fresh context loading the relevant PLAN-... 
or supporting docs)<p>...then supplement that with direct Cursor usage in case I screw up some setting on being able to get the local LLM working, or if I need to include literal web-research or really having access to some SOTA model. Using the pi-coder harness locally, web pages are kindof a difficult conundrum as they can be kindof gigantic and are really worthy of special casing, some sort of sub-harness, etc... but the more &quot;stuff&quot; you put into the agent, the less context window (and memory!) you have available, so it&#x27;s a real balancing act.<p>The other biggest problem is that you&#x27;re limited (locally) to ~20-80tps and in some cases you have to chew on or &quot;swallow&quot; the whole prompt up to that point if you end up with some sort of cache miss (TTFT).  The `omlx` server does a pretty good job (after you tweak some settings and stuff) of allowing MANY prompt continuations to nearly immediately start generated tokens, but sometimes if I have two agents going (eg: Gemma talking shit about Qwen&#x27;s output or vice versa) in a longer context window, then you&#x27;ll take that hit.<p>&quot;Other people&#x27;s compute&quot; is definitely more freeing, but even looking at $200&#x2F;mo usage that&#x27;s $2400 vs. the ~$6k for a maxed out MBP.  Call it $2500 vs. $7500 and you&#x27;d say that &quot;local AI gives you a 3-year amortization window for a slower, worse experience&quot; ... but if you&#x27;re strategic about your usage, the ability to &quot;talk for free&quot; and occasionally &quot;burst&quot; to an online provider or having some hugging-face tokens to try out different models that you can&#x27;t quite run locally is really nice.  
Talking to the AI (locally) to even just do non-coding planning without worrying about data leakage or privacy issues is phenomenal, and you end up owning a really nice laptop!<p>In some ways, seeing the &quot;advantage&quot; of having the local 128gb capacity for LLM, I&#x27;m semi-wishing I&#x27;d have gotten a mac mini instead, but then I can&#x27;t quite do the 100% offline stuff (eg: coffee-shop) that the maxed out laptop allows.<p>If it were a mini running locally, I&#x27;d feel more comfortable calling it the always-on &quot;AI brain&quot; to process my emails, run crontab summaries, whatever kindof &quot;open-claw-ish&quot; stuff that you could do w&#x2F;o relying on having to &quot;keep the laptop lid open all the time&quot;.  I&#x27;m sure there&#x27;s ways to repurpose things, but longer-term, call it even 3-5 years from now... any sort of 128gb machine will be more than capable where you&#x27;d want to have one &quot;doing stuff&quot; locally within your home network (IMHO).","created_at":"2026-05-20T15:39:29Z","created_at_i":1779291569,"objectID":"48209553","parent_id":48207086,"story_id":48205626,"story_title":"Qwen3.7-Max: The Agent Frontier","story_url":"https://qwen.ai/blog?id=qwen3.7","updated_at":"2026-05-21T05:18:49Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"lostmsu"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"<em>Qwen</em> recommends to preserve_thinking: true for <em>agent</em>ic/<em>coding</em> workloads."},"story_title":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["agent"],"value":"Qwen3.7-Max: The <em>Agent</em> Frontier"},"story_url":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["qwen"],"value":"https://<em>qwen</em>.ai/blog?id=qwen3.7"}},"_tags":["comment","author_lostmsu","story_48205626"],"author":"lostmsu","children":[48211167],"comment_text":"Qwen recommends to preserve_thinking: true for 
agentic&#x2F;coding workloads.","created_at":"2026-05-20T15:28:26Z","created_at_i":1779290906,"objectID":"48209357","parent_id":48207263,"story_id":48205626,"story_title":"Qwen3.7-Max: The Agent Frontier","story_url":"https://qwen.ai/blog?id=qwen3.7","updated_at":"2026-05-22T16:55:10Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"zambelli"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"I believe there's a comment below mentioning &quot;<em>qwen</em>&quot; but not a specific version number - if you're looking for 3rd party validation. I've personally tried qwen3.6-35b-a3b, qwen3.5-35b-a3b, and qwen3.5-27b with forge (<em>agent</em>ic <em>coding</em> harness built on forge workflowrunner) and it works great. Official forge eval benchmarks for that class of models is still a couple of weeks out.<p>Proxy mode should work fine with remote models, the only constraint is the compatible endpoint - which is standard anyways. I don't think you'd have any issue hitting either a remote gateway like liteLLM or just claude API."},"story_title":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["agent"],"value":"Show HN: Forge \u2013 Guardrails take an 8B model from 53% to 99% on <em>agent</em>ic tasks"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://github.com/antoinezambelli/forge"}},"_tags":["comment","author_zambelli","story_48192383"],"author":"zambelli","children":[48212522],"comment_text":"I believe there&#x27;s a comment below mentioning &quot;qwen&quot; but not a specific version number - if you&#x27;re looking for 3rd party validation. I&#x27;ve personally tried qwen3.6-35b-a3b, qwen3.5-35b-a3b, and qwen3.5-27b with forge (agentic coding harness built on forge workflowrunner) and it works great. 
Official forge eval benchmarks for that class of models is still a couple of weeks out.<p>Proxy mode should work fine with remote models, the only constraint is the compatible endpoint - which is standard anyways. I don&#x27;t think you&#x27;d have any issue hitting either a remote gateway like liteLLM or just claude API.","created_at":"2026-05-20T05:26:39Z","created_at_i":1779254799,"objectID":"48203443","parent_id":48203298,"story_id":48192383,"story_title":"Show HN: Forge \u2013 Guardrails take an 8B model from 53% to 99% on agentic tasks","story_url":"https://github.com/antoinezambelli/forge","updated_at":"2026-05-20T19:09:19Z"}],"hitsPerPage":20,"nbHits":301,"nbPages":16,"page":0,"params":"query=qwen+coding+agent&advancedSyntax=true&analyticsTags=backend","processingTimeMS":10,"processingTimingsMS":{"_request":{"roundTrip":17},"afterFetch":{"format":{"highlighting":1,"total":2}},"fetch":{"query":6,"scanning":2,"total":9},"total":10},"query":"qwen coding agent","serverTimeMS":13}
