{"exhaustive":{"nbHits":false,"typo":false},"exhaustiveNbHits":false,"exhaustiveTypo":false,"hits":[{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"tristanj"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"Those links actually weaken your argument.<p>~3000 instances over the entirety of GitHub is not &quot;extremely frequent&quot; at all when you consider massive scale of the pretraining corpus. These models are trained on all text on the accessible internet plus millions of books, on trillions of words overall. ~3000 instances isn't even a rounding error.<p>Also, Google Trends is the wrong tool for this case. The link you replied with measures the <i>number of Google searches</i> for a specific query, which is completely irrelevant to how often a string actually appears on the web.<p>I looked at those links. They show random GitHub code samples which contain this model identifier. Even if K3 did train on those GitHub code samples, those are random code strings and K3 isn\u2019t just reciting random pieces of code. When prefilled with &quot;I am Claude&quot;, K3 answers with Anthropic's <i>exact backend API identifiers</i>, instead of the human conversational-style name &quot;Opus 4.5&quot;.<p>If this was a result of scraping the open web and GitHub, other frontier models trained on GitHub data (like <em>Qwen</em>, Claude, GPT) would show a similar API model name when prompted. But they don't. Only K3 does this behavior, and it prefers to answer with the exact API tags like &quot;claude-opus-4-5-20251101&quot; and &quot;claude-sonnet-4-20250514&quot;.<p>A LLM doesn't suddenly output a rare API model identifier when prompted with &quot;I am Claude&quot;, just because it saw it in a .py file. No, it only does this because that identifier must have appeared rather frequently in the training data. Which is what happens when you distill Claude models, and don't properly clean your distillation data.<p>Plus, this isn't just spurious speculation that Moonshot is distilling Claude. Anthropic caught Moonshot as running &quot;industrial-scale&quot; distillation campaign from Claude: <a href=\"https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks\" rel=\"nofollow\">https://www.anthropic.com/news/detecting-and-preventing-dist...</a><p>To quote:<p><i>Moonshot (Kimi models) employed hundreds of fraudulent accounts spanning multiple access pathways. Varied account types made the campaign harder to detect as a coordinated operation. We attributed the campaign through request metadata, which matched the public profiles of senior Moonshot staff. In a later phase, Moonshot used a more targeted approach, attempting to extract and reconstruct Claude\u2019s reasoning traces.</i><p><i>The operation targeted:</i><p>* <i><em>Agent</em>ic reasoning and tool use</i><p>* <i><em>Coding</em> and data analysis</i><p>* <i>Computer-use <em>agent</em> development</i><p>* <i>Computer vision</i>"},"story_title":{"matchLevel":"none","matchedWords":[],"value":"The Kimi K3 Moment"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://stephen.bochinski.dev/blog/2026/07/18/the-kimi-k3-moment/"}},"_tags":["comment","author_tristanj","story_48960218"],"author":"tristanj","comment_text":"Those links actually weaken your argument.<p>~3000 instances over the entirety of GitHub is not &quot;extremely frequent&quot; at all when you consider massive scale of the pretraining corpus. These models are trained on all text on the accessible internet plus millions of books, on trillions of words overall. ~3000 instances isn&#x27;t even a rounding error.<p>Also, Google Trends is the wrong tool for this case. The link you replied with measures the <i>number of Google searches</i> for a specific query, which is completely irrelevant to how often a string actually appears on the web.<p>I looked at those links. They show random GitHub code samples which contain this model identifier. Even if K3 did train on those GitHub code samples, those are random code strings and K3 isn\u2019t just reciting random pieces of code. When prefilled with &quot;I am Claude&quot;, K3 answers with Anthropic&#x27;s <i>exact backend API identifiers</i>, instead of the human conversational-style name &quot;Opus 4.5&quot;.<p>If this was a result of scraping the open web and GitHub, other frontier models trained on GitHub data (like Qwen, Claude, GPT) would show a similar API model name when prompted. But they don&#x27;t. Only K3 does this behavior, and it prefers to answer with the exact API tags like &quot;claude-opus-4-5-20251101&quot; and &quot;claude-sonnet-4-20250514&quot;.<p>A LLM doesn&#x27;t suddenly output a rare API model identifier when prompted with &quot;I am Claude&quot;, just because it saw it in a .py file. No, it only does this because that identifier must have appeared rather frequently in the training data. Which is what happens when you distill Claude models, and don&#x27;t properly clean your distillation data.<p>Plus, this isn&#x27;t just spurious speculation that Moonshot is distilling Claude. Anthropic caught Moonshot as running &quot;industrial-scale&quot; distillation campaign from Claude: <a href=\"https:&#x2F;&#x2F;www.anthropic.com&#x2F;news&#x2F;detecting-and-preventing-distillation-attacks\" rel=\"nofollow\">https:&#x2F;&#x2F;www.anthropic.com&#x2F;news&#x2F;detecting-and-preventing-dist...</a><p>To quote:<p><i>Moonshot (Kimi models) employed hundreds of fraudulent accounts spanning multiple access pathways. Varied account types made the campaign harder to detect as a coordinated operation. We attributed the campaign through request metadata, which matched the public profiles of senior Moonshot staff. In a later phase, Moonshot used a more targeted approach, attempting to extract and reconstruct Claude\u2019s reasoning traces.</i><p><i>The operation targeted:</i><p>* <i>Agentic reasoning and tool use</i><p>* <i>Coding and data analysis</i><p>* <i>Computer-use agent development</i><p>* <i>Computer vision</i>","created_at":"2026-07-20T13:48:17Z","created_at_i":1784555297,"objectID":"48978847","parent_id":48975426,"story_id":48960218,"story_title":"The Kimi K3 Moment","story_url":"https://stephen.bochinski.dev/blog/2026/07/18/the-kimi-k3-moment/","updated_at":"2026-07-20T13:51:54Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"jknoepfler"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"The capital required to train <em>Qwen</em> at scale is enormous. The capital required to patch a linux distro is near zero. Any scale model <em>coming</em> out of China should be viewed as advancing a geopolitical <em>agend</em>a. The same skepticism should be applied to any model trained in the states, but that should be viewed through the lens of short-term business profits.<p>There are earnest nerds everywhere, in every society. No doubt. But &quot;Chinese AI Labs&quot; operate at the whim of the Chinese government, in the same way &quot;American AI Labs&quot; operate at the whim of billionaire investors. Inferring good will from either is naive at best at this scale."},"story_title":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["qwen"],"value":"<em>Qwen</em> 3.8"},"story_url":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["qwen"],"value":"https://twitter.com/Alibaba_<em>Qwen</em>/status/2078759124914098291"}},"_tags":["comment","author_jknoepfler","story_48966120"],"author":"jknoepfler","comment_text":"The capital required to train Qwen at scale is enormous. The capital required to patch a linux distro is near zero. Any scale model coming out of China should be viewed as advancing a geopolitical agenda. The same skepticism should be applied to any model trained in the states, but that should be viewed through the lens of short-term business profits.<p>There are earnest nerds everywhere, in every society. No doubt. But &quot;Chinese AI Labs&quot; operate at the whim of the Chinese government, in the same way &quot;American AI Labs&quot; operate at the whim of billionaire investors. Inferring good will from either is naive at best at this scale.","created_at":"2026-07-20T13:45:20Z","created_at_i":1784555120,"objectID":"48978820","parent_id":48969287,"story_id":48966120,"story_title":"Qwen 3.8","story_url":"https://twitter.com/Alibaba_Qwen/status/2078759124914098291","updated_at":"2026-07-20T13:46:39Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"htrp"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"Don't use Opencode, Don't use remote models (cloud providers), Don't use Docker to isolate <em>coding</em> <em>agents</em>.<p>May as well write a post saying don't use LLM's for any SWE work.<p>&gt;Conclusion\nStop using OpenCode.<p>&gt;Post-script: Local LLMs\nThis is worth its own post \u2013 I have multiple attempts in my blog drafts \u2013 but it needs to be addressed briefly here. My opinion on local LLMs like <em>Qwen3</em>.6-27B is they are corrosive to the stability and conceptual fidelity of your codebase in the same way as frontier models, with the following three differences:<p>&gt;You avoid the uncanny valley where the model appears to be intelligent before doing something stupid; the stupidity is self-evident and this helps calibrate your interactions.<p>&gt;The weight count is too low to reproduce the training set verbatim, which nudges the calculus on whether the output should be considered tainted. This is distinct from larger models which can reproduce inputs verbatim, but are trained to refuse to.<p>&gt;You avoid supporting or relying upon cloud providers.<p>&gt;I\u2019ve had useful results from input-oriented tasks like: \u201cI think there is a bug in code x with symptoms y, my guess on the mechanism is z. Read all relevant code, come back with a call chain and code citations.\u201d Framing it as a search problem reins in the clanker\u2019s propensity to make shit up.<p>&gt;Using LLMs for code generation feels like a dead end. However thoroughly you think you understand your architecture, your planning is constantly undone by shortcuts like \u201cwhat if I just move this mutable state into the middle of the design so everyone can share it?\u201d This is hostile to your ability to understand your code, beyond the fact that you didn\u2019t write it.<p>&gt;Drawing answers directly from knowledge in model weights leads to hallucination even for multi-trillion-parameter models, so why bother making them that big? If people were realistic about limitations then we wouldn\u2019t be building new power stations for datacenters, and they wouldn\u2019t be rammed into every product.<p>&gt;The entire software ecosystem around LLMs is completely rotten, and if they do ever become \u201cjust a tool\u201d then some actual systems engineering needs to be done around them to turn them into tools instead of security black holes. That work will have to be done by humans."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Annoying and alarming things about OpenCode"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://wren.wtf/shower-thoughts/stop-using-opencode/"}},"_tags":["comment","author_htrp","story_48978112"],"author":"htrp","children":[48979013],"comment_text":"Don&#x27;t use Opencode, Don&#x27;t use remote models (cloud providers), Don&#x27;t use Docker to isolate coding agents.<p>May as well write a post saying don&#x27;t use LLM&#x27;s for any SWE work.<p>&gt;Conclusion\nStop using OpenCode.<p>&gt;Post-script: Local LLMs\nThis is worth its own post \u2013 I have multiple attempts in my blog drafts \u2013 but it needs to be addressed briefly here. My opinion on local LLMs like Qwen3.6-27B is they are corrosive to the stability and conceptual fidelity of your codebase in the same way as frontier models, with the following three differences:<p>&gt;You avoid the uncanny valley where the model appears to be intelligent before doing something stupid; the stupidity is self-evident and this helps calibrate your interactions.<p>&gt;The weight count is too low to reproduce the training set verbatim, which nudges the calculus on whether the output should be considered tainted. This is distinct from larger models which can reproduce inputs verbatim, but are trained to refuse to.<p>&gt;You avoid supporting or relying upon cloud providers.<p>&gt;I\u2019ve had useful results from input-oriented tasks like: \u201cI think there is a bug in code x with symptoms y, my guess on the mechanism is z. Read all relevant code, come back with a call chain and code citations.\u201d Framing it as a search problem reins in the clanker\u2019s propensity to make shit up.<p>&gt;Using LLMs for code generation feels like a dead end. However thoroughly you think you understand your architecture, your planning is constantly undone by shortcuts like \u201cwhat if I just move this mutable state into the middle of the design so everyone can share it?\u201d This is hostile to your ability to understand your code, beyond the fact that you didn\u2019t write it.<p>&gt;Drawing answers directly from knowledge in model weights leads to hallucination even for multi-trillion-parameter models, so why bother making them that big? If people were realistic about limitations then we wouldn\u2019t be building new power stations for datacenters, and they wouldn\u2019t be rammed into every product.<p>&gt;The entire software ecosystem around LLMs is completely rotten, and if they do ever become \u201cjust a tool\u201d then some actual systems engineering needs to be done around them to turn them into tools instead of security black holes. That work will have to be done by humans.","created_at":"2026-07-20T13:21:42Z","created_at_i":1784553702,"objectID":"48978519","parent_id":48978112,"story_id":48978112,"story_title":"Annoying and alarming things about OpenCode","story_url":"https://wren.wtf/shower-thoughts/stop-using-opencode/","updated_at":"2026-07-20T17:50:25Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"vardalab"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"One thing I find that's missing a lot or at least I haven't come across other than commercial offerings like AquaVoice is a decent injected technical vocabulary so that the initial transcript requires minimum cleanup afterwards. Because I mostly use these tools to essentially ramble at the command line with <em>coding</em> <em>agents</em>. So there's a lot of technical terms that don't translate well. Like OpenBao comes out as open bowel sometimes,lol. That necessitates significant cleanup prompt or background text available to the cleanup llm, usually in the form of screenshot or something that gets converted to text but that in turn requires good hw for speed to be almost imperceptible. For example m5 max turns cleanup into a noticeable delay while 5090 is decent.<p>Only way I have found that's relatively easy to inject technical vocab is to use whisper, but limited, I think to about 220 or so tokens. Whisper has sort of like a priming prompt where one can put in a bunch of technical words and it will try to recognize those. But again, that's limited to small number tokens. And that limits one use a relatively slow, by today's standards, whisper.cpp.<p>I benchmarked it across a bunch of different hardware that I have available, and Whisper gives decent performance as far as speed goes only on a pretty top-end GPU, such as a 5090 or 4070, like for example on Strix Halo, it's still relatively slow for longer transcriptions because I prefer just a stream of consciousness ramblings for minutes and then that being transcribed and cleaned up versus short sentences. So in that scenario something like 5090 really is good because the cleanup prompt runs fast using usually <em>Qwen</em> 3.6 MOE model. Whisper on 4070 itself is about 0.7 seconds for two or three minute transcription. So the total wait time for a three-minute transcription is roughly a second, or a little bit more than a second, so totally acceptable. But it does take decent hardware, and it grows to be double that on if running totally local. Well, in my case, it's all local, but it's my own hardware all over the place, but truly running on laptop, it's much faster using Parakeet, but then the cleanup is the bottleneck.<p>Anyway, it's just my experience messing around with this for the last year. I did start using AquaVoice, but their speed was exceptional, and tech vocab was exceptional, but they would have some annoying delays occasionally, and I didn't like paying the money and sending sensitive topics and screenshots into the cloud, and I had hardware, so my local solution is basically almost as good as commercial one. But I think they train their own model. So what I'm doing is I collect all the samples of my transcriptions, and I am slowly building my own data set that hopefully at some point when I get energy I will find some way to fine tune something."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Transcribe.cpp"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://workshop.cjpais.com/projects/transcribe-cpp"}},"_tags":["comment","author_vardalab","story_48963879"],"author":"vardalab","children":[48971851],"comment_text":"One thing I find that&#x27;s missing a lot or at least I haven&#x27;t come across other than commercial offerings like AquaVoice is a decent injected technical vocabulary so that the initial transcript requires minimum cleanup afterwards. Because I mostly use these tools to essentially ramble at the command line with coding agents. So there&#x27;s a lot of technical terms that don&#x27;t translate well. Like OpenBao comes out as open bowel sometimes,lol. That necessitates significant cleanup prompt or background text available to the cleanup llm, usually in the form of screenshot or something that gets converted to text but that in turn requires good hw for speed to be almost imperceptible. For example m5 max turns cleanup into a noticeable delay while 5090 is decent.<p>Only way I have found that&#x27;s relatively easy to inject technical vocab is to use whisper, but limited, I think to about 220 or so tokens. Whisper has sort of like a priming prompt where one can put in a bunch of technical words and it will try to recognize those. But again, that&#x27;s limited to small number tokens. And that limits one use a relatively slow, by today&#x27;s standards, whisper.cpp.<p>I benchmarked it across a bunch of different hardware that I have available, and Whisper gives decent performance as far as speed goes only on a pretty top-end GPU, such as a 5090 or 4070, like for example on Strix Halo, it&#x27;s still relatively slow for longer transcriptions because I prefer just a stream of consciousness ramblings for minutes and then that being transcribed and cleaned up versus short sentences. So in that scenario something like 5090 really is good because the cleanup prompt runs fast using usually Qwen 3.6 MOE model. Whisper on 4070 itself is about 0.7 seconds for two or three minute transcription. So the total wait time for a three-minute transcription is roughly a second, or a little bit more than a second, so totally acceptable. But it does take decent hardware, and it grows to be double that on if running totally local. Well, in my case, it&#x27;s all local, but it&#x27;s my own hardware all over the place, but truly running on laptop, it&#x27;s much faster using Parakeet, but then the cleanup is the bottleneck.<p>Anyway, it&#x27;s just my experience messing around with this for the last year. I did start using AquaVoice, but their speed was exceptional, and tech vocab was exceptional, but they would have some annoying delays occasionally, and I didn&#x27;t like paying the money and sending sensitive topics and screenshots into the cloud, and I had hardware, so my local solution is basically almost as good as commercial one. But I think they train their own model. So what I&#x27;m doing is I collect all the samples of my transcriptions, and I am slowly building my own data set that hopefully at some point when I get energy I will find some way to fine tune something.","created_at":"2026-07-19T19:37:45Z","created_at_i":1784489865,"objectID":"48971073","parent_id":48963879,"story_id":48963879,"story_title":"Transcribe.cpp","story_url":"https://workshop.cjpais.com/projects/transcribe-cpp","updated_at":"2026-07-19T21:32:50Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"dylangrech92"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"Hey HN \u2014 I'm the main dev behind Chalie.<p>Why I built Chalie: When I first started developing it, the idea of <em>agents</em> was still very new (late 2025, early 2026). I wanted a clean web UI, not a terminal window, and I wanted to use it from anywhere in the world \u2014 not stuck on one machine.<p>Why did it take so long: I released the first alpha around mid-March but never really marketed anywhere, Chalie has actually pushed me towards making this post.<p>What makes it different: My thesis is simple \u2014 I don't want to juggle sessions, I want the ChatGPT experience but with some agency. I don't want to create another <em>coding</em> <em>agent</em>. I want something to discuss daily topics with: diet, family drama, job, ideas. Something that won't judge me but be a peer. That is why memory was critical from day one and is what the whole harness is designed around.<p>Is it done? It will never be done, but I've poured my heart and soul into making it as clean and efficient as possible, and I'm constantly coming up with new ways to pressure test and optimize it to work on smaller and smaller models so that it can be truly private.<p>Can it run totally local? Yes, kinda. For quality usage it's best paired with frontier models for now, but I've tested extensively with local models such as <em>qwen3</em>.6:27b and ornith:35b, and the results are satisfactory.<p>Is it a competitor to OpenClaw / Hermes / Claude / Codex / etc? If you consider an AI harness a competitor, sure \u2014 but at the same time, not really. Chalie is designed to be a peer, not an employee. If you want an <em>agent</em> to just do work, use any of the ones mentioned earlier (Chalie itself is developed with Claude). But if you want a partner, a listener, a peer \u2014 then Chalie is for you.<p>Thanks :)"},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Show HN: Chalie \u2013 AI peer not employee"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://github.com/chalie-ai/chalie"}},"_tags":["comment","author_dylangrech92","story_48971042"],"author":"dylangrech92","comment_text":"Hey HN \u2014 I&#x27;m the main dev behind Chalie.<p>Why I built Chalie: When I first started developing it, the idea of agents was still very new (late 2025, early 2026). I wanted a clean web UI, not a terminal window, and I wanted to use it from anywhere in the world \u2014 not stuck on one machine.<p>Why did it take so long: I released the first alpha around mid-March but never really marketed anywhere, Chalie has actually pushed me towards making this post.<p>What makes it different: My thesis is simple \u2014 I don&#x27;t want to juggle sessions, I want the ChatGPT experience but with some agency. I don&#x27;t want to create another coding agent. I want something to discuss daily topics with: diet, family drama, job, ideas. Something that won&#x27;t judge me but be a peer. That is why memory was critical from day one and is what the whole harness is designed around.<p>Is it done? It will never be done, but I&#x27;ve poured my heart and soul into making it as clean and efficient as possible, and I&#x27;m constantly coming up with new ways to pressure test and optimize it to work on smaller and smaller models so that it can be truly private.<p>Can it run totally local? Yes, kinda. For quality usage it&#x27;s best paired with frontier models for now, but I&#x27;ve tested extensively with local models such as qwen3.6:27b and ornith:35b, and the results are satisfactory.<p>Is it a competitor to OpenClaw &#x2F; Hermes &#x2F; Claude &#x2F; Codex &#x2F; etc? If you consider an AI harness a competitor, sure \u2014 but at the same time, not really. Chalie is designed to be a peer, not an employee. If you want an agent to just do work, use any of the ones mentioned earlier (Chalie itself is developed with Claude). But if you want a partner, a listener, a peer \u2014 then Chalie is for you.<p>Thanks :)","created_at":"2026-07-19T19:37:05Z","created_at_i":1784489825,"objectID":"48971068","parent_id":48971042,"story_id":48971042,"story_title":"Show HN: Chalie \u2013 AI peer not employee","story_url":"https://github.com/chalie-ai/chalie","updated_at":"2026-07-19T19:45:51Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"Capricorn2481"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"I'm not lamenting that they <em>aren't</em> close, I'm saying <em>Qwen</em> will frequently output code that isn't even syntactically correct, even when the syntax is simple. Which makes it unusable for <em>coding</em>."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Running Gemma 4 26B at 5 tokens/sec on a 13-year-old Xeon with no GPU"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://www.neomindlabs.com/2026/06/08/running-gemma-4-26b-at-5-tokens-sec-on-a-13-year-old-xeon-with-no-gpu/"}},"_tags":["comment","author_Capricorn2481","story_48922434"],"author":"Capricorn2481","children":[48926197,48926644],"comment_text":"I&#x27;m not lamenting that they aren&#x27;t close, I&#x27;m saying Qwen will frequently output code that isn&#x27;t even syntactically correct, even when the syntax is simple. Which makes it unusable for coding.","created_at":"2026-07-15T19:38:19Z","created_at_i":1784144299,"objectID":"48926023","parent_id":48924983,"story_id":48922434,"story_title":"Running Gemma 4 26B at 5 tokens/sec on a 13-year-old Xeon with no GPU","story_url":"https://www.neomindlabs.com/2026/06/08/running-gemma-4-26b-at-5-tokens-sec-on-a-13-year-old-xeon-with-no-gpu/","updated_at":"2026-07-16T11:15:53Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"sosodev"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"They\u2019re exaggerating or have a very simple way of using these models. The Gemma 4 series, even at 31B, is nowhere near the frontier. They\u2019re great models, but you will notice a huge difference for complex tasks.<p>The best local <em>agent</em>ic <em>coding</em> experience I\u2019ve had so far is <em>Qwen3</em>.6-27B with Pi."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Bonsai 27B: A 27B-Class model that runs on a phone"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://prismml.com/news/bonsai-27b"}},"_tags":["comment","author_sosodev","story_48910545"],"author":"sosodev","children":[48918757,48915534],"comment_text":"They\u2019re exaggerating or have a very simple way of using these models. The Gemma 4 series, even at 31B, is nowhere near the frontier. They\u2019re great models, but you will notice a huge difference for complex tasks.<p>The best local agentic coding experience I\u2019ve had so far is Qwen3.6-27B with Pi.","created_at":"2026-07-15T02:19:06Z","created_at_i":1784081946,"objectID":"48915492","parent_id":48915338,"story_id":48910545,"story_title":"Bonsai 27B: A 27B-Class model that runs on a phone","story_url":"https://prismml.com/news/bonsai-27b","updated_at":"2026-07-17T10:36:41Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"sosodev"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"<em>Qwen3</em>.6-27B is the best model in that range that I\u2019ve used for <em>agent</em>ic <em>coding</em> by far. I think it\u2019s kinda mid at everything else."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Bonsai 27B: A 27B-Class model that runs on a phone"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://prismml.com/news/bonsai-27b"}},"_tags":["comment","author_sosodev","story_48910545"],"author":"sosodev","children":[48917367],"comment_text":"Qwen3.6-27B is the best model in that range that I\u2019ve used for agentic coding by far. I think it\u2019s kinda mid at everything else.","created_at":"2026-07-15T02:16:28Z","created_at_i":1784081788,"objectID":"48915473","parent_id":48914739,"story_id":48910545,"story_title":"Bonsai 27B: A 27B-Class model that runs on a phone","story_url":"https://prismml.com/news/bonsai-27b","updated_at":"2026-07-15T07:23:06Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"rurban"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"GPT-OSS was by far the worst local model in 2026. I have the luxury of plenty of H100's to try them, and Gpt-oss was atrocious with a <em>coding</em> <em>agent</em>. Gemma4 and Deepseek 4 are the best for <em>coding</em>, <em>Qwen</em> for images."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Hy3"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://hy.tencent.com/research/hy3"}},"_tags":["comment","author_rurban","story_48847552"],"author":"rurban","comment_text":"GPT-OSS was by far the worst local model in 2026. I have the luxury of plenty of H100&#x27;s to try them, and Gpt-oss was atrocious with a coding agent. Gemma4 and Deepseek 4 are the best for coding, Qwen for images.","created_at":"2026-07-12T18:18:30Z","created_at_i":1783880310,"objectID":"48883200","parent_id":48849971,"story_id":48847552,"story_title":"Hy3","story_url":"https://hy.tencent.com/research/hy3","updated_at":"2026-07-12T18:21:55Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"bigyabai"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"Outperforms doing what? Inference is not a homogeneous workload, memory bandwidth correlates to decode speed and layer swapping but not necessarily inference speed overall.<p>The other half of that equation is latency, predicated on prefill performance which needs a powerful GPU and ideally ALU-level optimization to build larger KV caches quickly. Even the M5 gets smoked in this department, the M5 Max has a 50% longer TTFT on <em>Qwen</em>'s 27b dense model at only 16k of context, which is a pretty typical <i>starting</i> context to use for <em>agent</em>ic editing in normal apps like OpenCode/Claude Code: <a href=\"https://raw.githubusercontent.com/Osmantic/MMBT-Messy-Model-Bench-Tests/0adec7c5e9a91ef19b0f1385e9d9d2b589fee45c/hardware-tests/qwen3.6-q8-fleet-2026-05-17/aggregate/canonical-headline.json\" rel=\"nofollow\">https://raw.githubusercontent.com/Osmantic/MMBT-Messy-Model-...</a><p>For <em>agent</em>ic, 50-256k token on-device <em>coding</em> sessions, the Spark will be faster and consume less power running larger models. Without an external GPU (which Apple doesn't support), Apple Silicon will always be bottlenecked during prefill. Apple's failure to address this with their GPU architecture is a big reason why Apple Silicon viewed as a waste of time and money for professional datacenter deployment."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Apple Silicon Exec Explains Mac Mini AI Demand and On-Device Future"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://www.macrumors.com/2026/07/06/apple-silicon-exec-explains-mac-mini-ai-demand/"}},"_tags":["comment","author_bigyabai","story_48805598"],"author":"bigyabai","children":[48869561],"comment_text":"Outperforms doing what? Inference is not a homogeneous workload, memory bandwidth correlates to decode speed and layer swapping but not necessarily inference speed overall.<p>The other half of that equation is latency, predicated on prefill performance which needs a powerful GPU and ideally ALU-level optimization to build larger KV caches quickly. Even the M5 gets smoked in this department, the M5 Max has a 50% longer TTFT on Qwen&#x27;s 27b dense model at only 16k of context, which is a pretty typical <i>starting</i> context to use for agentic editing in normal apps like OpenCode&#x2F;Claude Code: <a href=\"https:&#x2F;&#x2F;raw.githubusercontent.com&#x2F;Osmantic&#x2F;MMBT-Messy-Model-Bench-Tests&#x2F;0adec7c5e9a91ef19b0f1385e9d9d2b589fee45c&#x2F;hardware-tests&#x2F;qwen3.6-q8-fleet-2026-05-17&#x2F;aggregate&#x2F;canonical-headline.json\" rel=\"nofollow\">https:&#x2F;&#x2F;raw.githubusercontent.com&#x2F;Osmantic&#x2F;MMBT-Messy-Model-...</a><p>For agentic, 50-256k token on-device coding sessions, the Spark will be faster and consume less power running larger models. Without an external GPU (which Apple doesn&#x27;t support), Apple Silicon will always be bottlenecked during prefill. Apple&#x27;s failure to address this with their GPU architecture is a big reason why Apple Silicon viewed as a waste of time and money for professional datacenter deployment.","created_at":"2026-07-10T19:57:27Z","created_at_i":1783713447,"objectID":"48864460","parent_id":48863996,"story_id":48805598,"story_title":"Apple Silicon Exec Explains Mac Mini AI Demand and On-Device Future","story_url":"https://www.macrumors.com/2026/07/06/apple-silicon-exec-explains-mac-mini-ai-demand/","updated_at":"2026-07-11T07:13:37Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"girvo"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"Having heavily evaluated both antirez\u2019s ds4 flash and <em>Qwen</em> 3.6 27B at FP8 and Q8: it depends. The quantised Flash is better in a number of tasks despite running much slower on my DGX Spark-alike.<p>27B is amazing for its size but has some surprising limits when used for longer <em>agent</em>ic <em>coding</em> sessions, especially if you\u2019re using tools that are outside the stock standard web tech stuff: it really isn\u2019t good at Relay, for example."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Hy3"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://hy.tencent.com/research/hy3"}},"_tags":["comment","author_girvo","story_48847552"],"author":"girvo","comment_text":"Having heavily evaluated both antirez\u2019s ds4 flash and Qwen 3.6 27B at FP8 and Q8: it depends. The quantised Flash is better in a number of tasks despite running much slower on my DGX Spark-alike.<p>27B is amazing for its size but has some surprising limits when used for longer agentic coding sessions, especially if you\u2019re using tools that are outside the stock standard web tech stuff: it really isn\u2019t good at Relay, for example.","created_at":"2026-07-09T21:15:10Z","created_at_i":1783631710,"objectID":"48852474","parent_id":48848477,"story_id":48847552,"story_title":"Hy3","story_url":"https://hy.tencent.com/research/hy3","updated_at":"2026-07-11T13:15:36Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"goodmattg"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"I flip back and forth between whoever currently has the more powerful frontier model that isn't cost prohibitive - subscriptions only, API pricing a non-starter. Today that's Fable 5 which has been excellent, as soon as it's Sol I'll switch to that. The OAI/Anthropic harness behavior has mostly stabilized for me with consistent <em>AGENTS</em>.md that I sync with CLAUDE.md - I like pi (pi.dev) and have tried to build it up to get performance comparable to the two &quot;first-party&quot; harnesses, I'm just not there yet.<p>One major sticking criteria for not going with OpenCode / pi for all of my <em>coding</em> is I want access to the tier-1 frontier model of the day without API pricing - e.g. afaik I can't use Fable 5 via pi harness even though I have a subscription, so for this week I'm on Claude Code. It's not the need to Fable 5 for everything, but even if I just want the marginal intelligence benefit to stress test an architecture decision, it's a safety blanket to know there isn't a ~smarter~ model I could have used. And for my use cases, the doggedness and capability of these frontier models has been insanely effective.<p>My feeling is we're still in the Uber era subsidy period - the moment the subscriptions either try to lock me in longer than a month or stop OAI/Anthropic stop delivering frontier models in the subscriptions, I'm out - switching fully over to pi.dev or another OS harness and routing my token spend via OpenRouter or offloading to <em>Qwen</em> locally. Then I'll have to put an accurate dollar amount on frontier intelligence."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"GPT-5.6"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://openai.com/index/gpt-5-6/"}},"_tags":["comment","author_goodmattg","story_48849066"],"author":"goodmattg","children":[48851861,48850703,48856626,48852231],"comment_text":"I flip back and forth between whoever currently has the more powerful frontier model that isn&#x27;t cost prohibitive - subscriptions only, API pricing a non-starter. Today that&#x27;s Fable 5 which has been excellent, as soon as it&#x27;s Sol I&#x27;ll switch to that. The OAI&#x2F;Anthropic harness behavior has mostly stabilized for me with consistent AGENTS.md that I sync with CLAUDE.md - I like pi (pi.dev) and have tried to build it up to get performance comparable to the two &quot;first-party&quot; harnesses, I&#x27;m just not there yet.<p>One major sticking criteria for not going with OpenCode &#x2F; pi for all of my coding is I want access to the tier-1 frontier model of the day without API pricing - e.g. afaik I can&#x27;t use Fable 5 via pi harness even though I have a subscription, so for this week I&#x27;m on Claude Code. It&#x27;s not the need to Fable 5 for everything, but even if I just want the marginal intelligence benefit to stress test an architecture decision, it&#x27;s a safety blanket to know there isn&#x27;t a ~smarter~ model I could have used. And for my use cases, the doggedness and capability of these frontier models has been insanely effective.<p>My feeling is we&#x27;re still in the Uber era subsidy period - the moment the subscriptions either try to lock me in longer than a month or stop OAI&#x2F;Anthropic stop delivering frontier models in the subscriptions, I&#x27;m out - switching fully over to pi.dev or another OS harness and routing my token spend via OpenRouter or offloading to Qwen locally. Then I&#x27;ll have to put an accurate dollar amount on frontier intelligence.","created_at":"2026-07-09T18:49:41Z","created_at_i":1783622981,"objectID":"48850666","parent_id":48849066,"story_id":48849066,"story_title":"GPT-5.6","story_url":"https://openai.com/index/gpt-5-6/","updated_at":"2026-07-10T12:56:49Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"Catloafdev"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"For most <em>coding</em> or <em>agent</em>ic tasks, <em>Qwen</em> 3.6 27B likely outperforms, yes.<p>For 'general intelligence', DS4 Flash seems to be a noticeable step up still."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Hy3"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://hy.tencent.com/research/hy3"}},"_tags":["comment","author_Catloafdev","story_48847552"],"author":"Catloafdev","comment_text":"For most coding or agentic tasks, Qwen 3.6 27B likely outperforms, yes.<p>For &#x27;general intelligence&#x27;, DS4 Flash seems to be a noticeable step up still.","created_at":"2026-07-09T16:50:46Z","created_at_i":1783615846,"objectID":"48848872","parent_id":48848477,"story_id":48847552,"story_title":"Hy3","story_url":"https://hy.tencent.com/research/hy3","updated_at":"2026-07-10T18:52:34Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"cmar00"},"story_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"1. Farewell deep focus. Before, I could work in flow state for hours on end. Now, waiting for the prompt to finish introduces constant interruptions, which translates into constant distractions, constant context switch, infinite ideas for endless new projects. My brain is melting. I can literally feel when my neurons start clogging up, forcing me to take frequent pauses just staring outside the window for 10 minutes straight.<p>2. I don't own my own projects anymore. Before, I used to understand every single little implementation detail of a project. When a client reported a bug, I immediately knew what and where to fix in the code base. Now, all I can do is to clasp my hands and recite a prayer to the LLM deity, hoping that my voice will be heard.<p>3. The overlords decide my fate. This is a corollary of #2. There's an outage every other day, and since I don't understand 80% of the code the AI has written, no AI available means it's literally impossible for me to work. I depend completely on its availability.<p>4. It's mostly right, but never quite there. I really make an effort to write descriptive prompts, extensive documentation, and various design docs. All of this to make it easier for the AI to understand the context and philosophy of the project. I write <em>AGENTS</em>.md, skills and rules. Yet, it can easily do 80% of the work correctly, but it always makes some wrong assumptions that make the other 20% very hard to fix. You have to held it's hand over every single little bug that it can't automatically fix. Old, already fixed bugs keep resurfacing out of nowhere. It's like Penelope's shroud: it looks like it fixed the current problem, but an old fix just <em>wen</em>'t undone. No real progress was made.<p>This is my experience with <em>coding</em> <em>agents</em>, please tell me your own. I'd love this to be a discussion.<p>P.S. This post was handwritten by a human being."},"title":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["coding","agent"],"value":"Ask HN: I hate <em>coding</em> <em>agents</em>. Is this skill issue?"}},"_tags":["story","author_cmar00","story_48844345","ask_hn"],"author":"cmar00","children":[48857217,48844586,48914358,48865487,48872733,48852331,48844774,48875694,48866830,48861430,48852439,48953430,48931543,48845822,48857274,48866120,48870514,48872759,48862713],"created_at":"2026-07-09T11:47:27Z","created_at_i":1783597647,"num_comments":26,"objectID":"48844345","points":18,"story_id":48844345,"story_text":"1. Farewell deep focus. Before, I could work in flow state for hours on end. Now, waiting for the prompt to finish introduces constant interruptions, which translates into constant distractions, constant context switch, infinite ideas for endless new projects. My brain is melting. I can literally feel when my neurons start clogging up, forcing me to take frequent pauses just staring outside the window for 10 minutes straight.<p>2. I don&#x27;t own my own projects anymore. Before, I used to understand every single little implementation detail of a project. When a client reported a bug, I immediately knew what and where to fix in the code base. Now, all I can do is to clasp my hands and recite a prayer to the LLM deity, hoping that my voice will be heard.<p>3. The overlords decide my fate. This is a corollary of #2. There&#x27;s an outage every other day, and since I don&#x27;t understand 80% of the code the AI has written, no AI available means it&#x27;s literally impossible for me to work. I depend completely on its availability.<p>4. It&#x27;s mostly right, but never quite there. I really make an effort to write descriptive prompts, extensive documentation, and various design docs. All of this to make it easier for the AI to understand the context and philosophy of the project. I write AGENTS.md, skills and rules. Yet, it can easily do 80% of the work correctly, but it always makes some wrong assumptions that make the other 20% very hard to fix. You have to held it&#x27;s hand over every single little bug that it can&#x27;t automatically fix. Old, already fixed bugs keep resurfacing out of nowhere. It&#x27;s like Penelope&#x27;s shroud: it looks like it fixed the current problem, but an old fix just wen&#x27;t undone. No real progress was made.<p>This is my experience with coding agents, please tell me your own. I&#x27;d love this to be a discussion.<p>P.S. This post was handwritten by a human being.","title":"Ask HN: I hate coding agents. Is this skill issue?","updated_at":"2026-07-17T23:36:44Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"brainless"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"I have a project on very similar lines, <a href=\"https://github.com/brainless/dwata\" rel=\"nofollow\">https://github.com/brainless/dwata</a>, which I have not been developing for the past few months. I have been meaning to get back to it and I really like what I see on your project page.<p>My aim is to build a truly local app using only tiny/small models. I have had really good results from <em>Qwen</em> 3.5 1B, 4B, etc. Also, Gliner or similar models for different uses. SQLite + sqlite-vec + Tantivy + a tiny embedding model will stay as my go to.<p>In my case, <em>coding</em> <em>agent</em> is a separate product. I have <a href=\"https://github.com/brainless/nocodo\" rel=\"nofollow\">https://github.com/brainless/nocodo</a> for that. nocodo is also built for tiny/small models from the ground up. And recently I started building a wGPU based UI framework to build both these apps as native UI apps in Rust: <a href=\"https://github.com/brainless/akar\" rel=\"nofollow\">https://github.com/brainless/akar</a>. I also want e2e encrypted team/family sharing in my products.<p>Thank you for the inspiration. Would love to share notes and follow your progress."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Show HN: Rowboat \u2013 Open-source, local-first alternative to Claude Desktop"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://github.com/rowboatlabs/rowboat"}},"_tags":["comment","author_brainless","story_48819808"],"author":"brainless","children":[48832839],"comment_text":"I have a project on very similar lines, <a href=\"https:&#x2F;&#x2F;github.com&#x2F;brainless&#x2F;dwata\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;brainless&#x2F;dwata</a>, which I have not been developing for the past few months. I have been meaning to get back to it and I really like what I see on your project page.<p>My aim is to build a truly local app using only tiny&#x2F;small models. I have had really good results from Qwen 3.5 1B, 4B, etc. Also, Gliner or similar models for different uses. SQLite + sqlite-vec + Tantivy + a tiny embedding model will stay as my go to.<p>In my case, coding agent is a separate product. I have <a href=\"https:&#x2F;&#x2F;github.com&#x2F;brainless&#x2F;nocodo\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;brainless&#x2F;nocodo</a> for that. nocodo is also built for tiny&#x2F;small models from the ground up. And recently I started building a wGPU based UI framework to build both these apps as native UI apps in Rust: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;brainless&#x2F;akar\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;brainless&#x2F;akar</a>. I also want e2e encrypted team&#x2F;family sharing in my products.<p>Thank you for the inspiration. Would love to share notes and follow your progress.","created_at":"2026-07-08T14:43:04Z","created_at_i":1783521784,"objectID":"48832659","parent_id":48819808,"story_id":48819808,"story_title":"Show HN: Rowboat \u2013 Open-source, local-first alternative to Claude Desktop","story_url":"https://github.com/rowboatlabs/rowboat","updated_at":"2026-07-08T15:12:43Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"SwellJoe"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"Yeah, folks should be aware that if you're filling up the memory on a Strix Halo for an inference workload, you're going to be getting uncomfortably slow token rates. Like, DS4 (a 1-bit quantization of DeepSeek V4 Flash) runs at something like 9-13 tokens/second, with a loooong time to first token. It is not a realistic interactive <em>coding</em> model for <em>agent</em>ic use.<p>I like my Strix Halo and keep it chewing on stuff, mostly non-interactive workloads (security audits of software mostly, training experiments, etc.), I get a lot of use out of it. If you want to experiment with AI, it is a good platform for that, though at $4k you can get an Nvidia-based Asus Ascend GX10, which is probably better. But, if you want a local model for interactive <em>agent</em>ic use, you're going to be running either <em>Qwen</em> 3.6 or Gemma 4, which will fit comfortably on 2x64GB GPUs (even old GPUs will run them faster than the Strix Halo...I have dual Radeon Pro V620s which are faster, and they're six years old), or snugly on 32GB. A 48GB or 64GB Mac would run them well. Two Radeon AI Pro R9700 GPUs is probably the sweet spot, right now for GPUs. Not the cost of a good used car, like a 5090 or 4090, but plenty of memory and performance for local inference. Also, not finicky and weird and needing custom 3D printed fan shrouds like the old server GPUs on eBay.<p>At the moment, there just isn't a model that works better on a 128GB inference machine like this that don't also work fine on 64GB machines, which may be faster (very few 32GB GPUs will be slower, though I wouldn't recommend buying any GPU that isn't currently actively supported by the vendor drivers and CUDA or ROCm...so probably don't buy an MI50 or V100 or whatever)."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"AMD Ryzen AI Halo \u2013 $4k AI Dev Kit"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://www.lttlabs.com/articles/2026/07/06/amd-ryzen-ai-halo"}},"_tags":["comment","author_SwellJoe","story_48805624"],"author":"SwellJoe","children":[48810878],"comment_text":"Yeah, folks should be aware that if you&#x27;re filling up the memory on a Strix Halo for an inference workload, you&#x27;re going to be getting uncomfortably slow token rates. Like, DS4 (a 1-bit quantization of DeepSeek V4 Flash) runs at something like 9-13 tokens&#x2F;second, with a loooong time to first token. It is not a realistic interactive coding model for agentic use.<p>I like my Strix Halo and keep it chewing on stuff, mostly non-interactive workloads (security audits of software mostly, training experiments, etc.), I get a lot of use out of it. If you want to experiment with AI, it is a good platform for that, though at $4k you can get an Nvidia-based Asus Ascend GX10, which is probably better. But, if you want a local model for interactive agentic use, you&#x27;re going to be running either Qwen 3.6 or Gemma 4, which will fit comfortably on 2x64GB GPUs (even old GPUs will run them faster than the Strix Halo...I have dual Radeon Pro V620s which are faster, and they&#x27;re six years old), or snugly on 32GB. A 48GB or 64GB Mac would run them well. Two Radeon AI Pro R9700 GPUs is probably the sweet spot, right now for GPUs. Not the cost of a good used car, like a 5090 or 4090, but plenty of memory and performance for local inference. Also, not finicky and weird and needing custom 3D printed fan shrouds like the old server GPUs on eBay.<p>At the moment, there just isn&#x27;t a model that works better on a 128GB inference machine like this that don&#x27;t also work fine on 64GB machines, which may be faster (very few 32GB GPUs will be slower, though I wouldn&#x27;t recommend buying any GPU that isn&#x27;t currently actively supported by the vendor drivers and CUDA or ROCm...so probably don&#x27;t buy an MI50 or V100 or whatever).","created_at":"2026-07-06T18:52:08Z","created_at_i":1783363928,"objectID":"48808880","parent_id":48806664,"story_id":48805624,"story_title":"AMD Ryzen AI Halo \u2013 $4k AI Dev Kit","story_url":"https://www.lttlabs.com/articles/2026/07/06/amd-ryzen-ai-halo","updated_at":"2026-07-07T18:18:09Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"yolo-auto"},"story_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"Hi HN,<p>I was once given the advice: Don't waste expensive frontier model credits (GPT/Claude/etc.) on bulk work. Send the boring, repetitive, high-volume jobs to a smaller model, and save the expensive prompts for when you actually need frontier-level reasoning. I complained and told my manager that I shouldnt have to think about using certain models for certain <em>coding</em> tasks, and that one model should handle everything. Well, here we are anyway.<p>If anyone needs a place to absolutely abuse an LLM with high-volume tasks, come beat ours up at <a href=\"https://yolo-auto.com\" rel=\"nofollow\">https://yolo-auto.com</a>.<p>Here are the specs for $6/month:<p>- Model: <em>Qwen3</em>.6-35B-A3B\n- Unlimited tokens / No request caps\n- FP8 / 128k context\n- OpenAI-compatible endpoint\n- ~100 tokens/sec average\n- 100% private (zero data retention)<p>We also have a free tier that gives you 500 requests a day going on.<p>We've got around 100 active users so far. If you're skeptical about the unlimited claim, jump into our Discord and ask them\u2014we've got people burning hundreds of millions of tokens a day doing <em>agent</em> experiments, bulk <em>coding</em>, data processing, and all kinds of nonsense.<p>We're also just about to finish our first AI game-dev &quot;SlopJam,&quot; where people had 72 hours to build the most cursed AI-generated game they could. It was way more fun than we expected.<p>Drop a question or comment below, happy to answer anything!"},"title":{"matchLevel":"none","matchedWords":[],"value":"Show HN: An unmetered LLM API\u2013$6/month, no token tracking, no limits"},"url":{"matchLevel":"none","matchedWords":[],"value":"https://yolo-auto.com/"}},"_tags":["story","author_yolo-auto","story_48799719","show_hn"],"author":"yolo-auto","children":[48800847,48810987,48809319,48812953,48818137,48817667,48800243,48802591],"created_at":"2026-07-06T01:22:10Z","created_at_i":1783300930,"num_comments":13,"objectID":"48799719","points":12,"story_id":48799719,"story_text":"Hi HN,<p>I was once given the advice: Don&#x27;t waste expensive frontier model credits (GPT&#x2F;Claude&#x2F;etc.) on bulk work. Send the boring, repetitive, high-volume jobs to a smaller model, and save the expensive prompts for when you actually need frontier-level reasoning. I complained and told my manager that I shouldnt have to think about using certain models for certain coding tasks, and that one model should handle everything. Well, here we are anyway.<p>If anyone needs a place to absolutely abuse an LLM with high-volume tasks, come beat ours up at <a href=\"https:&#x2F;&#x2F;yolo-auto.com\" rel=\"nofollow\">https:&#x2F;&#x2F;yolo-auto.com</a>.<p>Here are the specs for $6&#x2F;month:<p>- Model: Qwen3.6-35B-A3B\n- Unlimited tokens &#x2F; No request caps\n- FP8 &#x2F; 128k context\n- OpenAI-compatible endpoint\n- ~100 tokens&#x2F;sec average\n- 100% private (zero data retention)<p>We also have a free tier that gives you 500 requests a day going on.<p>We&#x27;ve got around 100 active users so far. If you&#x27;re skeptical about the unlimited claim, jump into our Discord and ask them\u2014we&#x27;ve got people burning hundreds of millions of tokens a day doing agent experiments, bulk coding, data processing, and all kinds of nonsense.<p>We&#x27;re also just about to finish our first AI game-dev &quot;SlopJam,&quot; where people had 72 hours to build the most cursed AI-generated game they could. It was way more fun than we expected.<p>Drop a question or comment below, happy to answer anything!","title":"Show HN: An unmetered LLM API\u2013$6/month, no token tracking, no limits","updated_at":"2026-07-09T05:26:15Z","url":"https://yolo-auto.com/"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"Syntaf"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"Both very fair observations.<p>&gt; Whether this mode of working is going to be long-term viable is going to depend on how important it is for you to be aware of what has happened for the system in question<p>This is the million dollar we'll see answered in our lifetime. Software engineering exists to automate work, are we arrogant to think we are not destined to the same fate? Is this truly a job befitting of a human over an <em>agent</em>?<p>Ever since I discovered my dad's C++ book in highschool I've absolutely loved <em>coding</em>, but i'm not convinced I have a long stable career ahead of me in SWE -- I'm 30 now and have already seen so much change in the industry during my professional career.<p>&gt; how viable the economics are for the LLM usage at this level of assurance and how much ownership you exert over the LLM used or another similarly powered one<p>This piece scares me the most, a world where the next generation models are capped behind capital infeasible for the common person to access, further separating the ultra wealthy from what little remains of the middle class.<p>My hope is that open source models will fill the moat all of these AI companies so desperately want to dig, aready models like <em>Qwen</em> and Kimi are unfathomably better than what we had just a year or two ago."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Programmers need to start meditating"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://jacob.gold/posts/programmers-need-to-start-meditating-now/"}},"_tags":["comment","author_Syntaf","story_48792080"],"author":"Syntaf","children":[48796659],"comment_text":"Both very fair observations.<p>&gt; Whether this mode of working is going to be long-term viable is going to depend on how important it is for you to be aware of what has happened for the system in question<p>This is the million dollar we&#x27;ll see answered in our lifetime. Software engineering exists to automate work, are we arrogant to think we are not destined to the same fate? Is this truly a job befitting of a human over an agent?<p>Ever since I discovered my dad&#x27;s C++ book in highschool I&#x27;ve absolutely loved coding, but i&#x27;m not convinced I have a long stable career ahead of me in SWE -- I&#x27;m 30 now and have already seen so much change in the industry during my professional career.<p>&gt; how viable the economics are for the LLM usage at this level of assurance and how much ownership you exert over the LLM used or another similarly powered one<p>This piece scares me the most, a world where the next generation models are capped behind capital infeasible for the common person to access, further separating the ultra wealthy from what little remains of the middle class.<p>My hope is that open source models will fill the moat all of these AI companies so desperately want to dig, aready models like Qwen and Kimi are unfathomably better than what we had just a year or two ago.","created_at":"2026-07-05T15:07:28Z","created_at_i":1783264048,"objectID":"48794841","parent_id":48794059,"story_id":48792080,"story_title":"Programmers need to start meditating","story_url":"https://jacob.gold/posts/programmers-need-to-start-meditating-now/","updated_at":"2026-07-05T18:33:02Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"SwellJoe"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"I recently wrote up how I run local LLMs, because several folks had asked (<a href=\"https://swelljoe.com/post/how-i-run-local-llms/\" rel=\"nofollow\">https://swelljoe.com/post/how-i-run-local-llms/</a>) and I think even my setup, which I spent maybe $4200 on, half on a Strix Halo and half on upgrades for my desktop, would be too expensive to justify today. I bought before prices went through the roof, and only did so because I like to tinker with hardware...not because I expected it to ever pay for itself vs. buying subsidized tokens from the big guys or the cheap tokens from efficient providers like DeepSeek.<p>Buying four $13000 GPUs and several thousand dollars worth of supporting hardware seems crazy. This supply shortage has to end eventually, and I can buy billions of DeepSeek, MiMo, and GLM tokens, and use $100 or $200 a month subscriptions for the big guys in the meantime for the difference in price once that happens. And, you can't even run the full-sized GLM on that hardware, it is quantized and so is your KV cache; the degradation is small, but not non-existent. You're not running a model that's equal to what you get when you buy GLM tokens from Z.ai.<p>My recommendation for self-hosting is this: If you already have a 24GB or 32GB GPU, or two, or a recent Mac with 32GB or more, run the appropriate quantization of <em>Qwen</em> 3.6 27B or Gemma 4 31B. If your hardware is older and too slow for that, use the MoE, but know it'll be dumber. Use the tiny model for the stuff that doesn't need deep smarts: Research (give it a Brave or Exa MCP for web search), summarization, simple Python scripts for basic tasks, simple websites or web apps, categorization of stuff (I used Gemma 4 to review my past writing for friendliness and helpfulness), etc. It can also be a sub-<em>agent</em> for bigger agents (for those same kinds of tasks). Gemma 4 12B is an incredibly good model for its size, particularly for vision tasks, and in the 4-bit quantization (7GB on disk) it runs on anything, even a modern tablet or phone.<p>And, if you don't already have a big GPU or unified memory Mac, just wait. Use the cheap tokens every AI company wants to sell you, for now. A Claude or Codex or Gemini subscription is a good deal. Tokens from DeepSeek are a good deal, especially with Reasonix <em>agent</em> (which maximizes caching, which DeepSeek is uniquely good at, and cached tokens are uniquely cheap at DeepSeek). GLM is Good Enough and has a cheap <em>coding</em> plan. MiMo has the cheapest tokens for a 1T+ model in the game, though DeepSeek and GLM are better models, MiMo is fine.<p>When prices come down, I'll be speccing out a beast to run the big models, too. But, I'm not paying 4x for RAM and GPU and storage, and y'all shouldn't either. That's crazy. Computer prices go down over time. It is the law."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Jamesob's guide to running SOTA LLMs locally"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://github.com/jamesob/local-llm"}},"_tags":["comment","author_SwellJoe","story_48775921"],"author":"SwellJoe","children":[48781426],"comment_text":"I recently wrote up how I run local LLMs, because several folks had asked (<a href=\"https:&#x2F;&#x2F;swelljoe.com&#x2F;post&#x2F;how-i-run-local-llms&#x2F;\" rel=\"nofollow\">https:&#x2F;&#x2F;swelljoe.com&#x2F;post&#x2F;how-i-run-local-llms&#x2F;</a>) and I think even my setup, which I spent maybe $4200 on, half on a Strix Halo and half on upgrades for my desktop, would be too expensive to justify today. I bought before prices went through the roof, and only did so because I like to tinker with hardware...not because I expected it to ever pay for itself vs. buying subsidized tokens from the big guys or the cheap tokens from efficient providers like DeepSeek.<p>Buying four $13000 GPUs and several thousand dollars worth of supporting hardware seems crazy. This supply shortage has to end eventually, and I can buy billions of DeepSeek, MiMo, and GLM tokens, and use $100 or $200 a month subscriptions for the big guys in the meantime for the difference in price once that happens. And, you can&#x27;t even run the full-sized GLM on that hardware, it is quantized and so is your KV cache; the degradation is small, but not non-existent. You&#x27;re not running a model that&#x27;s equal to what you get when you buy GLM tokens from Z.ai.<p>My recommendation for self-hosting is this: If you already have a 24GB or 32GB GPU, or two, or a recent Mac with 32GB or more, run the appropriate quantization of Qwen 3.6 27B or Gemma 4 31B. If your hardware is older and too slow for that, use the MoE, but know it&#x27;ll be dumber. Use the tiny model for the stuff that doesn&#x27;t need deep smarts: Research (give it a Brave or Exa MCP for web search), summarization, simple Python scripts for basic tasks, simple websites or web apps, categorization of stuff (I used Gemma 4 to review my past writing for friendliness and helpfulness), etc. It can also be a sub-agent for bigger agents (for those same kinds of tasks). Gemma 4 12B is an incredibly good model for its size, particularly for vision tasks, and in the 4-bit quantization (7GB on disk) it runs on anything, even a modern tablet or phone.<p>And, if you don&#x27;t already have a big GPU or unified memory Mac, just wait. Use the cheap tokens every AI company wants to sell you, for now. A Claude or Codex or Gemini subscription is a good deal. Tokens from DeepSeek are a good deal, especially with Reasonix agent (which maximizes caching, which DeepSeek is uniquely good at, and cached tokens are uniquely cheap at DeepSeek). GLM is Good Enough and has a cheap coding plan. MiMo has the cheapest tokens for a 1T+ model in the game, though DeepSeek and GLM are better models, MiMo is fine.<p>When prices come down, I&#x27;ll be speccing out a beast to run the big models, too. But, I&#x27;m not paying 4x for RAM and GPU and storage, and y&#x27;all shouldn&#x27;t either. That&#x27;s crazy. Computer prices go down over time. It is the law.","created_at":"2026-07-03T20:17:54Z","created_at_i":1783109874,"objectID":"48779473","parent_id":48775921,"story_id":48775921,"story_title":"Jamesob's guide to running SOTA LLMs locally","story_url":"https://github.com/jamesob/local-llm","updated_at":"2026-07-04T16:49:58Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"automajicly"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["qwen","coding","agent"],"value":"I appreciate the insight. I might look at these a bit later... because I am swamped with this new build Im putting together. I havent yet found my so-called 'sweet spot' yet for larger tasks. I checked out the gemma4 variant the one or two times just for contextual reasoning and smaller build capabilities, but Ive only been testing single use isolated tool tests with my mcp pointed at it lately to ensure tools I've installed actually work before I incorporate them into the mcp server. No plug or anything, but just so you can get a feel of what I'm doing, I initially set out to learn <em>coding</em>. Then I fell into trying to build my own local model to help teach me <em>coding</em> (and anything else related) because tutorials and all these fancy webinars and such just put me to sleep. Once I discovered all of this AI stuff...? My interest in cybersecurity just skyrocketed. So....no life story or anything) ha... I started to build this mcp, pointed at LM Studio with <em>qwen</em> loaded as my first one. As I began to familiarise more I realised...I need a model that is free, local, will understand context, reasoning, can code, de-bug,vuln discovery- etc. etc. All of this sent me on this whole cybersecurity rabbit hole - as it does with any tech stuff, and now along with my need for a model to teach me <em>coding</em>(Python) I wanted a model to also HELP me build ITSELF basically. Because so far, Ive used good old Claude- Sonnet-5 straight from the mobile app for said task. Dont laugh. And I show the build,and upgrades of the mcp to GEMMA (initially <em>qwen</em>) and get fedback against the Cluade assisted builds. So I apologise for ruining your day, but... thats where Im headed. I need a model that will vibe with me, build with me, reason, comprehend context and also... not chew up my m1 16gb, or token usage. Well, if its free I suppose tokens <em>arent</em> a thing but they do sort of matter still. Thanks again. Your information is much appreciated and invaluable."},"story_title":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["qwen"],"value":"<em>Qwen</em> 3.6 27B is the sweet spot for local development"},"story_url":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["qwen"],"value":"https://quesma.com/blog/<em>qwen</em>-36-is-awesome/"}},"_tags":["comment","author_automajicly","story_48721903"],"author":"automajicly","comment_text":"I appreciate the insight. I might look at these a bit later... because I am swamped with this new build Im putting together. I havent yet found my so-called &#x27;sweet spot&#x27; yet for larger tasks. I checked out the gemma4 variant the one or two times just for contextual reasoning and smaller build capabilities, but Ive only been testing single use isolated tool tests with my mcp pointed at it lately to ensure tools I&#x27;ve installed actually work before I incorporate them into the mcp server. No plug or anything, but just so you can get a feel of what I&#x27;m doing, I initially set out to learn coding. Then I fell into trying to build my own local model to help teach me coding (and anything else related) because tutorials and all these fancy webinars and such just put me to sleep. Once I discovered all of this AI stuff...? My interest in cybersecurity just skyrocketed. So....no life story or anything) ha... I started to build this mcp, pointed at LM Studio with qwen loaded as my first one. As I began to familiarise more I realised...I need a model that is free, local, will understand context, reasoning, can code, de-bug,vuln discovery- etc. etc. All of this sent me on this whole cybersecurity rabbit hole - as it does with any tech stuff, and now along with my need for a model to teach me coding(Python) I wanted a model to also HELP me build ITSELF basically. Because so far, Ive used good old Claude- Sonnet-5 straight from the mobile app for said task. Dont laugh. And I show the build,and upgrades of the mcp to GEMMA (initially qwen) and get fedback against the Cluade assisted builds. So I apologise for ruining your day, but... thats where Im headed. I need a model that will vibe with me, build with me, reason, comprehend context and also... not chew up my m1 16gb, or token usage. Well, if its free I suppose tokens arent a thing but they do sort of matter still. Thanks again. Your information is much appreciated and invaluable.","created_at":"2026-07-02T19:00:48Z","created_at_i":1783018848,"objectID":"48765884","parent_id":48750973,"story_id":48721903,"story_title":"Qwen 3.6 27B is the sweet spot for local development","story_url":"https://quesma.com/blog/qwen-36-is-awesome/","updated_at":"2026-07-15T10:40:36Z"}],"hitsPerPage":20,"nbHits":405,"nbPages":21,"page":0,"params":"query=qwen+coding+agent&advancedSyntax=true&analyticsTags=backend","processingTimeMS":15,"processingTimingsMS":{"_request":{"roundTrip":15},"afterFetch":{"format":{"highlighting":3,"total":4}},"fetch":{"query":8,"scanning":4,"total":13},"total":15},"query":"qwen coding agent","serverTimeMS":19}