{"exhaustive":{"nbHits":false,"typo":false},"exhaustiveNbHits":false,"exhaustiveTypo":false,"hits":[{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"rglullis"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"The point you seem to be missing is that focusing only on <em>optimization</em> makes us all fragile to system shocks.<p>&gt; Businesses don\u2019t exercise (or perhaps even train) this process because it\u2019s just not needed enough to warrant the <em>cost</em>.<p>Until a crisis hits. Covid and supply chain failures. Iran war and straight of Hormuz. Prolonged War in Europe with no production pipeline available. Banks collapsing after unsustainable overleveraging in supposedly &quot;safe&quot; mortgages.<p>For every <em>optimization</em> and <em>cost</em>-saving measure that is deployed, there should be a backup plan in place. MBA types and &quot;technologists&quot; keep missing this. What is the backup plan for the case where most of the economy activity is built on software produced by business who overleveraged on <em>LLM</em> for code generation?"},"story_title":{"matchLevel":"none","matchedWords":[],"value":"The West Forgot How to Make Things. Now It's Forgetting How to Code"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://techtrenches.dev/p/the-west-forgot-how-to-make-things"}},"_tags":["comment","author_rglullis","story_47907879"],"author":"rglullis","comment_text":"The point you seem to be missing is that focusing only on optimization makes us all fragile to system shocks.<p>&gt; Businesses don\u2019t exercise (or perhaps even train) this process because it\u2019s just not needed enough to warrant the cost.<p>Until a crisis hits. Covid and supply chain failures. Iran war and straight of Hormuz. Prolonged War in Europe with no production pipeline available. Banks collapsing after unsustainable overleveraging in supposedly &quot;safe&quot; mortgages.<p>For every optimization and cost-saving measure that is deployed, there should be a backup plan in place. MBA types and &quot;technologists&quot; keep missing this. What is the backup plan for the case where most of the economy activity is built on software produced by business who overleveraged on LLM for code generation?","created_at":"2026-04-26T08:57:43Z","created_at_i":1777193863,"objectID":"47908655","parent_id":47908242,"story_id":47907879,"story_title":"The West Forgot How to Make Things. Now It's Forgetting How to Code","story_url":"https://techtrenches.dev/p/the-west-forgot-how-to-make-things","updated_at":"2026-04-26T08:58:51Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"tdi"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"From $200 to $30: Five Layers of <em>LLM</em> <em>Cost</em> <em>Optimization</em>"},"url":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["llm","optimization"],"value":"http://blog.dwornikowski.com/posts/cutting-<em>llm</em>-costs-token-<em>optimization</em>/"}},"_tags":["story","author_tdi","story_47900746"],"author":"tdi","children":[47901177,47900928],"created_at":"2026-04-25T11:56:31Z","created_at_i":1777118191,"num_comments":3,"objectID":"47900746","points":11,"story_id":47900746,"title":"From $200 to $30: Five Layers of LLM Cost Optimization","updated_at":"2026-04-26T11:47:37Z","url":"http://blog.dwornikowski.com/posts/cutting-llm-costs-token-optimization/"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"thiago_fm"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"Thanks for your perspective from somebody working on the field. I still wonder though: what would the results be if we'd just use a richer dataset + more parameters? Would it be really that different results? (except <em>costs</em>, as MoE def helps with that)<p>MoE: I assume some people just specialize in working with routing as with that, as by reducing the amount of params and just using a subset, you end up making it less costly. So, AI researchers are only working on <em>optimizations</em> on getting this better?<p>Same question on Reasoning, so AI researchers are working mostly on <em>optimizations</em> on top of it, like CoT and so on, like mini-<em>optimizations</em>.<p>So basically, they work on those micro-<em>optimizations</em>, put them together and see a % improvement in a benchmark?<p>I'm sure this is probably awesome for languages, which if I'm not mistaken, it was the use-case initially used on &quot;All you need is attention&quot; and the entire <em>LLM</em> revolution.<p>But this seems to be a very clear path to be &quot;taking the car to the carwash by foot&quot; for a long time, isn't it?<p>It feels like we'll keep &quot;taking the car to the carwash by foot&quot; until somebody optimizes for that prompt, or some pre-training done, and then there'll be another prompt that will show that the AI has real trouble with very basic real-world reasoning and imagination.<p>Isn't it the case, or do you see any kind of research that could take us from that plateau full of micro-<em>optimizations</em> that get us a few cm higher to the peak?"},"story_title":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["llm"],"value":"Ask HN: Is the ongoing AI research driving <em>LLM</em> models to be better?"}},"_tags":["comment","author_thiago_fm","story_47872916"],"author":"thiago_fm","children":[47873670],"comment_text":"Thanks for your perspective from somebody working on the field. I still wonder though: what would the results be if we&#x27;d just use a richer dataset + more parameters? Would it be really that different results? (except costs, as MoE def helps with that)<p>MoE: I assume some people just specialize in working with routing as with that, as by reducing the amount of params and just using a subset, you end up making it less costly. So, AI researchers are only working on optimizations on getting this better?<p>Same question on Reasoning, so AI researchers are working mostly on optimizations on top of it, like CoT and so on, like mini-optimizations.<p>So basically, they work on those micro-optimizations, put them together and see a % improvement in a benchmark?<p>I&#x27;m sure this is probably awesome for languages, which if I&#x27;m not mistaken, it was the use-case initially used on &quot;All you need is attention&quot; and the entire LLM revolution.<p>But this seems to be a very clear path to be &quot;taking the car to the carwash by foot&quot; for a long time, isn&#x27;t it?<p>It feels like we&#x27;ll keep &quot;taking the car to the carwash by foot&quot; until somebody optimizes for that prompt, or some pre-training done, and then there&#x27;ll be another prompt that will show that the AI has real trouble with very basic real-world reasoning and imagination.<p>Isn&#x27;t it the case, or do you see any kind of research that could take us from that plateau full of micro-optimizations that get us a few cm higher to the peak?","created_at":"2026-04-23T08:51:25Z","created_at_i":1776934285,"objectID":"47873548","parent_id":47873222,"story_id":47872916,"story_title":"Ask HN: Is the ongoing AI research driving LLM models to be better?","updated_at":"2026-04-23T09:16:40Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"pama"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"Not sure what you mean by efficiency as this was part of the article and I understand things differently\u2014can you clarify? For the energy of 20 W in an hour on a laptop\u2019s M4 pro, this model produces about 200k tokens (a book or two) at a typical electricity <em>cost</em> of less than a third of a US cent. Although clearly the intelligence of this particular model is unrelated to human intelligence, I always thought that there is no comparison between <em>LLMs</em> and humans in terms of efficiency: these models are way less energy expensive than humans. If you were to use data center scale <em>optimizations</em>, then serving <em>LLMs</em> is many additional orders of magnitude more efficient than serving <em>LLMs</em> at home. (The energy <em>cost</em> of inference on the M4 pro and iphone are listed in the article.)"},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Ternary Bonsai: Top Intelligence at 1.58 Bits"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://prismml.com/news/ternary-bonsai"}},"_tags":["comment","author_pama","story_47812749"],"author":"pama","comment_text":"Not sure what you mean by efficiency as this was part of the article and I understand things differently\u2014can you clarify? For the energy of 20 W in an hour on a laptop\u2019s M4 pro, this model produces about 200k tokens (a book or two) at a typical electricity cost of less than a third of a US cent. Although clearly the intelligence of this particular model is unrelated to human intelligence, I always thought that there is no comparison between LLMs and humans in terms of efficiency: these models are way less energy expensive than humans. If you were to use data center scale optimizations, then serving LLMs is many additional orders of magnitude more efficient than serving LLMs at home. (The energy cost of inference on the M4 pro and iphone are listed in the article.)","created_at":"2026-04-21T12:23:12Z","created_at_i":1776774192,"objectID":"47847787","parent_id":47844141,"story_id":47812749,"story_title":"Ternary Bonsai: Top Intelligence at 1.58 Bits","story_url":"https://prismml.com/news/ternary-bonsai","updated_at":"2026-04-21T18:24:04Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"nyrikki"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"Parallelism can be tricky and always has a <em>cost</em>, but don't discount the 3090 which is more expensive these days in that price bracket.<p>3090 llama.cpp (container in VM)<p><pre><code>    unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL  105 t/s\n    unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL  103 t/s\n</code></pre>\nStill slow compaired to the<p><pre><code>    ggml-org/gpt-oss-20b-GGUF 206 t/s\n</code></pre>\nBut on my 3x 1080 Ti 1x TITAN V getto machine I learned that multi gpu takes a lot of tuning no matter what. With the B70, where Vulkan has the CPU copy problem, and SYCL doesn't have a sponsor or enough volunteers, it will probably take a bit of profiling on your part.<p>There are a lot of variables, but PCIe bus speed doesn't matter that much for inference, but the internal memory bandwidth does, and you won't match that with PCIe ever.<p>To be clear, multicard Vulkan and absolutely SYCL have a lot of <em>optimizations</em> that could happen, but the only time two GPUs are really faster for inference is when one doesn't have enough ram to fit the entire model.<p>A 3090 has 936.2 GB/s of (low latency) internal bandwidth, while 16xPCIe5 only has 504.12, may have to be copied through the CPU, have locks, atomic operations etc...<p>For <em>LLM</em> inference, the bottleneck just usually going to be memory bandwidth which is why my 3090 is so close to the 5070ti above.<p><em>LLM</em> next token prediction is just a form of autoregressive decoding and will primarily be memory bound.<p>As I haven't used the larger intel GPUs I can't comment on what still needs to be optimized, but just don't expect multiple GPUs to increase performance without some nvlink style RDMA support _unless_ your process is compute and not memory bound."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Qwen3.6-35B-A3B: Agentic coding power, now open to all"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://qwen.ai/blog?id=qwen3.6-35b-a3b"}},"_tags":["comment","author_nyrikki","story_47792764"],"author":"nyrikki","comment_text":"Parallelism can be tricky and always has a cost, but don&#x27;t discount the 3090 which is more expensive these days in that price bracket.<p>3090 llama.cpp (container in VM)<p><pre><code>    unsloth&#x2F;Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL  105 t&#x2F;s\n    unsloth&#x2F;gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL  103 t&#x2F;s\n</code></pre>\nStill slow compaired to the<p><pre><code>    ggml-org&#x2F;gpt-oss-20b-GGUF 206 t&#x2F;s\n</code></pre>\nBut on my 3x 1080 Ti 1x TITAN V getto machine I learned that multi gpu takes a lot of tuning no matter what. With the B70, where Vulkan has the CPU copy problem, and SYCL doesn&#x27;t have a sponsor or enough volunteers, it will probably take a bit of profiling on your part.<p>There are a lot of variables, but PCIe bus speed doesn&#x27;t matter that much for inference, but the internal memory bandwidth does, and you won&#x27;t match that with PCIe ever.<p>To be clear, multicard Vulkan and absolutely SYCL have a lot of optimizations that could happen, but the only time two GPUs are really faster for inference is when one doesn&#x27;t have enough ram to fit the entire model.<p>A 3090 has 936.2 GB&#x2F;s of (low latency) internal bandwidth, while 16xPCIe5 only has 504.12, may have to be copied through the CPU, have locks, atomic operations etc...<p>For LLM inference, the bottleneck just usually going to be memory bandwidth which is why my 3090 is so close to the 5070ti above.<p>LLM next token prediction is just a form of autoregressive decoding and will primarily be memory bound.<p>As I haven&#x27;t used the larger intel GPUs I can&#x27;t comment on what still needs to be optimized, but just don&#x27;t expect multiple GPUs to increase performance without some nvlink style RDMA support _unless_ your process is compute and not memory bound.","created_at":"2026-04-17T00:05:25Z","created_at_i":1776384325,"objectID":"47801047","parent_id":47796553,"story_id":47792764,"story_title":"Qwen3.6-35B-A3B: Agentic coding power, now open to all","story_url":"https://qwen.ai/blog?id=qwen3.6-35b-a3b","updated_at":"2026-04-17T12:50:34Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"mowmiatlas"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"Extra context since the post got long. A few things that ate more time than I expected:<p>Streaming was the worst one. Kokoro doesn't expose a streaming interface as far as I could find, you hand it a chunk of text, it gives you back the full audio for that chunk. For a reading app you can't wait for a whole paragraph before playback starts, so the whole streaming layer had to be built on top. I didn't want to process the book then serve full audio, I wanted it to be interactive.<p>The basic shape: chunk into sentence-sized windows, render in the background, queue rendered chunks for playback, keep a small pre-render lookahead so playback never starves but the phone isn't speculatively rendering an entire chapter it might throw away on a skip.<p>Sentence chunking was its own fight. Too long and the model returns null and playback stops. Too short (four or five words at a time) and the naturalness diminishes, because the model uses context within a sentence to decide intonation. Chopped chunks sound like a bad GPS voice. I had to find the goldilocks window where the model is happy and the result still sounds good and handle long-sentence edge cases by splitting on secondary punctuation and stitching the audio back together without audible seams.<p>For battery-life there's cruise mode. When the screen is off and the next several sentences are already rendered and cached, the app swaps the whole synthesis/playback pipeline for a much lighter sequential AAC player, hardware-decoded audio files.<p>When the phone's on a charger, a background task pre-renders a chapter or two of upcoming audio and writes it to disk as M4A. That way, by the time you're actually reading, cruise mode has a cache to play from and the neural engine never has to wake up for long stretches. The system decides when to actually run the task, so it piggybacks on the phone's usual overnight charging window.<p>The Neural Engine was a disappointment. I was hoping to get Kokoro onto the ANE for the latency/efficiency win, seeing it works quite well on CPU, but it uses ops that CoreML doesn't route to the Neural Engine, so it falls back to GPU/CPU. The weird part: forcing .cpuAndNeuralEngine is actually slower than .cpuAndGPU on this model, probably partitioning <em>cost</em> from unsupported ops bouncing between compute units, but I don't fully understand why. If anyone on CoreML has a principled explanation I'd love to hear it.<p>iPhone 12 mini and lower, and simulators are cursed. They seem to run Kokoro successfully, i.e. no error, inference completes but the result is pure crackling/screeching gibberish audio. Same model, same weights, same code path. KittenTTS runs fine on the exact same hardware AND the XCode simulator. I still don't know what's going on here; Curious if anyone's seen similar.<p>KittenTTS was easy. Ported it as a fallback for older devices and published a minimal iOS example repo while I was at it: <a href=\"https://github.com/pepinu/KittenTTS-iOS\" rel=\"nofollow\">https://github.com/pepinu/KittenTTS-iOS</a> if you just want to see how to get a neural TTS model running on iPhone without the full app machinery around it.<p>Before I got  the iPhone <em>optimization</em> work far enough along, Kokoro ran in real time on a MacBook that I was literally putting a laptop on the passenger seat for long drives just to have something read to me. Very inconvenient, but it made me commit to getting the phone path right. The current build isn't really tested on Mac, maybe in the future.<p>On the <em>LLM</em> tooling question up front: YES, used Claude Code and Codex throughout. I might be too much into tokenmaxxing though, since I'd run several sessions in tandem for bug hunting and several more for review to get wisdom of the crowd of sorts."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Show HN: I built on-device TTS app because I run out of audiobooks on a flight"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://loudreader.io"}},"_tags":["comment","author_mowmiatlas","story_47780234"],"author":"mowmiatlas","comment_text":"Extra context since the post got long. A few things that ate more time than I expected:<p>Streaming was the worst one. Kokoro doesn&#x27;t expose a streaming interface as far as I could find, you hand it a chunk of text, it gives you back the full audio for that chunk. For a reading app you can&#x27;t wait for a whole paragraph before playback starts, so the whole streaming layer had to be built on top. I didn&#x27;t want to process the book then serve full audio, I wanted it to be interactive.<p>The basic shape: chunk into sentence-sized windows, render in the background, queue rendered chunks for playback, keep a small pre-render lookahead so playback never starves but the phone isn&#x27;t speculatively rendering an entire chapter it might throw away on a skip.<p>Sentence chunking was its own fight. Too long and the model returns null and playback stops. Too short (four or five words at a time) and the naturalness diminishes, because the model uses context within a sentence to decide intonation. Chopped chunks sound like a bad GPS voice. I had to find the goldilocks window where the model is happy and the result still sounds good and handle long-sentence edge cases by splitting on secondary punctuation and stitching the audio back together without audible seams.<p>For battery-life there&#x27;s cruise mode. When the screen is off and the next several sentences are already rendered and cached, the app swaps the whole synthesis&#x2F;playback pipeline for a much lighter sequential AAC player, hardware-decoded audio files.<p>When the phone&#x27;s on a charger, a background task pre-renders a chapter or two of upcoming audio and writes it to disk as M4A. That way, by the time you&#x27;re actually reading, cruise mode has a cache to play from and the neural engine never has to wake up for long stretches. The system decides when to actually run the task, so it piggybacks on the phone&#x27;s usual overnight charging window.<p>The Neural Engine was a disappointment. I was hoping to get Kokoro onto the ANE for the latency&#x2F;efficiency win, seeing it works quite well on CPU, but it uses ops that CoreML doesn&#x27;t route to the Neural Engine, so it falls back to GPU&#x2F;CPU. The weird part: forcing .cpuAndNeuralEngine is actually slower than .cpuAndGPU on this model, probably partitioning cost from unsupported ops bouncing between compute units, but I don&#x27;t fully understand why. If anyone on CoreML has a principled explanation I&#x27;d love to hear it.<p>iPhone 12 mini and lower, and simulators are cursed. They seem to run Kokoro successfully, i.e. no error, inference completes but the result is pure crackling&#x2F;screeching gibberish audio. Same model, same weights, same code path. KittenTTS runs fine on the exact same hardware AND the XCode simulator. I still don&#x27;t know what&#x27;s going on here; Curious if anyone&#x27;s seen similar.<p>KittenTTS was easy. Ported it as a fallback for older devices and published a minimal iOS example repo while I was at it: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;pepinu&#x2F;KittenTTS-iOS\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;pepinu&#x2F;KittenTTS-iOS</a> if you just want to see how to get a neural TTS model running on iPhone without the full app machinery around it.<p>Before I got  the iPhone optimization work far enough along, Kokoro ran in real time on a MacBook that I was literally putting a laptop on the passenger seat for long drives just to have something read to me. Very inconvenient, but it made me commit to getting the phone path right. The current build isn&#x27;t really tested on Mac, maybe in the future.<p>On the LLM tooling question up front: YES, used Claude Code and Codex throughout. I might be too much into tokenmaxxing though, since I&#x27;d run several sessions in tandem for bug hunting and several more for review to get wisdom of the crowd of sorts.","created_at":"2026-04-15T15:20:56Z","created_at_i":1776266456,"objectID":"47780373","parent_id":47780234,"story_id":47780234,"story_title":"Show HN: I built on-device TTS app because I run out of audiobooks on a flight","story_url":"https://loudreader.io","updated_at":"2026-04-15T15:21:57Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"orbital-decay"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"You mean &quot;corporate inference infrastructure&quot;, not <em>LLMs</em>. The reason for different outputs at t=0 is mostly batching <em>optimization</em>. <em>LLMs</em> themselves are indifferent to that, you can run them in a deterministic manner any time if you don't care about optimal batching and lowest possible inference <em>cost</em>. And even then, e.g. Gemini Flash is deterministic in practice even with batching, although DeepMind doesn't strictly guarantee it.<p>This is all currently irrelevant, making it work well is a much bigger problem. As soon as there's paying demand for reproducibility, solutions will appear. This is a matter of business need, not a technical issue."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Claude mixes up who said what"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://dwyer.co.za/static/claude-mixes-up-who-said-what-and-thats-not-ok.html"}},"_tags":["comment","author_orbital-decay","story_47701233"],"author":"orbital-decay","comment_text":"You mean &quot;corporate inference infrastructure&quot;, not LLMs. The reason for different outputs at t=0 is mostly batching optimization. LLMs themselves are indifferent to that, you can run them in a deterministic manner any time if you don&#x27;t care about optimal batching and lowest possible inference cost. And even then, e.g. Gemini Flash is deterministic in practice even with batching, although DeepMind doesn&#x27;t strictly guarantee it.<p>This is all currently irrelevant, making it work well is a much bigger problem. As soon as there&#x27;s paying demand for reproducibility, solutions will appear. This is a matter of business need, not a technical issue.","created_at":"2026-04-09T17:30:10Z","created_at_i":1775755810,"objectID":"47706604","parent_id":47706184,"story_id":47701233,"story_title":"Claude mixes up who said what","story_url":"https://dwyer.co.za/static/claude-mixes-up-who-said-what-and-thats-not-ok.html","updated_at":"2026-04-09T17:33:17Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"divsh17"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"Location: Bellevue, WA<p>Remote: No<p>Willing to relocate: Yes (within the US)<p>Technologies: Python, TypeScript, FastAPI, Next.js, React, PostgreSQL, pgvector, Redis, Apache Kafka, AWS, Docker, PyTorch, <em>LLMs</em>, RAG, Prompt Engineering<p>R\u00e9sum\u00e9/CV: <a href=\"https://divyanshusharma.com\" rel=\"nofollow\">https://divyanshusharma.com</a><p>Email: divyanshusharma17.work@gmail.com<p>AI Engineer with 3 years of experience across Amazon, NYU research, and high-growth startups. Currently building Elliot AI (<a href=\"https://elliotai.app\" rel=\"nofollow\">https://elliotai.app</a>), a production AI companion with hybrid RAG memory (pgvector + SQL scoring, &lt;300ms retrieval across 6+ months of history), intent-based tool routing that cut <em>LLM</em> <em>costs</em> 68%, and voice I/O - serving 50+ daily active sessions.<p>Previously at Amazon, where I drove $2.1M+ in annualized revenue impact through A/B experimentation across microservices. Before that, scaled a startup's email platform from 400K to 1.5M+ daily sends. Published in IEEE ICIP 2021 on zero-shot action recognition.<p>Looking for AI engineer / applied AI / full-stack AI roles at startups building <em>LLM</em>-powered products. I've shipped RAG pipelines, inference <em>optimization</em>, and <em>LLM</em>-powered OCR in production. Need STEM-OPT (no sponsorship <em>cost</em> to employer).<p>GitHub: <a href=\"https://github.com/divyanshusharma1709\" rel=\"nofollow\">https://github.com/divyanshusharma1709</a>"},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Ask HN: Who wants to be hired? (April 2026)"}},"_tags":["comment","author_divsh17","story_47601858"],"author":"divsh17","comment_text":"Location: Bellevue, WA<p>Remote: No<p>Willing to relocate: Yes (within the US)<p>Technologies: Python, TypeScript, FastAPI, Next.js, React, PostgreSQL, pgvector, Redis, Apache Kafka, AWS, Docker, PyTorch, LLMs, RAG, Prompt Engineering<p>R\u00e9sum\u00e9&#x2F;CV: <a href=\"https:&#x2F;&#x2F;divyanshusharma.com\" rel=\"nofollow\">https:&#x2F;&#x2F;divyanshusharma.com</a><p>Email: divyanshusharma17.work@gmail.com<p>AI Engineer with 3 years of experience across Amazon, NYU research, and high-growth startups. Currently building Elliot AI (<a href=\"https:&#x2F;&#x2F;elliotai.app\" rel=\"nofollow\">https:&#x2F;&#x2F;elliotai.app</a>), a production AI companion with hybrid RAG memory (pgvector + SQL scoring, &lt;300ms retrieval across 6+ months of history), intent-based tool routing that cut LLM costs 68%, and voice I&#x2F;O - serving 50+ daily active sessions.<p>Previously at Amazon, where I drove $2.1M+ in annualized revenue impact through A&#x2F;B experimentation across microservices. Before that, scaled a startup&#x27;s email platform from 400K to 1.5M+ daily sends. Published in IEEE ICIP 2021 on zero-shot action recognition.<p>Looking for AI engineer &#x2F; applied AI &#x2F; full-stack AI roles at startups building LLM-powered products. I&#x27;ve shipped RAG pipelines, inference optimization, and LLM-powered OCR in production. Need STEM-OPT (no sponsorship cost to employer).<p>GitHub: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;divyanshusharma1709\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;divyanshusharma1709</a>","created_at":"2026-04-05T11:33:12Z","created_at_i":1775388792,"objectID":"47648330","parent_id":47601858,"story_id":47601858,"story_title":"Ask HN: Who wants to be hired? (April 2026)","updated_at":"2026-04-05T11:37:30Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"nextime"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"<a href=\"https://pypi.org/project/aisbf/\" rel=\"nofollow\">https://pypi.org/project/aisbf/</a><p>AIsbf ( AI Should Be Frtee ) is a API proxy/router with intelligent ai driven router which exposes an openai compatible api to the clients making available to them in a unified interface different protocols and AI endpoint/services, offering various <em>optimization</em> aiming to make the <em>costs</em> of using <em>LLMs</em> more accessible to everyone.<p>It is multiuser, and can run from small setup or scale to big infrastructure.<p>In this last release:\n - support for cache on redis, sqlite, mysql, file\n - more context condensation method\n - native prompt caching and request caching support\n - faster and better semantic prompt based routing for autoselections\n - full support for Claude.ai subscribers with OAUTH2\n - full support for Amazon Kiro-cli subscribers with OAUTH2\n - full support for OpenAI codex subscribers OAUTH2\n - full support for Kilo.ai subscribers using token or OAUTH2\n - many bugfixes and new minor features"},"story_title":{"matchLevel":"none","matchedWords":[],"value":"AIsbf (AI Should Be Free) 0.9.8 Released"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://pypi.org/project/aisbf/"}},"_tags":["comment","author_nextime","story_47640126"],"author":"nextime","comment_text":"<a href=\"https:&#x2F;&#x2F;pypi.org&#x2F;project&#x2F;aisbf&#x2F;\" rel=\"nofollow\">https:&#x2F;&#x2F;pypi.org&#x2F;project&#x2F;aisbf&#x2F;</a><p>AIsbf ( AI Should Be Frtee ) is a API proxy&#x2F;router with intelligent ai driven router which exposes an openai compatible api to the clients making available to them in a unified interface different protocols and AI endpoint&#x2F;services, offering various optimization aiming to make the costs of using LLMs more accessible to everyone.<p>It is multiuser, and can run from small setup or scale to big infrastructure.<p>In this last release:\n - support for cache on redis, sqlite, mysql, file\n - more context condensation method\n - native prompt caching and request caching support\n - faster and better semantic prompt based routing for autoselections\n - full support for Claude.ai subscribers with OAUTH2\n - full support for Amazon Kiro-cli subscribers with OAUTH2\n - full support for OpenAI codex subscribers OAUTH2\n - full support for Kilo.ai subscribers using token or OAUTH2\n - many bugfixes and new minor features","created_at":"2026-04-04T15:53:03Z","created_at_i":1775317983,"objectID":"47640127","parent_id":47640126,"story_id":47640126,"story_title":"AIsbf (AI Should Be Free) 0.9.8 Released","story_url":"https://pypi.org/project/aisbf/","updated_at":"2026-04-04T15:57:57Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"WecoAI"},"story_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"We did experiments comparing Optuna &amp; autoresearch.\nAutoresearch converges faster, is more <em>cost</em>-efficient, and even generalizes better.<p>Experiments were done on NanoChat: we let Claude define Optuna\u2019s search space to align the priors between methods. Both <em>optimization</em> methods were run three times. Autoresearch is far more sample-efficient on average<p>In 5 min training setting, <em>LLM</em> tokens <em>cost</em> as much as GPUs, but despite a 2\u00d7 higher per-step <em>cost</em>, AutoResearch still comes out ahead across all <em>cost</em> budgets<p>What\u2019s more, the solution found by autoresearch generalizes better than Optuna\u2019s. We gave the best solutions more training time; the absolute score gap widens, and the statistical significance becomes stronger<p>An important contributor to autoresearch\u2019s capability is that it searches directly in code space. In the early stages, autoresearch tunes knobs within Optuna\u2019s 16-parameter search space. However, with more iterations, it starts to explore code changes"},"title":{"matchLevel":"none","matchedWords":[],"value":"Show HN: Is autoresearch better than classic hyperparameter tuning?"},"url":{"matchLevel":"none","matchedWords":[],"value":"https://www.weco.ai/blog/autoresearch-vs-classical-hpo"}},"_tags":["story","author_WecoAI","story_47620196","show_hn"],"author":"WecoAI","children":[47658420],"created_at":"2026-04-02T21:07:03Z","created_at_i":1775164023,"num_comments":1,"objectID":"47620196","points":3,"story_id":47620196,"story_text":"We did experiments comparing Optuna &amp; autoresearch.\nAutoresearch converges faster, is more cost-efficient, and even generalizes better.<p>Experiments were done on NanoChat: we let Claude define Optuna\u2019s search space to align the priors between methods. Both optimization methods were run three times. Autoresearch is far more sample-efficient on average<p>In 5 min training setting, LLM tokens cost as much as GPUs, but despite a 2\u00d7 higher per-step cost, AutoResearch still comes out ahead across all cost budgets<p>What\u2019s more, the solution found by autoresearch generalizes better than Optuna\u2019s. We gave the best solutions more training time; the absolute score gap widens, and the statistical significance becomes stronger<p>An important contributor to autoresearch\u2019s capability is that it searches directly in code space. In the early stages, autoresearch tunes knobs within Optuna\u2019s 16-parameter search space. However, with more iterations, it starts to explore code changes","title":"Show HN: Is autoresearch better than classic hyperparameter tuning?","updated_at":"2026-04-06T08:54:18Z","url":"https://www.weco.ai/blog/autoresearch-vs-classical-hpo"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"AceJohnny2"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"&gt; <i>it's basically a <em>cost</em> <em>optimization</em> masquerading as a feature</i><p><em>Cost</em> <em>optimization</em> <i>in the user's favor</i>.<p>Remember that every time you send a new message to the <em>LLM</em>, you are actually sending the <i>entire conversation</i> again with that added last message to the <em>LLM</em>.<p>Remember that LLMs are fixed functions, the only variable is the context input (and temperature, sure).<p>Naively, this would lead to quadratic consumption of your token quota, which would get ridiculously expensive as conversations stretch into current 100k-1M context windows.<p>To solve this, AI providers cache the context on the GPU, and only charge you for the delta in the conversation/context. But they're not going to keep that GPU cache warm for you forever, so it'll time out after some inactivity.<p>So the microcompaction-on-idle happens to soften the token consumption blow after you've stepped away for lunch, your context cache has been flushed by the AI provider, and you basically have to spend tokens to restart your conversation from scratch."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"The Claude Code Source Leak: fake tools, frustration regexes, undercover mode"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://alex000kim.com/posts/2026-03-31-claude-code-source-leak/"}},"_tags":["comment","author_AceJohnny2","story_47586778"],"author":"AceJohnny2","comment_text":"&gt; <i>it&#x27;s basically a cost optimization masquerading as a feature</i><p>Cost optimization <i>in the user&#x27;s favor</i>.<p>Remember that every time you send a new message to the LLM, you are actually sending the <i>entire conversation</i> again with that added last message to the LLM.<p>Remember that LLMs are fixed functions, the only variable is the context input (and temperature, sure).<p>Naively, this would lead to quadratic consumption of your token quota, which would get ridiculously expensive as conversations stretch into current 100k-1M context windows.<p>To solve this, AI providers cache the context on the GPU, and only charge you for the delta in the conversation&#x2F;context. But they&#x27;re not going to keep that GPU cache warm for you forever, so it&#x27;ll time out after some inactivity.<p>So the microcompaction-on-idle happens to soften the token consumption blow after you&#x27;ve stepped away for lunch, your context cache has been flushed by the AI provider, and you basically have to spend tokens to restart your conversation from scratch.","created_at":"2026-04-02T17:04:57Z","created_at_i":1775149497,"objectID":"47617089","parent_id":47611309,"story_id":47586778,"story_title":"The Claude Code Source Leak: fake tools, frustration regexes, undercover mode","story_url":"https://alex000kim.com/posts/2026-03-31-claude-code-source-leak/","updated_at":"2026-04-03T20:17:40Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"pithtoken"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"I'm in a similar situation \u2014 building a SaaS (API proxy for <em>LLM</em> <em>cost</em> <em>optimization</em>) and evaluating Estonia vs UK LTD. Leaning towards UK for Stripe compatibility but the 0% tax on retained profits in Estonia is very tempting. Would love to hear from anyone who's used both."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Ask HN: Founders of estonian e-businesses \u2013 is it worth it?"}},"_tags":["comment","author_pithtoken","story_47500239"],"author":"pithtoken","children":[47553442],"comment_text":"I&#x27;m in a similar situation \u2014 building a SaaS (API proxy for LLM cost optimization) and evaluating Estonia vs UK LTD. Leaning towards UK for Stripe compatibility but the 0% tax on retained profits in Estonia is very tempting. Would love to hear from anyone who&#x27;s used both.","created_at":"2026-03-28T10:17:20Z","created_at_i":1774693040,"objectID":"47553237","parent_id":47500239,"story_id":47500239,"story_title":"Ask HN: Founders of estonian e-businesses \u2013 is it worth it?","updated_at":"2026-03-28T10:58:26Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"engradient"},"story_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"I'm an undergrad with no research affiliation. I've been thinking about why <em>LLM</em> training is so expensive and why continuous learning remains unsolved. This post is where that thinking led \u2014 a concrete architecture proposal with a cheap falsifiable experiment at its core.<p>The Core Idea (30 seconds)\nCatastrophic forgetting \u2014 when fine-tuning a model on new knowledge destroys old knowledge \u2014 is universally treated as a problem to minimize.\nI think it's a measurement instrument.\nThe forgetting test: take two small specialist models trained on different domains. Run a short joint fine-tuning pass to put them in the same parameter space. Then train on domain A and measure degradation in domain B.<p>Low degradation \u2192 the domains share deep parametric structure \u2192 they should be one module\nHigh degradation \u2192 the domains are parametrically independent \u2192 they should stay separate<p>No human ontology required. The signal itself reveals the intrinsic topology of knowledge.<p>Why This Matters: The Architecture It Enables\nIf forgetting is a probe, you can build a system that uses it to self-organize:\nAtomic specialist modules \u2014 not &quot;mathematics&quot; as one model, but differential calculus, integral calculus, linear algebra as separate models. Each has its own isolated parameter space, trained independently, updatable without affecting others.\nAn Orchestrator \u2014 one model whose only job is routing: given a task, which modules to call, in what order. Trained purely on routing performance, not subject matter. Also responsible for evaluating output quality, scheduling retraining of weak modules, and spawning new modules when a domain isn't covered.\nThe forgetting test as the merge criterion \u2014 the system continuously runs pairwise tests on candidate module pairs. Modules that pass get merged. Modules that fail stay separate. Granularity emerges from the data, not from human taxonomy.\nThis is not MoE. In MoE, experts are trained jointly in a single run and share an <em>optimization</em> process. Here, modules are genuinely independent \u2014 separate training runs, separate parameter spaces, separate update cycles.<p>The Falsifiable Experiment\nThe entire proposal rests on one empirical bet:<p>Catastrophic forgetting magnitude between two domains predicts whether those domains share intrinsic parametric structure.<p>How to test it:<p>Train ~20 small models (3B params each) on narrow mathematical subdomains\nFor each candidate pair: run joint fine-tuning, then train on domain A, measure degradation in domain B\nBuild a topology graph from the degradation matrix\nCompare to human intuitions about mathematical structure<p><em>Cost</em>: ~100 H800-days. Cheap enough to run before committing to the full architecture.\nIf the forgetting-derived topology matches (or interestingly contradicts) human intuitions about knowledge structure, the probe is real. If it's noise, the approach needs fundamental revision.<p>What's Novel Here\nExisting continual learning work: minimize forgetting.\nThis: use forgetting magnitude as a structured signal.\nExisting modular <em>LLM</em> work: MoE variants, multi-agent systems with generalist agents.\nThis: genuinely independent atomic specialists with emergent boundaries.\nExisting orchestration: route tasks to agents.\nThis: orchestrator also manages module lifecycle \u2014 retraining, spawning, retiring \u2014 based on forgetting test results.\nThe combination hasn't been done. Each piece exists in isolation. The forgetting-as-probe insight is what connects them.<p>Open Problems (Being Honest)<p>Orchestrator evaluation without ground truth: for open-ended tasks, how does the Orchestrator assess quality? Hypothesis: inter-module consistency as a proxy signal.\nModule splitting: merged modules may need splitting as knowledge evolves. Detection mechanism unclear.\nSafety under self-modification: addressed via a frozen &quot;tuple layer&quot; \u2014 architecturally inaccessible parameters encoding invariant constraints."},"title":{"matchLevel":"none","matchedWords":[],"value":"Using Catastrophic Forgetting as a Knowledge Topology Probe"}},"_tags":["story","author_engradient","story_47552974","ask_hn"],"author":"engradient","children":[47663766],"created_at":"2026-03-28T09:22:16Z","created_at_i":1774689736,"num_comments":0,"objectID":"47552974","points":2,"story_id":47552974,"story_text":"I&#x27;m an undergrad with no research affiliation. I&#x27;ve been thinking about why LLM training is so expensive and why continuous learning remains unsolved. This post is where that thinking led \u2014 a concrete architecture proposal with a cheap falsifiable experiment at its core.<p>The Core Idea (30 seconds)\nCatastrophic forgetting \u2014 when fine-tuning a model on new knowledge destroys old knowledge \u2014 is universally treated as a problem to minimize.\nI think it&#x27;s a measurement instrument.\nThe forgetting test: take two small specialist models trained on different domains. Run a short joint fine-tuning pass to put them in the same parameter space. Then train on domain A and measure degradation in domain B.<p>Low degradation \u2192 the domains share deep parametric structure \u2192 they should be one module\nHigh degradation \u2192 the domains are parametrically independent \u2192 they should stay separate<p>No human ontology required. The signal itself reveals the intrinsic topology of knowledge.<p>Why This Matters: The Architecture It Enables\nIf forgetting is a probe, you can build a system that uses it to self-organize:\nAtomic specialist modules \u2014 not &quot;mathematics&quot; as one model, but differential calculus, integral calculus, linear algebra as separate models. Each has its own isolated parameter space, trained independently, updatable without affecting others.\nAn Orchestrator \u2014 one model whose only job is routing: given a task, which modules to call, in what order. Trained purely on routing performance, not subject matter. Also responsible for evaluating output quality, scheduling retraining of weak modules, and spawning new modules when a domain isn&#x27;t covered.\nThe forgetting test as the merge criterion \u2014 the system continuously runs pairwise tests on candidate module pairs. Modules that pass get merged. Modules that fail stay separate. Granularity emerges from the data, not from human taxonomy.\nThis is not MoE. In MoE, experts are trained jointly in a single run and share an optimization process. Here, modules are genuinely independent \u2014 separate training runs, separate parameter spaces, separate update cycles.<p>The Falsifiable Experiment\nThe entire proposal rests on one empirical bet:<p>Catastrophic forgetting magnitude between two domains predicts whether those domains share intrinsic parametric structure.<p>How to test it:<p>Train ~20 small models (3B params each) on narrow mathematical subdomains\nFor each candidate pair: run joint fine-tuning, then train on domain A, measure degradation in domain B\nBuild a topology graph from the degradation matrix\nCompare to human intuitions about mathematical structure<p>Cost: ~100 H800-days. Cheap enough to run before committing to the full architecture.\nIf the forgetting-derived topology matches (or interestingly contradicts) human intuitions about knowledge structure, the probe is real. If it&#x27;s noise, the approach needs fundamental revision.<p>What&#x27;s Novel Here\nExisting continual learning work: minimize forgetting.\nThis: use forgetting magnitude as a structured signal.\nExisting modular LLM work: MoE variants, multi-agent systems with generalist agents.\nThis: genuinely independent atomic specialists with emergent boundaries.\nExisting orchestration: route tasks to agents.\nThis: orchestrator also manages module lifecycle \u2014 retraining, spawning, retiring \u2014 based on forgetting test results.\nThe combination hasn&#x27;t been done. Each piece exists in isolation. The forgetting-as-probe insight is what connects them.<p>Open Problems (Being Honest)<p>Orchestrator evaluation without ground truth: for open-ended tasks, how does the Orchestrator assess quality? Hypothesis: inter-module consistency as a proxy signal.\nModule splitting: merged modules may need splitting as knowledge evolves. Detection mechanism unclear.\nSafety under self-modification: addressed via a frozen &quot;tuple layer&quot; \u2014 architecturally inaccessible parameters encoding invariant constraints.","title":"Using Catastrophic Forgetting as a Knowledge Topology Probe","updated_at":"2026-04-06T17:12:05Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"TheGRS"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"Validation efforts likely become more necessary, so <em>costs</em> rise in another area. And product managers find they still need someone to translate the requirements well because <em>LLMs</em> are too agreeable. <em>Cost</em> <em>optimization</em> still needs someone to intervene as well.<p>I know there's an attempt to shift the development part from developers to other laypeople, but I think that's just going to frustrate everyone involved and probably settle back down into technical roles again. Well paid? Unclear."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Why are executives enamored with AI, but ICs aren't?"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://johnjwang.com/post/2026/03/27/why-are-executives-enabled-with-ai-but-ics-arent/"}},"_tags":["comment","author_TheGRS","story_47549649"],"author":"TheGRS","comment_text":"Validation efforts likely become more necessary, so costs rise in another area. And product managers find they still need someone to translate the requirements well because LLMs are too agreeable. Cost optimization still needs someone to intervene as well.<p>I know there&#x27;s an attempt to shift the development part from developers to other laypeople, but I think that&#x27;s just going to frustrate everyone involved and probably settle back down into technical roles again. Well paid? Unclear.","created_at":"2026-03-28T00:08:21Z","created_at_i":1774656501,"objectID":"47550061","parent_id":47549801,"story_id":47549649,"story_title":"Why are executives enamored with AI, but ICs aren't?","story_url":"https://johnjwang.com/post/2026/03/27/why-are-executives-enabled-with-ai-but-ics-arent/","updated_at":"2026-03-28T02:28:11Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"aeonfox"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"I was thinking more if you write a prompt into an IDE that has first-party integration with an <em>LLM</em> platform (e.g. VS Code with Github Copilot), it would make sense on their end to reduce and remove redundant input before ingesting the token into their models, just to increase throughput (increase customers) and decrease latency (reduce <em>costs</em>).  They would be foolish <i>not</i> to do this kind of <em>optimisation</em>, so surely they must be doing it.  Whether they would pass on those token savings to the user, I couldn't say."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Show HN: Nit \u2013 I rebuilt Git in Zig to save AI agents 71% on tokens"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://justfielding.com/blog/nit-replacing-git-with-zig"}},"_tags":["comment","author_aeonfox","story_47526276"],"author":"aeonfox","comment_text":"I was thinking more if you write a prompt into an IDE that has first-party integration with an LLM platform (e.g. VS Code with Github Copilot), it would make sense on their end to reduce and remove redundant input before ingesting the token into their models, just to increase throughput (increase customers) and decrease latency (reduce costs).  They would be foolish <i>not</i> to do this kind of optimisation, so surely they must be doing it.  Whether they would pass on those token savings to the user, I couldn&#x27;t say.","created_at":"2026-03-27T09:00:24Z","created_at_i":1774602024,"objectID":"47540408","parent_id":47527434,"story_id":47526276,"story_title":"Show HN: Nit \u2013 I rebuilt Git in Zig to save AI agents 71% on tokens","story_url":"https://justfielding.com/blog/nit-replacing-git-with-zig","updated_at":"2026-03-27T09:05:37Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"indiegoing"},"story_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"For one of my projects, I needed to choose an <em>LLM</em> but got lost in numbers and tokenization. So I searched for a solution which could help me do the math, but only found tools that helped with <em>cost</em> management and <em>optimization</em> at the production stage. I did some research and found that this is an existing problem, especially if you are a vibe-coder or solo developer starting an AI-powered app from scratch.<p>So I built an MVP to test with you guys \u2014 if any of you relate to the problem, please tell me what works and what's missing. Already included retries, prompt caching, batch discounts, and 3\u00d7/10\u00d7 growth scenarios across 6 models (GPT-4o, Claude, Gemini, DeepSeek and more). Also  the app models full architecture, the user just needs to pick an app type and set the usage pattern."},"title":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["llm","cost"],"value":"Show HN: //Beforeyouship is a pre-build tool to estimate the <em>LLM</em> <em>cost</em>"},"url":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["llm","cost"],"value":"https://<em>llm</em>-architecture-<em>cost</em>-modeler.vercel.app/"}},"_tags":["story","author_indiegoing","story_47528581","show_hn"],"author":"indiegoing","children":[47685820,47529163,47529484],"created_at":"2026-03-26T10:10:59Z","created_at_i":1774519859,"num_comments":5,"objectID":"47528581","points":3,"story_id":47528581,"story_text":"For one of my projects, I needed to choose an LLM but got lost in numbers and tokenization. So I searched for a solution which could help me do the math, but only found tools that helped with cost management and optimization at the production stage. I did some research and found that this is an existing problem, especially if you are a vibe-coder or solo developer starting an AI-powered app from scratch.<p>So I built an MVP to test with you guys \u2014 if any of you relate to the problem, please tell me what works and what&#x27;s missing. Already included retries, prompt caching, batch discounts, and 3\u00d7&#x2F;10\u00d7 growth scenarios across 6 models (GPT-4o, Claude, Gemini, DeepSeek and more). Also  the app models full architecture, the user just needs to pick an app type and set the usage pattern.","title":"Show HN: //Beforeyouship is a pre-build tool to estimate the LLM cost","updated_at":"2026-04-08T05:45:41Z","url":"https://llm-architecture-cost-modeler.vercel.app/"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"samherder"},"story_text":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["llm"],"value":"I built Genosis because my AI trading assistant's Anthropic bill was eating the project alive \u2014 12% cache hit rate when it should have been 80%, and I was spending more time optimizing costs than building the actual product.<p>Every major <em>LLM</em> provider offers 50-90% discounts on cached tokens, but the mechanics to actually capture them are different for every provider, change regularly, and are genuinely hard to get right.<p>Genosis watches your traffic (content-blind \u2014 it only sees hashes, never your data), figures out which blocks are worth caching and in what order, and delivers a manifest that the SDK applies locally. It also catches duplicate requests and serves them from a local cache \u2014 saving both input and output tokens.<p>It's not a proxy. It's never in your request path. If it goes down, your app works exactly as if it was never there.<p>- Open-source SDK: Python (pip install genosis) and TypeScript (npm install @genosis/sdk)\n- Supports Anthropic and OpenAI (Google coming soon)\n- Free tier \u2014 you only pay if we save you money<p>I'm the sole founder. Happy to answer questions about how it works, the caching mechanics, or anything else."},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"Show HN: Genosis \u2013 <em>LLM</em> <em>cost</em> <em>optimization</em> that learns from your traffic"},"url":{"matchLevel":"none","matchedWords":[],"value":"https://usegenosis.ai/"}},"_tags":["story","author_samherder","story_47516438","show_hn"],"author":"samherder","children":[47645712,47582787],"created_at":"2026-03-25T12:28:18Z","created_at_i":1774441698,"num_comments":0,"objectID":"47516438","points":2,"story_id":47516438,"story_text":"I built Genosis because my AI trading assistant&#x27;s Anthropic bill was eating the project alive \u2014 12% cache hit rate when it should have been 80%, and I was spending more time optimizing costs than building the actual product.<p>Every major LLM provider offers 50-90% discounts on cached tokens, but the mechanics to actually capture them are different for every provider, change regularly, and are genuinely hard to get right.<p>Genosis watches your traffic (content-blind \u2014 it only sees hashes, never your data), figures out which blocks are worth caching and in what order, and delivers a manifest that the SDK applies locally. It also catches duplicate requests and serves them from a local cache \u2014 saving both input and output tokens.<p>It&#x27;s not a proxy. It&#x27;s never in your request path. If it goes down, your app works exactly as if it was never there.<p>- Open-source SDK: Python (pip install genosis) and TypeScript (npm install @genosis&#x2F;sdk)\n- Supports Anthropic and OpenAI (Google coming soon)\n- Free tier \u2014 you only pay if we save you money<p>I&#x27;m the sole founder. Happy to answer questions about how it works, the caching mechanics, or anything else.","title":"Show HN: Genosis \u2013 LLM cost optimization that learns from your traffic","updated_at":"2026-04-05T02:52:28Z","url":"https://usegenosis.ai/"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"sannysanoff"},"story_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"I just took some legacy java code. You know that one, well engineered, extensible, with multiple cases accumulated with years, special cases, kinda slower than it was initially, but still strong for production. Some good ideas were born after it went live, but nobody will break working piece of code.<p>Yes, it was kinda refactoring for bigger client and his use case with bigger data. I implemented slightly different order in input data, so it now does work more serially than parallel, yes, it takes less memory, and is more cache-friendly. But that's not all.<p>I also moved it from row-oriented CSV to column oriented. You say good refactor! Yeah, i think so, too. Yes, that uses Clickhouse as a DB. Native Clickhouse column-oriented wire format vs any human oriented format. Yes, I gave it Clickhouse C++ source, because format however stable, is not documented in the &quot;documentation&quot;. Yes, it created custom-tailored serializer and de-serializer, including dictionaries (low cardinality columns). Yes, i explained what I expect from it.<p>Yes, I asked <em>LLM</em> to not implement data transfer objects. Instead, it reads directly from Clickhouse native wire format, without allocations, and writes Clickhouse native wire format, without allocations. It allocates slightly when processing data itself, but I optimized it too.<p>Code did few passes on data, I asked <em>LLM</em> to perform loop fusion and do all in one pass, because, as a human, that would complicate code, and was not done before, but the client is important, you know.<p>It contained some suboptimal data layout IN ITS CORE, too. I played and measured several layouts, by just changing it (adding/removing few indirections here and there), all code using this CORE DATA was adapted automatically, so I did quite a few iterations evaluating best one during that morning.<p>Already efficient code became 20 times faster. Not because it was not efficient. But because it was legacy, human oriented, well designed, and it worked, it was java, even  with fast parser/processor <em>optimizations</em>, reduced allocations etc - it was long maintained, it was an asset.<p>I just applied some transformations to it, in mostly automatic way. Yes, I can do that again.<p>This is called supercompilation, guys. It can be automated these days. Legacy is original generic program. Like they synthesized rocket engine using AI (see news and pictures circa year ago), we can synthesize supercompiled programs from single legacy source, given various boundary conditions.<p>My congratulations!<p>NB Supercompilation is not a cool word, it's quite old concept in IT, for example see https://sites.google.com/site/keldyshscp/Home/supercompilerconcept<p>PS on the &quot;this code is liability now&quot;. Nope. <em>Cost</em> of maintaining this code TODAY is radically lower than ever before."},"title":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["llm"],"value":"<em>LLM</em> Can Be a Supercompiler"}},"_tags":["story","author_sannysanoff","story_47497411","ask_hn"],"author":"sannysanoff","created_at":"2026-03-24T01:07:16Z","created_at_i":1774314436,"num_comments":0,"objectID":"47497411","points":1,"story_id":47497411,"story_text":"I just took some legacy java code. You know that one, well engineered, extensible, with multiple cases accumulated with years, special cases, kinda slower than it was initially, but still strong for production. Some good ideas were born after it went live, but nobody will break working piece of code.<p>Yes, it was kinda refactoring for bigger client and his use case with bigger data. I implemented slightly different order in input data, so it now does work more serially than parallel, yes, it takes less memory, and is more cache-friendly. But that&#x27;s not all.<p>I also moved it from row-oriented CSV to column oriented. You say good refactor! Yeah, i think so, too. Yes, that uses Clickhouse as a DB. Native Clickhouse column-oriented wire format vs any human oriented format. Yes, I gave it Clickhouse C++ source, because format however stable, is not documented in the &quot;documentation&quot;. Yes, it created custom-tailored serializer and de-serializer, including dictionaries (low cardinality columns). Yes, i explained what I expect from it.<p>Yes, I asked LLM to not implement data transfer objects. Instead, it reads directly from Clickhouse native wire format, without allocations, and writes Clickhouse native wire format, without allocations. It allocates slightly when processing data itself, but I optimized it too.<p>Code did few passes on data, I asked LLM to perform loop fusion and do all in one pass, because, as a human, that would complicate code, and was not done before, but the client is important, you know.<p>It contained some suboptimal data layout IN ITS CORE, too. I played and measured several layouts, by just changing it (adding&#x2F;removing few indirections here and there), all code using this CORE DATA was adapted automatically, so I did quite a few iterations evaluating best one during that morning.<p>Already efficient code became 20 times faster. Not because it was not efficient. But because it was legacy, human oriented, well designed, and it worked, it was java, even  with fast parser&#x2F;processor optimizations, reduced allocations etc - it was long maintained, it was an asset.<p>I just applied some transformations to it, in mostly automatic way. Yes, I can do that again.<p>This is called supercompilation, guys. It can be automated these days. Legacy is original generic program. Like they synthesized rocket engine using AI (see news and pictures circa year ago), we can synthesize supercompiled programs from single legacy source, given various boundary conditions.<p>My congratulations!<p>NB Supercompilation is not a cool word, it&#x27;s quite old concept in IT, for example see https:&#x2F;&#x2F;sites.google.com&#x2F;site&#x2F;keldyshscp&#x2F;Home&#x2F;supercompilerconcept<p>PS on the &quot;this code is liability now&quot;. Nope. Cost of maintaining this code TODAY is radically lower than ever before.","title":"LLM Can Be a Supercompiler","updated_at":"2026-03-24T01:13:53Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"adampunk"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"Here's one possibility: Anthropic understands the value of the brand and the harness and that those two things are connected, specifically because they came from behind. OpenAI almost accidentally launched a global brand overnight. ChatGPT went from nothing to the kind of english word you hear in non-anglophone countries in about a month. Millions and millions have used it (at least once) and more people associate it with AI than use it. OpenAI's problem is managing the big industry links so that by the time the hype cools down, they're already plugged into tools. Their &quot;moat&quot; is that number of companies that matter is actually small and all those companies like predictable, enterprise shaped solutions with contracts and stuff. Unlike developers who might switch their subscriptions quickly and absorb the productivity <em>cost</em> of switching (or minimize that <em>cost</em>), these big companies don't want to be constantly optimizing compute vs rental rate. They want to convert an unruly value (programmer productivity) to something easy, not replace it with a scheduling or <em>optimization</em> problem.<p>That was working ok until Claude, specifically Claude Code showed up. This was a really useful code-writing harness (that also signed your commits, advertising itself to everyone) that took what are essentially very similar models and made Opus feel like the future of software while GPT 5.2 and friends are just code agents. The performance, ability to handle long term tasks, all of that was basically similar but the harness oriented the model to reason, shell out sub-agents, write scratch code, add console logs, all the sorts of things that 1. seem like science fiction, and 2. improve output a little. Then from fall of last year to no you don't have developers saying &quot;look what I made with <em>LLMs</em>&quot; or &quot;Look what I made with AI&quot; but &quot;Look what I did with Claude&quot;. There are not very many blog posts out there about the future of software being re-written due to GPT 5.2 getting autocompaction, but that same feature spawned thousands of &quot;oh shit!&quot; posts in Claude.<p>That's not a more defensible moat than name recognition + small N for customers. It's a scarier position because if someone else figures out how to deliver the same result (Opus +  sonnet + Haiku in a managed ensemble) in a way that was sharp and viral, the same thing they did to OpenAI could happen to them. They still supply the compute but the fact that anyone gives a shit about them is their harness makes it look like more and better code is being written. If that's your situation, you gently write the OpenClaw guy, you threaten to cut off and sue OpenCode for using subscription sign-in. You don't do those things because of a numerator/denominator problem with token <em>cost</em> and monthly fees. You do it because someone using your models in a better harness is a clear brand threat."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Anthropic takes legal action against OpenCode"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://github.com/anomalyco/opencode/pull/18186"}},"_tags":["comment","author_adampunk","story_47444748"],"author":"adampunk","comment_text":"Here&#x27;s one possibility: Anthropic understands the value of the brand and the harness and that those two things are connected, specifically because they came from behind. OpenAI almost accidentally launched a global brand overnight. ChatGPT went from nothing to the kind of english word you hear in non-anglophone countries in about a month. Millions and millions have used it (at least once) and more people associate it with AI than use it. OpenAI&#x27;s problem is managing the big industry links so that by the time the hype cools down, they&#x27;re already plugged into tools. Their &quot;moat&quot; is that number of companies that matter is actually small and all those companies like predictable, enterprise shaped solutions with contracts and stuff. Unlike developers who might switch their subscriptions quickly and absorb the productivity cost of switching (or minimize that cost), these big companies don&#x27;t want to be constantly optimizing compute vs rental rate. They want to convert an unruly value (programmer productivity) to something easy, not replace it with a scheduling or optimization problem.<p>That was working ok until Claude, specifically Claude Code showed up. This was a really useful code-writing harness (that also signed your commits, advertising itself to everyone) that took what are essentially very similar models and made Opus feel like the future of software while GPT 5.2 and friends are just code agents. The performance, ability to handle long term tasks, all of that was basically similar but the harness oriented the model to reason, shell out sub-agents, write scratch code, add console logs, all the sorts of things that 1. seem like science fiction, and 2. improve output a little. Then from fall of last year to no you don&#x27;t have developers saying &quot;look what I made with LLMs&quot; or &quot;Look what I made with AI&quot; but &quot;Look what I did with Claude&quot;. There are not very many blog posts out there about the future of software being re-written due to GPT 5.2 getting autocompaction, but that same feature spawned thousands of &quot;oh shit!&quot; posts in Claude.<p>That&#x27;s not a more defensible moat than name recognition + small N for customers. It&#x27;s a scarier position because if someone else figures out how to deliver the same result (Opus +  sonnet + Haiku in a managed ensemble) in a way that was sharp and viral, the same thing they did to OpenAI could happen to them. They still supply the compute but the fact that anyone gives a shit about them is their harness makes it look like more and better code is being written. If that&#x27;s your situation, you gently write the OpenClaw guy, you threaten to cut off and sue OpenCode for using subscription sign-in. You don&#x27;t do those things because of a numerator&#x2F;denominator problem with token cost and monthly fees. You do it because someone using your models in a better harness is a clear brand threat.","created_at":"2026-03-19T23:26:18Z","created_at_i":1773962778,"objectID":"47447912","parent_id":47446956,"story_id":47444748,"story_title":"Anthropic takes legal action against OpenCode","story_url":"https://github.com/anomalyco/opencode/pull/18186","updated_at":"2026-03-21T19:41:46Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"bluGill"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["llm","cost","optimization"],"value":"Facebook had talks already years ago (10+) - nobody was allowed to share real numbers, but several facebook employed where allowed to share that the company has measured savings from <em>optimizations</em>.  Reading between the lines, a 0.1% efficiency improvement to some parts of Facebook would save them $100,000 a month (again real numbers were never publicly shared so there is a range - it can't be less than $20,000), and so they had teams of people whose job it was to find those improvements.<p>Most of the savings seemed to come from HVAC <em>costs</em>, followed by buying less computers and in turn less data centers.  I'm sure these days saving memory is also a big deal but it doesn't seem to have been then.<p>The above was already the case 10 years ago, so <em>LLMs</em> are at most another factor added on."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Meta\u2019s renewed commitment to jemalloc"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://engineering.fb.com/2026/03/02/data-infrastructure/investing-in-infrastructure-metas-renewed-commitment-to-jemalloc/"}},"_tags":["comment","author_bluGill","story_47402640"],"author":"bluGill","children":[47405126,47405642,47404732,47404761],"comment_text":"Facebook had talks already years ago (10+) - nobody was allowed to share real numbers, but several facebook employed where allowed to share that the company has measured savings from optimizations.  Reading between the lines, a 0.1% efficiency improvement to some parts of Facebook would save them $100,000 a month (again real numbers were never publicly shared so there is a range - it can&#x27;t be less than $20,000), and so they had teams of people whose job it was to find those improvements.<p>Most of the savings seemed to come from HVAC costs, followed by buying less computers and in turn less data centers.  I&#x27;m sure these days saving memory is also a big deal but it doesn&#x27;t seem to have been then.<p>The above was already the case 10 years ago, so LLMs are at most another factor added on.","created_at":"2026-03-16T19:57:39Z","created_at_i":1773691059,"objectID":"47404027","parent_id":47403015,"story_id":47402640,"story_title":"Meta\u2019s renewed commitment to jemalloc","story_url":"https://engineering.fb.com/2026/03/02/data-infrastructure/investing-in-infrastructure-metas-renewed-commitment-to-jemalloc/","updated_at":"2026-03-24T23:08:43Z"}],"hitsPerPage":20,"nbHits":291,"nbPages":15,"page":0,"params":"query=LLM+cost+optimization&advancedSyntax=true&analyticsTags=backend","processingTimeMS":17,"processingTimingsMS":{"_request":{"roundTrip":21},"afterFetch":{"format":{"highlighting":2,"total":2}},"fetch":{"query":10,"scanning":5,"total":16},"total":17},"query":"LLM cost optimization","serverTimeMS":19}