{"exhaustive":{"nbHits":false,"typo":false},"exhaustiveNbHits":false,"exhaustiveTypo":false,"hits":[{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"drakenot"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["moe","mixture","of","experts","inference"],"value":"(Summary from Reddit)<p>- fp8 instead <em>of</em> fp32 precision training = 75% less memory<p>- multi-token prediction to vastly speed up token output<p>- <em>Mixture</em> <em>of</em> <em>Experts</em> (<em>MoE</em>) so that <em>inference</em> only uses parts <em>of</em> the model not the - entire model (~37B active at a time, not the entire 671B), increases efficiency<p>- PTX (basically low-level assembly code) hacking in old Nvidia GPUs to pump out as much performance from their old H800 GPUs as possible<p>Then, the big innovation <em>of</em> R1 and R1-Zero was finding a way to utilize reinforcement learning within their LLM training."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Why OpenAI's $157B valuation misreads AI's future (Oct 2024)"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://foundationcapital.com/why-openais-157b-valuation-misreads-ais-future/"}},"_tags":["comment","author_drakenot","story_42847825"],"author":"drakenot","children":[42849741],"comment_text":"(Summary from Reddit)<p>- fp8 instead of fp32 precision training = 75% less memory<p>- multi-token prediction to vastly speed up token output<p>- Mixture of Experts (MoE) so that inference only uses parts of the model not the - entire model (~37B active at a time, not the entire 671B), increases efficiency<p>- PTX (basically low-level assembly code) hacking in old Nvidia GPUs to pump out as much performance from their old H800 GPUs as possible<p>Then, the big innovation of R1 and R1-Zero was finding a way to utilize reinforcement learning within their LLM training.","created_at":"2025-01-28T05:36:33Z","created_at_i":1738042593,"objectID":"42849195","parent_id":42849151,"story_id":42847825,"story_title":"Why OpenAI's $157B valuation misreads AI's future (Oct 2024)","story_url":"https://foundationcapital.com/why-openais-157b-valuation-misreads-ais-future/","updated_at":"2025-01-30T12:36:43Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"luc4sdreyer"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["moe","mixture","of","experts","inference"],"value":"I'm not aware <em>of</em> anything concrete by OpenAI, but others have offered possible explanations.<p>One idea is that the cause is batched <em>inference</em> in sparse <em>MoE</em> (<em>mixture</em> <em>of</em> <em>experts</em>) models.<p><a href=\"https://152334h.github.io/blog/non-determinism-in-gpt-4/\" rel=\"nofollow noreferrer\">https://152334h.github.io/blog/non-determinism-in-gpt-4/</a><p>HN discussion: <a href=\"https://news.ycombinator.com/item?id=37006224\">https://news.ycombinator.com/item?id=37006224</a>"},"story_title":{"matchLevel":"none","matchedWords":[],"value":"GPT Unicorn has drawn a unicorn"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://gpt-unicorn.adamkdean.co.uk/"}},"_tags":["comment","author_luc4sdreyer","story_37073701"],"author":"luc4sdreyer","children":[37074757],"comment_text":"I&#x27;m not aware of anything concrete by OpenAI, but others have offered possible explanations.<p>One idea is that the cause is batched inference in sparse MoE (mixture of experts) models.<p><a href=\"https:&#x2F;&#x2F;152334h.github.io&#x2F;blog&#x2F;non-determinism-in-gpt-4&#x2F;\" rel=\"nofollow noreferrer\">https:&#x2F;&#x2F;152334h.github.io&#x2F;blog&#x2F;non-determinism-in-gpt-4&#x2F;</a><p>HN discussion: <a href=\"https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=37006224\">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=37006224</a>","created_at":"2023-08-10T10:39:12Z","created_at_i":1691663952,"objectID":"37074331","parent_id":37074264,"story_id":37073701,"story_title":"GPT Unicorn has drawn a unicorn","story_url":"https://gpt-unicorn.adamkdean.co.uk/","updated_at":"2024-09-20T14:52:37Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"liuwei"},"story_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["moe","mixture","of","experts","inference"],"value":"Hey HN! I built WanVideo (wanvideo.tv) to make the powerful Wan 2.2 video generation model accessible to everyone.<p>## What is Wan 2.2?<p>Wan 2.2 is an advanced AI video generation model that uses an optimized <em>MoE</em> (<em>Mixture</em> <em>of</em> <em>Experts</em>) architecture to create professional-quality videos. Key capabilities:<p>- *Cinema-Grade Aesthetics*: Professional lighting, smooth camera movements, and cinematic effects without any filming equipment\n- *Natural Motion Dynamics*: Realistic human movements, accurate physics simulation, and believable character expressions\n- *Multi-Modal Input*: Generate from text descriptions or animate existing images\n- *Lightning Fast*: 5-10 minute generation (vs hours with other models)<p>## Why We Chose Wan 2.2<p>After testing multiple video generation models, Wan 2.2 stood out for its balance <em>of</em> quality and efficiency. The <em>MoE</em> architecture allows it to specialize different aspects <em>of</em> video generation (motion, lighting, scene composition) while maintaining fast <em>inference</em> times.<p>## Technical Implementation<p>- Built on Next.js 14 + Supabase + Cloudflare Pages\n- Provider abstraction layer supporting Wan 2.2 via Kie.ai\n- Credit-based billing ($0.25 = 10 credits)\n- Webhook-driven async processing with real-time updates<p>## Real Results<p>Users are creating marketing videos, educational content, and social media clips that previously required $1000s in production costs. Every video includes full commercial rights with no watermarks.<p>Try it at [wanvideo.tv](<a href=\"https://wanvideo.tv\" rel=\"nofollow\">https://wanvideo.tv</a>) - free credits for new users.<p>Would love feedback on the UX and ideas for making AI video generation more accessible to developers.<p>Thanks!"},"title":{"matchLevel":"none","matchedWords":[],"value":"Show HN: WanVideo \u2013 Wan 2.2 AI Video Generation"},"url":{"matchLevel":"none","matchedWords":[],"value":"https://www.wanvideo.tv"}},"_tags":["story","author_liuwei","story_44754638","show_hn"],"author":"liuwei","created_at":"2025-08-01T09:29:49Z","created_at_i":1754040589,"num_comments":0,"objectID":"44754638","points":1,"story_id":44754638,"story_text":"Hey HN! I built WanVideo (wanvideo.tv) to make the powerful Wan 2.2 video generation model accessible to everyone.<p>## What is Wan 2.2?<p>Wan 2.2 is an advanced AI video generation model that uses an optimized MoE (Mixture of Experts) architecture to create professional-quality videos. Key capabilities:<p>- *Cinema-Grade Aesthetics*: Professional lighting, smooth camera movements, and cinematic effects without any filming equipment\n- *Natural Motion Dynamics*: Realistic human movements, accurate physics simulation, and believable character expressions\n- *Multi-Modal Input*: Generate from text descriptions or animate existing images\n- *Lightning Fast*: 5-10 minute generation (vs hours with other models)<p>## Why We Chose Wan 2.2<p>After testing multiple video generation models, Wan 2.2 stood out for its balance of quality and efficiency. The MoE architecture allows it to specialize different aspects of video generation (motion, lighting, scene composition) while maintaining fast inference times.<p>## Technical Implementation<p>- Built on Next.js 14 + Supabase + Cloudflare Pages\n- Provider abstraction layer supporting Wan 2.2 via Kie.ai\n- Credit-based billing ($0.25 = 10 credits)\n- Webhook-driven async processing with real-time updates<p>## Real Results<p>Users are creating marketing videos, educational content, and social media clips that previously required $1000s in production costs. Every video includes full commercial rights with no watermarks.<p>Try it at [wanvideo.tv](<a href=\"https:&#x2F;&#x2F;wanvideo.tv\" rel=\"nofollow\">https:&#x2F;&#x2F;wanvideo.tv</a>) - free credits for new users.<p>Would love feedback on the UX and ideas for making AI video generation more accessible to developers.<p>Thanks!","title":"Show HN: WanVideo \u2013 Wan 2.2 AI Video Generation","updated_at":"2025-08-01T09:35:02Z","url":"https://www.wanvideo.tv"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"Koffiepoeder"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["moe","mixture","of","experts","inference"],"value":"The A3B part in the name stands for `Active 3B`, so for the <em>inference</em> jobs a core 3B is used in conjunction with another subpart <em>of</em> the model, based on the task (<em>MoE</em>, <em>mixture</em> <em>of</em> <em>experts</em>). If you use these models mostly for related/similar tasks, that means you can make do with a lot less than the 35B params in active RAM. These models are therefore also sometimes called sparse models."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Unsloth Dynamic 2.0 GGUFs"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs"}},"_tags":["comment","author_Koffiepoeder","story_47192505"],"author":"Koffiepoeder","comment_text":"The A3B part in the name stands for `Active 3B`, so for the inference jobs a core 3B is used in conjunction with another subpart of the model, based on the task (MoE, mixture of experts). If you use these models mostly for related&#x2F;similar tasks, that means you can make do with a lot less than the 35B params in active RAM. These models are therefore also sometimes called sparse models.","created_at":"2026-02-28T11:44:21Z","created_at_i":1772279061,"objectID":"47194016","parent_id":47193019,"story_id":47192505,"story_title":"Unsloth Dynamic 2.0 GGUFs","story_url":"https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs","updated_at":"2026-03-05T23:39:10Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"dworks"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["moe","mixture","of","experts","inference"],"value":"'The Qwen 3.5 series 397B-A17B is a native vision-language model based on a hybrid architecture design. By integrating linear attention mechanisms with sparse <em>Mixture</em>-<em>of</em>-<em>Experts</em> (<em>MoE</em>), it achieves significantly higher <em>inference</em> efficiency. It demonstrates exceptional performance\u2014comparable to current state-<em>of</em>-the-art frontier models\u2014across a wide range <em>of</em> tasks, including language understanding, logical reasoning, code generation, agentic tasks, image and video understanding, and Graphical User Interfaces (GUI). Furthermore, it possesses robust code generation and agent capabilities, showing excellent generalization across various agent-based scenarios<p>&quot;The Qwen3.5 Native Vision-Language Series Plus model is built on a hybrid architecture that integrates linear attention mechanisms with sparse <em>Mixture</em>-<em>of</em>-<em>Experts</em> (<em>MoE</em>), achieving significantly higher <em>inference</em> efficiency. Across various task evaluations, the 3.5 series demonstrates exceptional performance comparable to current state-<em>of</em>-the-art frontier models. Compared to the Qwen 3 series, this model represents a massive leap forward in both text-only and multimodal capabilities.&quot;'"},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Qwen 3.5 397B and Qwen 3.5 Plus released"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://chat.qwen.ai/"}},"_tags":["comment","author_dworks","story_47032452"],"author":"dworks","children":[47032623],"comment_text":"&#x27;The Qwen 3.5 series 397B-A17B is a native vision-language model based on a hybrid architecture design. By integrating linear attention mechanisms with sparse Mixture-of-Experts (MoE), it achieves significantly higher inference efficiency. It demonstrates exceptional performance\u2014comparable to current state-of-the-art frontier models\u2014across a wide range of tasks, including language understanding, logical reasoning, code generation, agentic tasks, image and video understanding, and Graphical User Interfaces (GUI). Furthermore, it possesses robust code generation and agent capabilities, showing excellent generalization across various agent-based scenarios<p>&quot;The Qwen3.5 Native Vision-Language Series Plus model is built on a hybrid architecture that integrates linear attention mechanisms with sparse Mixture-of-Experts (MoE), achieving significantly higher inference efficiency. Across various task evaluations, the 3.5 series demonstrates exceptional performance comparable to current state-of-the-art frontier models. Compared to the Qwen 3 series, this model represents a massive leap forward in both text-only and multimodal capabilities.&quot;&#x27;","created_at":"2026-02-16T08:34:28Z","created_at_i":1771230868,"objectID":"47032453","parent_id":47032452,"story_id":47032452,"story_title":"Qwen 3.5 397B and Qwen 3.5 Plus released","story_url":"https://chat.qwen.ai/","updated_at":"2026-03-05T23:33:34Z"}],"hitsPerPage":5,"nbHits":51,"nbPages":11,"page":0,"params":"query=moe+mixture+of+experts+inference&hitsPerPage=5&advancedSyntax=true&analyticsTags=backend","processingTimeMS":31,"processingTimingsMS":{"_request":{"roundTrip":22},"afterFetch":{"merge":{"mergeLoop":{"prepareNextHit":1,"total":1},"total":1},"total":1},"fetch":{"query":16,"scanning":12,"total":29},"total":31},"query":"moe mixture of experts inference","serverTimeMS":32}