{"exhaustive":{"nbHits":false,"typo":false},"exhaustiveNbHits":false,"exhaustiveTypo":false,"hits":[{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"zamadatix"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["deepseek","v3","moe"],"value":"<em>DeepSeek</em> <em>V3</em>/R1 are <em>MoE</em> with 256 experts per layer, actively using 1 shared expert and 8 routed experts per layer <a href=\"https://arxiv.org/html/2412.19437v1#S2:~:text=with%20MoE%20layers.-,Each%20MoE%20layer,-consists%20of%201\" rel=\"nofollow\">https://arxiv.org/html/2412.19437v1#S2:~:text=with%20MoE%20l...</a> so you can't just take the active parameters and assume that's close to the size of a single expert (ignoring experts are per layer anyways and that there are still dense parameters to count).<p>Despite connotations of specialized intelligences the term &quot;expert&quot; provokes it's really mostly about scalability/efficiency of running large models. By splitting up sections of the layers and not activating all of them for each pass a single query takes less bandwidth, can be distributed across compute, and can be parallelized with other queries on the same nodes."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Meta got caught gaming AI benchmarks"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://www.theverge.com/meta/645012/meta-llama-4-maverick-benchmarks-gaming"}},"_tags":["comment","author_zamadatix","story_43620452"],"author":"zamadatix","comment_text":"DeepSeek V3&#x2F;R1 are MoE with 256 experts per layer, actively using 1 shared expert and 8 routed experts per layer <a href=\"https:&#x2F;&#x2F;arxiv.org&#x2F;html&#x2F;2412.19437v1#S2:~:text=with%20MoE%20layers.-,Each%20MoE%20layer,-consists%20of%201\" rel=\"nofollow\">https:&#x2F;&#x2F;arxiv.org&#x2F;html&#x2F;2412.19437v1#S2:~:text=with%20MoE%20l...</a> so you can&#x27;t just take the active parameters and assume that&#x27;s close to the size of a single expert (ignoring experts are per layer anyways and that there are still dense parameters to count).<p>Despite connotations of specialized intelligences the term &quot;expert&quot; provokes it&#x27;s really mostly about scalability&#x2F;efficiency of running large models. By splitting up sections of the layers and not activating all of them for each pass a single query takes less bandwidth, can be distributed across compute, and can be parallelized with other queries on the same nodes.","created_at":"2025-04-08T17:22:10Z","created_at_i":1744132930,"objectID":"43624206","parent_id":43623956,"story_id":43620452,"story_title":"Meta got caught gaming AI benchmarks","story_url":"https://www.theverge.com/meta/645012/meta-llama-4-maverick-benchmarks-gaming","updated_at":"2025-04-09T07:51:45Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"Palmik"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["deepseek","v3","moe"],"value":"&gt; All of this is to say that <em>DeepSeek</em>-<em>V3</em> is not a unique breakthrough or something that fundamentally changes the economics of LLM\u2019s; it\u2019s an expected point on an ongoing cost reduction curve. What\u2019s different this time is that the company that was first to demonstrate the expected cost reductions was Chinese.<p>Says the CEO whose product [1] costs 15-50x times more. (This is not just the <em>DeepSeek</em>'s API, but also 3p providers hosting the same model)<p>&gt; <em>DeepSeek</em> does not &quot;do for $6M5 what cost US AI companies billions&quot;. I can only speak for Anthropic, but Claude 3.5 Sonnet is a mid-sized model that cost a few $10M's to train<p>Ok, that's still at least 3-10x cost reduction (assuming &quot;a few $10M&quot; lowerbound of $20M). And for a model that he later implies is 2x larger than Sonnet. So that's 6-10x efficiency improvement. Nice!<p>&gt; Since <em>DeepSeek</em>-<em>V3</em> is worse than those US frontier models \u2014 let\u2019s say by ~2x on the scaling curve.<p>What curve? Does he mean the simplistic performance / model params curve? That does not take into account that <em>DeepSeek</em> <em>v3</em> is a <em>MoE</em> (can't compare <em>MoE</em> and dense param # in a naive way), nor the other architecture changes (KV compression, etc.).<p>Also, if Sonnet 3.5 is 2x smaller, then why is inference 15-50x more expensive than <em>DeepSeek</em> <em>v3</em>'s? Does Anthropic not have good GPU engineers? Are they just running at insanely high margins? As a consumer I don't care how big your model is behind the scenes. I care about API costs or inference efficiency when hosting the model myself.<p>[1] Product that is mostly comparable and in some ways quite ahead."},"story_title":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["deepseek"],"value":"On <em>DeepSeek</em> and export controls"},"story_url":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["deepseek"],"value":"https://darioamodei.com/on-<em>deepseek</em>-and-export-controls"}},"_tags":["comment","author_Palmik","story_42866905"],"author":"Palmik","children":[42867975],"comment_text":"&gt; All of this is to say that DeepSeek-V3 is not a unique breakthrough or something that fundamentally changes the economics of LLM\u2019s; it\u2019s an expected point on an ongoing cost reduction curve. What\u2019s different this time is that the company that was first to demonstrate the expected cost reductions was Chinese.<p>Says the CEO whose product [1] costs 15-50x times more. (This is not just the DeepSeek&#x27;s API, but also 3p providers hosting the same model)<p>&gt; DeepSeek does not &quot;do for $6M5 what cost US AI companies billions&quot;. I can only speak for Anthropic, but Claude 3.5 Sonnet is a mid-sized model that cost a few $10M&#x27;s to train<p>Ok, that&#x27;s still at least 3-10x cost reduction (assuming &quot;a few $10M&quot; lowerbound of $20M). And for a model that he later implies is 2x larger than Sonnet. So that&#x27;s 6-10x efficiency improvement. Nice!<p>&gt; Since DeepSeek-V3 is worse than those US frontier models \u2014 let\u2019s say by ~2x on the scaling curve.<p>What curve? Does he mean the simplistic performance &#x2F; model params curve? That does not take into account that DeepSeek v3 is a MoE (can&#x27;t compare MoE and dense param # in a naive way), nor the other architecture changes (KV compression, etc.).<p>Also, if Sonnet 3.5 is 2x smaller, then why is inference 15-50x more expensive than DeepSeek v3&#x27;s? Does Anthropic not have good GPU engineers? Are they just running at insanely high margins? As a consumer I don&#x27;t care how big your model is behind the scenes. I care about API costs or inference efficiency when hosting the model myself.<p>[1] Product that is mostly comparable and in some ways quite ahead.","created_at":"2025-01-29T17:18:19Z","created_at_i":1738171099,"objectID":"42867960","parent_id":42866905,"story_id":42866905,"story_title":"On DeepSeek and export controls","story_url":"https://darioamodei.com/on-deepseek-and-export-controls","updated_at":"2025-01-30T11:31:57Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"talldayo"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["deepseek","v3","moe"],"value":"<em>Deepseek</em> <em>v3</em> is an <em>MoE</em> model - not every parameter is activated at the same time."},"story_title":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["deepseek","v3"],"value":"Ask HN: <em>Deepseek</em> <em>v3</em> how can I check on local computer?"}},"_tags":["comment","author_talldayo","story_42532029"],"author":"talldayo","children":[42533762],"comment_text":"Deepseek v3 is an MoE model - not every parameter is activated at the same time.","created_at":"2024-12-28T18:56:54Z","created_at_i":1735412214,"objectID":"42533463","parent_id":42532424,"story_id":42532029,"story_title":"Ask HN: Deepseek v3 how can I check on local computer?","updated_at":"2024-12-28T19:32:07Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"fossa1"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["deepseek","v3","moe"],"value":"It\u2019s ironic: for years the open-source community was trying to match GPT-3 (175B dense) with 30B\u201370B models + RLHF + synthetic data\u2014and the performance gap persisted.<p>Turns out, size really did matter, at least at the base model level. Only with the release of truly massive dense (405B) or high-activation <em>MoE</em> models (<em>DeepSeek</em> <em>V3</em>, DBRX, etc) did we start seeing GPT-4-level reasoning emerge outside closed labs."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"How large are large language models?"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://gist.github.com/rain-1/cf0419958250d15893d8873682492c3e"}},"_tags":["comment","author_fossa1","story_44442072"],"author":"fossa1","comment_text":"It\u2019s ironic: for years the open-source community was trying to match GPT-3 (175B dense) with 30B\u201370B models + RLHF + synthetic data\u2014and the performance gap persisted.<p>Turns out, size really did matter, at least at the base model level. Only with the release of truly massive dense (405B) or high-activation MoE models (DeepSeek V3, DBRX, etc) did we start seeing GPT-4-level reasoning emerge outside closed labs.","created_at":"2025-07-02T12:56:56Z","created_at_i":1751461016,"objectID":"44443183","parent_id":44442072,"story_id":44442072,"story_title":"How large are large language models?","story_url":"https://gist.github.com/rain-1/cf0419958250d15893d8873682492c3e","updated_at":"2025-07-03T06:14:24Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"andrewgross"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["deepseek","v3","moe"],"value":"&gt; The beauty of the <em>MOE</em> model approach is that you can decompose the big model into a collection of smaller models that each know different, non-overlapping (at least fully) pieces of knowledge.<p>I was under the impression that this was not how <em>MoE</em> models work. They are not a collection of independent models, but instead a way of routing to a subset of active parameters at each layer.  There is no &quot;expert&quot; that is loaded or unloaded per question. All of the weights are loaded in VRAM, its just a matter of which are actually loaded to the registers for calculation.  As far as I could tell from the <em>Deepseek</em> <em>v3</em>/v2 papers, their <em>MoE</em> approach follows this instead of being an explicit collection of experts. If thats the case, theres no VRAM saving to be had using an <em>MOE</em> nor an ability to extract the weights of the expert to run locally (aside from distillation or similar).<p>If there is someone more versed on the construction of <em>MoE</em> architectures I would love some help understanding what I missed here."},"story_title":{"fullyHighlighted":false,"matchLevel":"partial","matchedWords":["deepseek"],"value":"The impact of competition and <em>DeepSeek</em> on Nvidia"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://youtubetranscriptoptimizer.com/blog/05_the_short_case_for_nvda"}},"_tags":["comment","author_andrewgross","story_42822162"],"author":"andrewgross","children":[42836829],"comment_text":"&gt; The beauty of the MOE model approach is that you can decompose the big model into a collection of smaller models that each know different, non-overlapping (at least fully) pieces of knowledge.<p>I was under the impression that this was not how MoE models work. They are not a collection of independent models, but instead a way of routing to a subset of active parameters at each layer.  There is no &quot;expert&quot; that is loaded or unloaded per question. All of the weights are loaded in VRAM, its just a matter of which are actually loaded to the registers for calculation.  As far as I could tell from the Deepseek v3&#x2F;v2 papers, their MoE approach follows this instead of being an explicit collection of experts. If thats the case, theres no VRAM saving to be had using an MOE nor an ability to extract the weights of the expert to run locally (aside from distillation or similar).<p>If there is someone more versed on the construction of MoE architectures I would love some help understanding what I missed here.","created_at":"2025-01-27T01:42:20Z","created_at_i":1737942140,"objectID":"42836422","parent_id":42822162,"story_id":42822162,"story_title":"The impact of competition and DeepSeek on Nvidia","story_url":"https://youtubetranscriptoptimizer.com/blog/05_the_short_case_for_nvda","updated_at":"2025-01-28T01:17:42Z"}],"hitsPerPage":5,"nbHits":602,"nbPages":121,"page":0,"params":"query=deepseek+v3+moe&hitsPerPage=5&advancedSyntax=true&analyticsTags=backend","processingTimeMS":25,"processingTimingsMS":{"_request":{"queue":44,"roundTrip":25},"afterFetch":{"merge":{"mergeLoop":{"prepareNextHit":1,"total":1},"total":5},"total":6},"fetch":{"query":9,"scanning":8,"total":18},"total":25},"query":"deepseek v3 moe","serverTimeMS":70}