{"exhaustive":{"nbHits":false,"typo":false},"exhaustiveNbHits":false,"exhaustiveTypo":false,"hits":[{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"rmason"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"<em>Crawl4AI</em> is an open-source, LLM-friendly web crawler and scraper"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"https://github.com/unclecode/<em>crawl4ai</em>"}},"_tags":["story","author_rmason","story_43827993"],"author":"rmason","children":[43827998,43828356],"created_at":"2025-04-29T01:49:19Z","created_at_i":1745891359,"num_comments":2,"objectID":"43827993","points":7,"story_id":43827993,"title":"Crawl4AI is an open-source, LLM-friendly web crawler and scraper","updated_at":"2025-05-02T13:07:42Z","url":"https://github.com/unclecode/crawl4ai"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"ProbeCraft"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"<em>Crawl4AI</em>: Open-Source Web Crawler for Seamless AI Data Scraping"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"https://github.com/unclecode/<em>crawl4ai</em>"}},"_tags":["story","author_ProbeCraft","story_41689534"],"author":"ProbeCraft","created_at":"2024-09-29T18:58:44Z","created_at_i":1727636324,"num_comments":0,"objectID":"41689534","points":6,"story_id":41689534,"title":"Crawl4AI: Open-Source Web Crawler for Seamless AI Data Scraping","updated_at":"2025-03-10T12:03:34Z","url":"https://github.com/unclecode/crawl4ai"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"hbcondo714"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"<em>crawl4ai</em>: The Adaptive Intelligence Update"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"https://github.com/unclecode/<em>crawl4ai</em>/releases/tag/v0.7.0"}},"_tags":["story","author_hbcondo714","story_44543172"],"author":"hbcondo714","created_at":"2025-07-12T16:35:06Z","created_at_i":1752338106,"num_comments":0,"objectID":"44543172","points":4,"story_id":44543172,"title":"crawl4ai: The Adaptive Intelligence Update","updated_at":"2025-07-12T19:57:16Z","url":"https://github.com/unclecode/crawl4ai/releases/tag/v0.7.0"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"bjornroberg"},"title":{"fullyHighlighted":true,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"<em>Crawl4AI</em>"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"https://docs.<em>crawl4ai</em>.com/"}},"_tags":["story","author_bjornroberg","story_47481953"],"author":"bjornroberg","children":[47482012],"created_at":"2026-03-22T20:46:00Z","created_at_i":1774212360,"num_comments":1,"objectID":"47481953","points":2,"story_id":47481953,"title":"Crawl4AI","updated_at":"2026-03-22T22:31:33Z","url":"https://docs.crawl4ai.com/"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"medivhX"},"story_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"<em>Crawl4AI</em> Unofficial Guides: Master High-Performance AI Web Scraping<p>A comprehensive, community-driven resource dedicated to <em>Crawl4AI</em>, the leading open-source LLM-friendly web crawler. This site provides production-ready integration guides for Cursor MCP, n8n automation, and Docker deployments. It features detailed performance comparisons against proprietary alternatives like Firecrawl, helping developers build scalable, self-hosted data pipelines for AI agents and RAG systems."},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"Show HN: <em>Crawl4AI</em> \u2013 Open-Source Web Crawler for LLMs and Structured Data"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"https://<em>crawl4ai</em>.dev"}},"_tags":["story","author_medivhX","story_46646798","show_hn"],"author":"medivhX","children":[46646807],"created_at":"2026-01-16T14:39:50Z","created_at_i":1768574390,"num_comments":1,"objectID":"46646798","points":2,"story_id":46646798,"story_text":"Crawl4AI Unofficial Guides: Master High-Performance AI Web Scraping<p>A comprehensive, community-driven resource dedicated to Crawl4AI, the leading open-source LLM-friendly web crawler. This site provides production-ready integration guides for Cursor MCP, n8n automation, and Docker deployments. It features detailed performance comparisons against proprietary alternatives like Firecrawl, helping developers build scalable, self-hosted data pipelines for AI agents and RAG systems.","title":"Show HN: Crawl4AI \u2013 Open-Source Web Crawler for LLMs and Structured Data","updated_at":"2026-03-05T23:25:54Z","url":"https://crawl4ai.dev"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"synergy20"},"title":{"fullyHighlighted":true,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"<em>Crawl4AI</em>"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"https://github.com/unclecode/<em>crawl4ai</em>"}},"_tags":["story","author_synergy20","story_41747108"],"author":"synergy20","created_at":"2024-10-05T01:46:45Z","created_at_i":1728092805,"num_comments":0,"objectID":"41747108","points":2,"story_id":41747108,"title":"Crawl4AI","updated_at":"2024-10-05T05:52:39Z","url":"https://github.com/unclecode/crawl4ai"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"zoooey"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"<em>Crawl4AI</em>: Open-Source LLM Friendly Web Crawler and Scraper"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"https://github.com/unclecode/<em>crawl4ai</em>"}},"_tags":["story","author_zoooey","story_45963180"],"author":"zoooey","created_at":"2025-11-18T09:49:52Z","created_at_i":1763459392,"num_comments":0,"objectID":"45963180","points":1,"story_id":45963180,"title":"Crawl4AI: Open-Source LLM Friendly Web Crawler and Scraper","updated_at":"2026-03-05T23:01:55Z","url":"https://github.com/unclecode/crawl4ai"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"basic_banana"},"title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"<em>Crawl4ai</em> \u2013 open-source web crawler for your AI agent"},"url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"https://github.com/unclecode/<em>crawl4ai</em>"}},"_tags":["story","author_basic_banana","story_43048990"],"author":"basic_banana","created_at":"2025-02-14T15:00:35Z","created_at_i":1739545235,"num_comments":0,"objectID":"43048990","points":1,"story_id":43048990,"title":"Crawl4ai \u2013 open-source web crawler for your AI agent","updated_at":"2025-02-14T15:03:19Z","url":"https://github.com/unclecode/crawl4ai"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"schoblaska"},"story_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"Hey HN! Over the weekend (leaning heavily on Opus 4.5) I wrote Jargon - an AI-managed zettelkasten that reads articles, papers, and YouTube videos, extracts the key ideas, and automatically links related concepts together.<p>Demo video: <a href=\"https://youtu.be/W7ejMqZ6EUQ\" rel=\"nofollow\">https://youtu.be/W7ejMqZ6EUQ</a><p>Repo: <a href=\"https://github.com/schoblaska/jargon\" rel=\"nofollow\">https://github.com/schoblaska/jargon</a><p>You can paste an article, PDF link, or YouTube video to parse, or ask questions directly and it'll find its own content. Sources get summarized, broken into insight cards, and embedded for semantic search. Similar ideas automatically cluster together. Each insight can spawn research threads - questions that trigger web searches to pull in related content, which flows through the same pipeline.<p>You can explore the graph of linked ideas directly, or ask questions and it'll RAG over your whole library plus fresh web results.<p>Jargon uses Rails + Hotwire with Falcon for async processing, pgvector for embeddings, Exa for neural web search, <em>crawl4ai</em> as a fallback scraper, and pdftotext for academic papers."},"title":{"matchLevel":"none","matchedWords":[],"value":"Show HN: An AI zettelkasten that extracts ideas from articles, videos, and PDFs"},"url":{"matchLevel":"none","matchedWords":[],"value":"https://github.com/schoblaska/jargon"}},"_tags":["story","author_schoblaska","story_46110897","show_hn"],"author":"schoblaska","children":[46114054,46114344,46114495,46116921,46118108,46119359,46130474,46172230],"created_at":"2025-12-01T18:20:46Z","created_at_i":1764613246,"num_comments":9,"objectID":"46110897","points":38,"story_id":46110897,"story_text":"Hey HN! Over the weekend (leaning heavily on Opus 4.5) I wrote Jargon - an AI-managed zettelkasten that reads articles, papers, and YouTube videos, extracts the key ideas, and automatically links related concepts together.<p>Demo video: <a href=\"https:&#x2F;&#x2F;youtu.be&#x2F;W7ejMqZ6EUQ\" rel=\"nofollow\">https:&#x2F;&#x2F;youtu.be&#x2F;W7ejMqZ6EUQ</a><p>Repo: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;schoblaska&#x2F;jargon\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;schoblaska&#x2F;jargon</a><p>You can paste an article, PDF link, or YouTube video to parse, or ask questions directly and it&#x27;ll find its own content. Sources get summarized, broken into insight cards, and embedded for semantic search. Similar ideas automatically cluster together. Each insight can spawn research threads - questions that trigger web searches to pull in related content, which flows through the same pipeline.<p>You can explore the graph of linked ideas directly, or ask questions and it&#x27;ll RAG over your whole library plus fresh web results.<p>Jargon uses Rails + Hotwire with Falcon for async processing, pgvector for embeddings, Exa for neural web search, crawl4ai as a fallback scraper, and pdftotext for academic papers.","title":"Show HN: An AI zettelkasten that extracts ideas from articles, videos, and PDFs","updated_at":"2026-03-05T23:04:46Z","url":"https://github.com/schoblaska/jargon"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"mxfeinberg"},"story_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"As a long time data scientist and engineer, I've had to write a couple of quick and dirty scrapers and bots over the years using selenium and more recently playwright. I haven't really been tracking it, but I've also been reading about the <em>crawl4ai</em> project.<p>With the explosion of AI agents, I've been playing around with building agentic scrapers that can simply be given a prompt and a target site and are able to return structured data in a specified format. I've also been playing around with adding in steps that have a different model/step attempt to define the structured format dynamically.<p>However, as with most AI projects, the token consumption can scale pretty aggressively.<p>Has anyone else been working on similar projects? Would people realistically pay $0.025 to $0.03 per request?"},"title":{"matchLevel":"none","matchedWords":[],"value":"Ask HN: Is there a market for agentic scraping tools?"}},"_tags":["story","author_mxfeinberg","story_44474800","ask_hn"],"author":"mxfeinberg","children":[44474909,44475773],"created_at":"2025-07-05T19:07:46Z","created_at_i":1751742466,"num_comments":2,"objectID":"44474800","points":3,"story_id":44474800,"story_text":"As a long time data scientist and engineer, I&#x27;ve had to write a couple of quick and dirty scrapers and bots over the years using selenium and more recently playwright. I haven&#x27;t really been tracking it, but I&#x27;ve also been reading about the crawl4ai project.<p>With the explosion of AI agents, I&#x27;ve been playing around with building agentic scrapers that can simply be given a prompt and a target site and are able to return structured data in a specified format. I&#x27;ve also been playing around with adding in steps that have a different model&#x2F;step attempt to define the structured format dynamically.<p>However, as with most AI projects, the token consumption can scale pretty aggressively.<p>Has anyone else been working on similar projects? Would people realistically pay $0.025 to $0.03 per request?","title":"Ask HN: Is there a market for agentic scraping tools?","updated_at":"2025-07-06T07:57:05Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"dhruv_ahuja"},"story_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"I am curious, what are people and organisations building with web crawlers, especially since now all of them seem to have LLM support for further refining the output from crawled pages.<p>Open source projects like <em>crawl4ai</em>[1] have a crazy amount of stars (~59k). This got me curious.<p><a href=\"https://github.com/unclecode/crawl4ai\" rel=\"nofollow\">https://github.com/unclecode/<em>crawl4ai</em></a>"},"title":{"matchLevel":"none","matchedWords":[],"value":"Ask HN: What do you use (AI based) web crawlers for?"}},"_tags":["story","author_dhruv_ahuja","story_46733548","ask_hn"],"author":"dhruv_ahuja","created_at":"2026-01-23T15:21:22Z","created_at_i":1769181682,"num_comments":0,"objectID":"46733548","points":3,"story_id":46733548,"story_text":"I am curious, what are people and organisations building with web crawlers, especially since now all of them seem to have LLM support for further refining the output from crawled pages.<p>Open source projects like crawl4ai[1] have a crazy amount of stars (~59k). This got me curious.<p><a href=\"https:&#x2F;&#x2F;github.com&#x2F;unclecode&#x2F;crawl4ai\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;unclecode&#x2F;crawl4ai</a>","title":"Ask HN: What do you use (AI based) web crawlers for?","updated_at":"2026-03-05T23:24:06Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"jordan_gibbs"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"HyperResearch is a simple Claude Code skill harness that outperforms every deep research framework.<p>HyperResearch surpasses OpenAI, Google, and NVIDIA's offerings in the agentic search space based on DeepResearch Bench. It's open-source, installable with a single command, and uses your CC subscription, so you don't have to pay for OpenAI or Gemini Pro.<p>It uses a 16-step pipeline that creates a searchable, persistent knowledge store during each session that can be built upon in later searches. I designed it to align with the original user prompt as closely as possible, while incorporating built-in fact-checking, adversarial review, and breadth and depth-investigating capabilities.<p>This is a generalized framework; you can use it for any large-scale research task, from developing a trading strategy for a specific stock to analyzing competitor products to understanding the current state of the art in LLM architecture.<p>It uses <em>crawl4ai</em> (an open-source LLM search tool) to capture a wider breadth of information than the standard websearch tool is capable of. You can also configure authenticated sessions, meaning that LinkedIn, Twitter, etc., are now fair game for agentic search.<p>If anyone wants to port it to Codex, be my guest!"},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Converting Claude Code into the top scoring deep research agent"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://github.com/jordan-gibbs/hyperresearch"}},"_tags":["comment","author_jordan_gibbs","story_47953372"],"author":"jordan_gibbs","comment_text":"HyperResearch is a simple Claude Code skill harness that outperforms every deep research framework.<p>HyperResearch surpasses OpenAI, Google, and NVIDIA&#x27;s offerings in the agentic search space based on DeepResearch Bench. It&#x27;s open-source, installable with a single command, and uses your CC subscription, so you don&#x27;t have to pay for OpenAI or Gemini Pro.<p>It uses a 16-step pipeline that creates a searchable, persistent knowledge store during each session that can be built upon in later searches. I designed it to align with the original user prompt as closely as possible, while incorporating built-in fact-checking, adversarial review, and breadth and depth-investigating capabilities.<p>This is a generalized framework; you can use it for any large-scale research task, from developing a trading strategy for a specific stock to analyzing competitor products to understanding the current state of the art in LLM architecture.<p>It uses crawl4ai (an open-source LLM search tool) to capture a wider breadth of information than the standard websearch tool is capable of. You can also configure authenticated sessions, meaning that LinkedIn, Twitter, etc., are now fair game for agentic search.<p>If anyone wants to port it to Codex, be my guest!","created_at":"2026-04-29T19:39:30Z","created_at_i":1777491570,"objectID":"47953373","parent_id":47953372,"story_id":47953372,"story_title":"Converting Claude Code into the top scoring deep research agent","story_url":"https://github.com/jordan-gibbs/hyperresearch","updated_at":"2026-04-29T19:42:36Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"breed101"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"replaces claude code's webfetch - claude code webfetch hangs indefinitely, does not have timeout. does not pierce dom js-rendered. the <em>crawl4ai</em>, firecrawl tools do not pierce shadow dom. the kiss solution, just select all copy paste the display text. this returns actions available, text offset / limits. great for the agent."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Because Webfetch Sucks"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://github.com/bxxd/mcp-fetch-ux"}},"_tags":["comment","author_breed101","story_47874773"],"author":"breed101","comment_text":"replaces claude code&#x27;s webfetch - claude code webfetch hangs indefinitely, does not have timeout. does not pierce dom js-rendered. the crawl4ai, firecrawl tools do not pierce shadow dom. the kiss solution, just select all copy paste the display text. this returns actions available, text offset &#x2F; limits. great for the agent.","created_at":"2026-04-23T12:07:56Z","created_at_i":1776946076,"objectID":"47874774","parent_id":47874773,"story_id":47874773,"story_title":"Because Webfetch Sucks","story_url":"https://github.com/bxxd/mcp-fetch-ux","updated_at":"2026-04-23T12:10:10Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"bredren"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"Why not just write a skill and script that calls <em>crawl4ai</em> or similar and do this using Claude code?<p>You can store the page as markdown for future sessions, mash the data w other context, you name it.<p>The web Claude is incredibly limited both in capability and workflow integration. Doesn\u2019t matter if you\u2019re dealing with bids from arbor contractors or researching solutions for a DB problem."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Switch to Claude without starting over"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://claude.com/import-memory"}},"_tags":["comment","author_bredren","story_47204571"],"author":"bredren","children":[47209190,47209363],"comment_text":"Why not just write a skill and script that calls crawl4ai or similar and do this using Claude code?<p>You can store the page as markdown for future sessions, mash the data w other context, you name it.<p>The web Claude is incredibly limited both in capability and workflow integration. Doesn\u2019t matter if you\u2019re dealing with bids from arbor contractors or researching solutions for a DB problem.","created_at":"2026-03-01T17:52:50Z","created_at_i":1772387570,"objectID":"47208961","parent_id":47206490,"story_id":47204571,"story_title":"Switch to Claude without starting over","story_url":"https://claude.com/import-memory","updated_at":"2026-03-05T23:39:06Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"Roark66"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"Cloudflare is ridiculous. I can't even open it using Cromite (privacy enhanced, but not over the top, android browser).<p>I get:<p>blog.adafruit.com\nVerifying you are human. This may take a few seconds.<p>blog.adafruit.com needs to review the security of your connection before proceeding.<p>And this hangs forever. What difference does it make if I access this site using a browser (blocked anyway) or I asked my LLM to fetch the content? I bet my LLM coukd get it anyway as I'm using basic local scraping with firecrawl for backup. So my LLM if it fails to retrieve using my basic local <em>crawl4ai</em> will use my paid firecrawl api and those guys can scrape EVERYTHING.<p>I do not understand why do you (as a site owner) care? Are these bots generating so much traffic? Can you set it up to serve text only version to them then?"},"story_title":{"matchLevel":"none","matchedWords":[],"value":"New York\u2019s budget bill would require \u201cblocking technology\u201d on all 3D printers"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://blog.adafruit.com/2026/02/03/new-york-wants-to-ctrlaltdelete-your-3d-printer/"}},"_tags":["comment","author_Roark66","story_46872540"],"author":"Roark66","children":[46887624,46887964],"comment_text":"Cloudflare is ridiculous. I can&#x27;t even open it using Cromite (privacy enhanced, but not over the top, android browser).<p>I get:<p>blog.adafruit.com\nVerifying you are human. This may take a few seconds.<p>blog.adafruit.com needs to review the security of your connection before proceeding.<p>And this hangs forever. What difference does it make if I access this site using a browser (blocked anyway) or I asked my LLM to fetch the content? I bet my LLM coukd get it anyway as I&#x27;m using basic local scraping with firecrawl for backup. So my LLM if it fails to retrieve using my basic local crawl4ai will use my paid firecrawl api and those guys can scrape EVERYTHING.<p>I do not understand why do you (as a site owner) care? Are these bots generating so much traffic? Can you set it up to serve text only version to them then?","created_at":"2026-02-04T16:02:21Z","created_at_i":1770220941,"objectID":"46887472","parent_id":46886301,"story_id":46872540,"story_title":"New York\u2019s budget bill would require \u201cblocking technology\u201d on all 3D printers","story_url":"https://blog.adafruit.com/2026/02/03/new-york-wants-to-ctrlaltdelete-your-3d-printer/","updated_at":"2026-06-02T20:39:09Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"medivhX"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"Hi HN,<p><em>Crawl4AI</em> is an amazing open-source library that solves many LLM-scraping headaches, but I found that new developers often struggle with production configurations\u2014specifically how to use it with MCP servers for Cursor, or how to bridge it with automation tools like n8n.<p>I built <em>crawl4ai</em>.dev as a community-driven documentation hub to fill these gaps. It includes:<p>One-click Docker setups for n8n/FastAPI.<p>Production-ready MCP server guides for Cursor &amp; Claude.<p>Cost/performance benchmarks vs proprietary tools like Firecrawl.<p>The goal is to help everyone build self-hosted, affordable AI data pipelines. I'd love to hear your feedback on the guides or what other integrations you'd like to see documented!"},"story_title":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"Show HN: <em>Crawl4AI</em> \u2013 Open-Source Web Crawler for LLMs and Structured Data"},"story_url":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"https://<em>crawl4ai</em>.dev"}},"_tags":["comment","author_medivhX","story_46646798"],"author":"medivhX","comment_text":"Hi HN,<p>Crawl4AI is an amazing open-source library that solves many LLM-scraping headaches, but I found that new developers often struggle with production configurations\u2014specifically how to use it with MCP servers for Cursor, or how to bridge it with automation tools like n8n.<p>I built crawl4ai.dev as a community-driven documentation hub to fill these gaps. It includes:<p>One-click Docker setups for n8n&#x2F;FastAPI.<p>Production-ready MCP server guides for Cursor &amp; Claude.<p>Cost&#x2F;performance benchmarks vs proprietary tools like Firecrawl.<p>The goal is to help everyone build self-hosted, affordable AI data pipelines. I&#x27;d love to hear your feedback on the guides or what other integrations you&#x27;d like to see documented!","created_at":"2026-01-16T14:41:09Z","created_at_i":1768574469,"objectID":"46646807","parent_id":46646798,"story_id":46646798,"story_title":"Show HN: Crawl4AI \u2013 Open-Source Web Crawler for LLMs and Structured Data","story_url":"https://crawl4ai.dev","updated_at":"2026-03-05T23:25:54Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"Roark66"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"I'm glad the author clarified he wants to prevent his instance from crashing not simply &quot;block robots and allow humans&quot;.<p>I think the idea that you can block bots and allow humans is fallacious.<p>We should focus on a specific behaviour that causes problems (like making a bajillion requests one for each commit, instead of cloning the repo). To fix this we should block clients that work in such ways. If these bots learn to request at a reasonable pace why cares if they are bots, humans, bots under a control of an individual human, bots owned by a huge company scraping for training data? Once you make your code (or anything else) public, then trying to limit access to only a certain class of consumers is a waste of effort.<p>Also, perhaps I'm biased, because I run a searXNG and <em>Crawl4AI</em> (and few ancillaries like jina rerank etc) in my homelab so I can tell my AI to perform live internet searches as well as it can get any website. For code it has a way to clone stuff, but for things like issues, discussions, PRs it goes mostly to GitHub.<p>I like that my AI can browse almost like me. I think this is the future way to consume a lot of the web (except sites like this one that are an actual pleasure to use).<p>The models sometimes hit sites they can't fetch. For this I use Firecrawl. I use MCP proxy that lets me rewrite the tool descriptions so my models get access to both my local <em>Crawl4ai</em> and hosted (and rather expensive)firecrawl, but they are told to use Firecrawl as last resort.<p>The more people use these kinds of solutions the more incentive there will be for sites not to block users that use automation. Of course they will have to rely on alternative monetisation methods, but I think eventually these stupid capchas will disappear and reasonable rate limiting will prevail."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"How I protect my Forgejo instance from AI web crawlers"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://her.esy.fun/posts/0031-how-i-protect-my-forgejo-instance-from-ai-web-crawlers/index.html"}},"_tags":["comment","author_Roark66","story_46345205"],"author":"Roark66","children":[46353065,46353158,46353924],"comment_text":"I&#x27;m glad the author clarified he wants to prevent his instance from crashing not simply &quot;block robots and allow humans&quot;.<p>I think the idea that you can block bots and allow humans is fallacious.<p>We should focus on a specific behaviour that causes problems (like making a bajillion requests one for each commit, instead of cloning the repo). To fix this we should block clients that work in such ways. If these bots learn to request at a reasonable pace why cares if they are bots, humans, bots under a control of an individual human, bots owned by a huge company scraping for training data? Once you make your code (or anything else) public, then trying to limit access to only a certain class of consumers is a waste of effort.<p>Also, perhaps I&#x27;m biased, because I run a searXNG and Crawl4AI (and few ancillaries like jina rerank etc) in my homelab so I can tell my AI to perform live internet searches as well as it can get any website. For code it has a way to clone stuff, but for things like issues, discussions, PRs it goes mostly to GitHub.<p>I like that my AI can browse almost like me. I think this is the future way to consume a lot of the web (except sites like this one that are an actual pleasure to use).<p>The models sometimes hit sites they can&#x27;t fetch. For this I use Firecrawl. I use MCP proxy that lets me rewrite the tool descriptions so my models get access to both my local Crawl4ai and hosted (and rather expensive)firecrawl, but they are told to use Firecrawl as last resort.<p>The more people use these kinds of solutions the more incentive there will be for sites not to block users that use automation. Of course they will have to rely on alternative monetisation methods, but I think eventually these stupid capchas will disappear and reasonable rate limiting will prevail.","created_at":"2025-12-22T10:46:50Z","created_at_i":1766400410,"objectID":"46353048","parent_id":46345205,"story_id":46345205,"story_title":"How I protect my Forgejo instance from AI web crawlers","story_url":"https://her.esy.fun/posts/0031-how-i-protect-my-forgejo-instance-from-ai-web-crawlers/index.html","updated_at":"2026-03-05T23:16:15Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"kordlessagain"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"The &quot;economies of scale&quot; defense of Cloudflare ignores a fundamental reality: 23.8 million websites run on Cloudflare's free tier versus only 210,000 paying customers or so. Free users are not a strategic asset. They are an uncompensated cost, full stop. Cloudflare doesn't absorb this loss out of altruism; they monetize it by building AI bot-detection systems, charging for bot mitigation, and extracting threat intelligence data. Today's outage was caused by a bug in Cloudflare's service to combat bots.<p>That's AI bots, BTW. Bots like Playwright or <em>Crawl4AI</em>, which provide a useful service to individuals using agentic AI. Cloudflare is hostile to these types of users, even though they likely cost websites nothing to support well.<p>The &quot;scale saves money&quot; argument commits a critical error: it counts only the benefits of concentration while externally distributing the costs.<p>Yes, economies of scale exist. But Cloudflare's scale creates catastrophic systemic risk that individual companies using cloud compute never would. An estimated $5-15 billion was lost for every hour of the outage according to Tom's Guide. That cost didn't disappear. It was transferred to millions of websites, businesses, and users who had zero choice in the matter.<p>Again, corporations shitting on free users. It's a bad habit and a dark pattern.<p>Even worse, were you hoping to call an Uber this morning for your $5K vacation? Good luck.<p>This is worse than pure economic inefficiency. Cloudflare operates as an authorized man-in-the-middle to 20% of the internet, decrypting and inspecting traffic flows. When their systems fail, not due to attacks, but to internal bugs in their monetization systems, they don't just lose uptime.<p>They create a security vulnerability where encrypted connections briefly lose their encryption guarantee. They've done this before (Cloudbleed), and they'll do it again. Stop pretending to have rational arguments with irrational future outcomes.<p>The deeper problem: compute, storage, and networking are cheap. The &quot;we need Cloudflare's scale for DDoS protection&quot; argument is a circular justification for the very concentration that makes DDoS attractive in the first place. In a fragmented internet with 10 CDNs, a successful DDoS on one affects 10% of users. In a Cloudflare-dependent internet, a DDoS, or a bug, affects 50%, if Cloudflare is unable to mitigate (or DDoSs themselves).<p>Cloudflare has inserted themselves as an unremovable chokepoint. Their business model depends on staying that chokepoint. Their argument for why they must stay a chokepoint is self-reinforcing. And every outage proves the model is rotten."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Cloudflare Global Network experiencing issues"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://www.cloudflarestatus.com/incidents/8gmgl950y3h7"}},"_tags":["comment","author_kordlessagain","story_45963780"],"author":"kordlessagain","children":[45972424],"comment_text":"The &quot;economies of scale&quot; defense of Cloudflare ignores a fundamental reality: 23.8 million websites run on Cloudflare&#x27;s free tier versus only 210,000 paying customers or so. Free users are not a strategic asset. They are an uncompensated cost, full stop. Cloudflare doesn&#x27;t absorb this loss out of altruism; they monetize it by building AI bot-detection systems, charging for bot mitigation, and extracting threat intelligence data. Today&#x27;s outage was caused by a bug in Cloudflare&#x27;s service to combat bots.<p>That&#x27;s AI bots, BTW. Bots like Playwright or Crawl4AI, which provide a useful service to individuals using agentic AI. Cloudflare is hostile to these types of users, even though they likely cost websites nothing to support well.<p>The &quot;scale saves money&quot; argument commits a critical error: it counts only the benefits of concentration while externally distributing the costs.<p>Yes, economies of scale exist. But Cloudflare&#x27;s scale creates catastrophic systemic risk that individual companies using cloud compute never would. An estimated $5-15 billion was lost for every hour of the outage according to Tom&#x27;s Guide. That cost didn&#x27;t disappear. It was transferred to millions of websites, businesses, and users who had zero choice in the matter.<p>Again, corporations shitting on free users. It&#x27;s a bad habit and a dark pattern.<p>Even worse, were you hoping to call an Uber this morning for your $5K vacation? Good luck.<p>This is worse than pure economic inefficiency. Cloudflare operates as an authorized man-in-the-middle to 20% of the internet, decrypting and inspecting traffic flows. When their systems fail, not due to attacks, but to internal bugs in their monetization systems, they don&#x27;t just lose uptime.<p>They create a security vulnerability where encrypted connections briefly lose their encryption guarantee. They&#x27;ve done this before (Cloudbleed), and they&#x27;ll do it again. Stop pretending to have rational arguments with irrational future outcomes.<p>The deeper problem: compute, storage, and networking are cheap. The &quot;we need Cloudflare&#x27;s scale for DDoS protection&quot; argument is a circular justification for the very concentration that makes DDoS attractive in the first place. In a fragmented internet with 10 CDNs, a successful DDoS on one affects 10% of users. In a Cloudflare-dependent internet, a DDoS, or a bug, affects 50%, if Cloudflare is unable to mitigate (or DDoSs themselves).<p>Cloudflare has inserted themselves as an unremovable chokepoint. Their business model depends on staying that chokepoint. Their argument for why they must stay a chokepoint is self-reinforcing. And every outage proves the model is rotten.","created_at":"2025-11-18T19:35:03Z","created_at_i":1763494503,"objectID":"45970923","parent_id":45970327,"story_id":45963780,"story_title":"Cloudflare Global Network experiencing issues","story_url":"https://www.cloudflarestatus.com/incidents/8gmgl950y3h7","updated_at":"2026-03-05T23:02:21Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"bhy"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"What is c4ai? <em>Crawl4ai</em>?"},"story_title":{"matchLevel":"none","matchedWords":[],"value":"ChatGPT Developer Mode: Full MCP client access"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://platform.openai.com/docs/guides/developer-mode"}},"_tags":["comment","author_bhy","story_45199713"],"author":"bhy","children":[45213490],"comment_text":"What is c4ai? Crawl4ai?","created_at":"2025-09-11T00:33:23Z","created_at_i":1757550803,"objectID":"45206141","parent_id":45203013,"story_id":45199713,"story_title":"ChatGPT Developer Mode: Full MCP client access","story_url":"https://platform.openai.com/docs/guides/developer-mode","updated_at":"2026-03-05T22:38:33Z"},{"_highlightResult":{"author":{"matchLevel":"none","matchedWords":[],"value":"Tsarp"},"comment_text":{"fullyHighlighted":false,"matchLevel":"full","matchedWords":["crawl4ai"],"value":"With stuff like <a href=\"https://www.cloudflare.com/en-in/application-services/products/turnstile/\" rel=\"nofollow\">https://www.cloudflare.com/en-in/application-services/produc...</a> and <a href=\"https://blog.cloudflare.com/ai-labyrinth/\" rel=\"nofollow\">https://blog.cloudflare.com/ai-labyrinth/</a> big money going on both sides last thing you want is to shadow detected as a bot. Its all ok if you are scraping to top rated SEO slop which is usually static sites but for anything beyond it wont work well eventually. Quite a few issues on browerbase, <em>crawl4ai</em> and similar repos around being detected as a bot."},"story_title":{"matchLevel":"none","matchedWords":[],"value":"Show HN: Nxtscape \u2013 an open-source agentic browser"},"story_url":{"matchLevel":"none","matchedWords":[],"value":"https://github.com/nxtscape/nxtscape"}},"_tags":["comment","author_Tsarp","story_44329457"],"author":"Tsarp","comment_text":"With stuff like <a href=\"https:&#x2F;&#x2F;www.cloudflare.com&#x2F;en-in&#x2F;application-services&#x2F;products&#x2F;turnstile&#x2F;\" rel=\"nofollow\">https:&#x2F;&#x2F;www.cloudflare.com&#x2F;en-in&#x2F;application-services&#x2F;produc...</a> and <a href=\"https:&#x2F;&#x2F;blog.cloudflare.com&#x2F;ai-labyrinth&#x2F;\" rel=\"nofollow\">https:&#x2F;&#x2F;blog.cloudflare.com&#x2F;ai-labyrinth&#x2F;</a> big money going on both sides last thing you want is to shadow detected as a bot. Its all ok if you are scraping to top rated SEO slop which is usually static sites but for anything beyond it wont work well eventually. Quite a few issues on browerbase, crawl4ai and similar repos around being detected as a bot.","created_at":"2025-06-21T16:55:13Z","created_at_i":1750524913,"objectID":"44338997","parent_id":44338994,"story_id":44329457,"story_title":"Show HN: Nxtscape \u2013 an open-source agentic browser","story_url":"https://github.com/nxtscape/nxtscape","updated_at":"2025-06-21T17:00:57Z"}],"hitsPerPage":20,"nbHits":10602,"nbPages":50,"page":0,"params":"query=crawl4ai&advancedSyntax=true&analyticsTags=backend","processingTimeMS":11,"processingTimingsMS":{"_request":{"roundTrip":25},"afterFetch":{"merge":{"mergeLoop":{"prepareNextHit":4,"total":4},"total":4},"total":4},"fetch":{"query":5,"total":6},"total":11},"query":"crawl4ai","serverTimeMS":12}
