How Web Scraping Actually Works - And Why AI Changes Everything

3 weeks ago

ZDNET's key takeaways

Web scraping powers pricing, SEO, security, AI, and research industries.
AI scraping threatens site survival by bypassing traffic return.
Companies fight back with licensing, paywalls, and crawler blocks.

In the world of industrial web scraping, there are a few major players. Oh, you didn't know there was a world of industrial web scraping? Have I got a story for you.

Let's start by defining web scraping. Web scraping is the practice of extracting data from live web pages, the pages the public sees when visiting a website.

This is different from getting data via programmatic API (application programming interface) calls that the provider of the web page makes available, or from a database, or other downloadable data.

Web scraping is extracting data that the web page owner has not officially made available for data analysis and, in some cases, actively does not want to make available for external data analysis.

Web scraping example

Let's look at an example. Let's say you're a vendor with 200 individual products you sell online.

Your products are fairly price sensitive, which is to say that if a competitor starts selling a similar product at a lower price, you need to be able to respond and lower your price as well. You need to be able to respond to market forces fairly quickly, so tasking a bunch of employees to constantly refresh hundreds of web pages and note results in a spreadsheet just won't do.

You need an automated process.

Let's further assume your products, as well as those of your competitor, are sold at popular online marketplaces like Amazon and Walmart. Both of these resellers provide tracking data on your products, but they won't share your competitors' data with you.

Yet you need that data. The solution is web scraping: using an automated process to visit the web pages containing your competitors' products and extracting current pricing information from the underlying HTML structure of the page. That data can then be fed into your internal databases, and your internal systems can then update your prices accordingly.
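
To make "extracting pricing information from the underlying HTML structure" concrete, here is a minimal sketch using only Python's standard library. The sample markup, the "price" class name, and the PriceParser and extract_prices names are all assumptions made for illustration. Real marketplace pages are structured differently and change often, so treat this as a picture of the idea rather than a working scraper.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collects the text inside any element whose class list includes 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the price element,
        # so nested markup inside a price is not handled.
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            # Strip the currency symbol and thousands separators.
            self.prices.append(float(data.strip().lstrip("$").replace(",", "")))


def extract_prices(html):
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


# Hypothetical page fragment; the markup is invented for this example.
SAMPLE_PAGE = """
<div class="product">
  <h2>Competitor Widget</h2>
  <span class="price">$1,249.99</span>
</div>
"""

print(extract_prices(SAMPLE_PAGE))
```

The list this returns is the kind of data that would then be loaded into an internal pricing database.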

This scanning cycle might happen daily or a few times a week, keeping your products competitively priced and your customers happy.

Other web scraping applications

Industrial web scraping, where businesses scrape the web for data, is done for a variety of reasons. We just saw an application where a company uses web scraping for competitive information that drives business insights and informed decision-making.

In addition to dynamic pricing, companies might want to have a clear view of available inventory and even new product listings from competitors. They might also want to keep an eye on top products, reviews, and more.

Some businesses use web scraping to provide data as a service, whether that is real estate market data, sales leads, or some other aggregate of information that other companies find useful.

If you've ever used an SEO monitoring tool or keyword ranking tool, you've probably been a consumer of web-scraped data provided as a service. The companies providing these services have to scan live sites (like Google) and pull down information that is then categorized and processed to provide up-to-date SEO analytics.

There are also security and intellectual property protection applications for web scraping. For those with valuable brands, there is justification in scanning live web pages of commerce sites (as well as other classes of websites) for inappropriate or illegal use of your brands.

The US Department of Commerce says counterfeiting is the "largest criminal enterprise in the world," putting estimates of pirated and counterfeited goods at an almost incomprehensible $1.7 to $4.5 trillion per year.

Unfortunately, the government can't stop this behavior, which leaves it up to individual brand owners to mount their own defense. An important use of web scraping in this context is identifying counterfeit product offerings, and then initiating the process to get those counterfeit products removed from the market.

Other web scraping uses include threat intelligence, phishing protection, flight and hotel pricing information, aggregating data on trends for market research, and even data used for AI training and academic research.

Two sides of the scraping coin: search and AI

Web scraping is not new. In fact, it's just about as old as the web. Think about search engines. In order for you to type something into Google and get back a list of web pages that include the topic you're searching for, the search engine has to have already spidered, scraped, and indexed the sites it points you to.

Let's talk about helminths (intestinal worms) for a moment. That's a difficult transition, but I promise it's relevant.

When my dog eats poop, we have to give him deworming medicine so he doesn't get sick. But as Helena Helmby shows in the journal BMC Immunology, beneficial parasitic worm species like Trichuris trichiura or Necator americanus can help treat autoimmune disorders like Crohn's disease and ulcerative colitis.

Search engines are basically beneficial parasites living off the work of individual website providers. They're beneficial because although they scrape the web, they direct traffic back to the sites they scrape. The whole world of SEO became a thing because of how much traffic Google search sends to websites.

But then there's AI. AI is a lot like the parasitic sea lamprey (Petromyzon marinus), an agnathan (basically a jawless fish). Sea lampreys can grow up to four feet long. They attach themselves to other large fish with a suction mouth, scrape away a hole in the host's skin, and feed on blood and bodily fluids. These creatures devastated Great Lakes fisheries in the early 20th century. Later techniques, including poison, barriers, and trapping, have reduced the problem considerably.

AI scraping is parasitic behavior that's devastating website traffic. The AIs pull in information (like from this article) and then, instead of sending readers to the site where an author wrote the piece, simply present that information before anyone visits a site. I wrote a lot about this phenomenon and some of the protections that are starting to be deployed in How AI companies are secretly collecting training data from the web (and why it matters). That will bring you up to speed on the issue in more depth.

Both search and AI use the results of absolutely ginormous scraping and spidering operations, but one provides benefits to the scrapees, while the other profits enormously from the work of others while simultaneously destroying their incentive to keep doing the work.

(Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

The challenges of web scraping

Web scraping on an industrial level requires large-scale data acquisition efforts. This generally involves the use of an automated bot that retrieves web pages for analysis and curation. Unfortunately, at least from the point of view of web scrapers, most web servers detect and block repeated page accesses, even to publicly facing pages.
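
How a server decides that page accesses are "repeated" varies from site to site. One simple and widely used idea is a sliding-window counter per client IP address. The sketch below illustrates that idea only. The SlidingWindowLimiter name, the limit of three requests per 60 seconds, and the documentation-range IP address are assumptions for the example, not a description of any particular server's defenses.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> timestamps of recent requests

    def allow(self, client_ip, now):
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # too many recent requests from this client: block
        hits.append(now)
        return True


# Hypothetical limit: three requests per 60 seconds from one address.
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60)
results = [limiter.allow("203.0.113.7", now=t) for t in (0, 1, 2, 3, 61)]
print(results)
```

In this run, the fourth request inside the window is refused, and the client is allowed again once its earliest requests age out.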

If the e-commerce company from our case study needs to update pricing on 200 products, it will probably need to make a few thousand web page retrieval requests. This volume of retrieval would likely be blocked by any web server receiving those requests.

This makes it fairly difficult for individual companies to do their own web scraping in-house. Instead, a small cadre of companies has formed to offer web scraping as a service. At their core is the ability to split web scraping requests among thousands of individual computers, using them as proxies for data retrieval.

While some scrapers do use data center-based proxy servers, the practice is often defeated at the website level, because all those scraping requests come from one IP address cluster or geolocation. Instead, a more gold-standard practice is to use individual residential computers all across a targeted area (often homes in the US).

Scraping requests are then distributed among the home computers. Each computer retrieves a web page. Then all those computers return those retrieved pages back to servers at the scraping-as-a-service provider, which then manages the data for customers.
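
The distribution step can be pictured as a simple assignment problem: given a list of pages to fetch and a pool of proxy machines, spread the pages evenly so no single address makes too many requests. The sketch below uses plain round-robin assignment and performs no network access. The assign_round_robin function, the example.com URLs, and the documentation-range IP addresses are hypothetical, and real providers presumably use more elaborate scheduling.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(urls, proxies):
    """Pairs each URL with a proxy, cycling through the proxy pool in order."""
    if not proxies:
        raise ValueError("proxy pool is empty")
    return list(zip(urls, cycle(proxies)))


# Hypothetical inputs: placeholder URLs and documentation-range IP addresses.
urls = [f"https://example.com/product/{i}" for i in range(1, 7)]
proxies = ["198.51.100.10", "198.51.100.11", "198.51.100.12"]

plan = assign_round_robin(urls, proxies)
per_proxy = Counter(proxy for _, proxy in plan)
print(per_proxy)
```

With six pages and three proxies, each proxy is assigned two pages, so the target site sees a third of the traffic from any one address.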

This leads to another obvious challenge. How, exactly, do you get thousands to hundreds of thousands of home computers to work in concert to do web scraping? And how do you do it legally and ethically, with the consent of the home computer owners?

First of all, it's not always done legally or ethically. Malware plays a large part in distributing bots to thousands or even millions of end-user computers, which can then be "mind-controlled" into doing searches and scraping activities at scale.

There are, however, some companies that do web scraping legally and ethically, while also processing data in great volume. These companies pay a small stipend to end users who voluntarily give up a few cycles of processing power and a few bytes of bandwidth to scraper client programs, which feed the results back to central repositories. We spotlighted one such ethical scraper in my article, This proxy provider I tested is the best for web scraping -- and it's not IPRoyal or MarsProxies.

Where do we go from here?

While scraping will likely always be a part of data acquisition practice, some companies have opted to make their data available officially and for a fee. Reddit, for example, is giving OpenAI access to its enormous library of fanbois screaming into the wind about this or that topic.

Rather than scrape Reddit without approval, OpenAI will be able to use an API (application programming interface) to retrieve data more efficiently.

Of course, whether we want our AIs to base their knowledge on data from Reddit is another thing entirely.

Reddit is not alone, of course. Many companies have started to license their data to the AIs. While this does not reduce the scraping or the traffic erosion, it does provide something of an alternative revenue stream for the former victims of scraping activity.

This is not an issue that's going away. One other approach to defend against malicious scraping has been implemented by edge traffic manager Cloudflare. About 20% of Internet traffic flows through its servers. Cloudflare is blocking AI web crawlers by default (unless they get paid, 'natch).
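
Network-level blocking of the kind described here happens inside a provider's infrastructure, and this article doesn't go into how it is implemented. A simpler, voluntary mechanism that site owners have long used to turn crawlers away is a robots.txt file, which well-behaved bots check before fetching a page. The sketch below uses Python's standard urllib.robotparser to show how such a file is interpreted. The robots.txt contents and the example.com URL are made up for illustration, and nothing forces a scraper to honor them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that turns away one AI crawler and admits everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("GPTBot", "Googlebot"):
    allowed = parser.can_fetch(agent, "https://example.com/article")
    print(agent, "allowed" if allowed else "blocked")
```

Here the AI crawler is told to stay out while the search crawler is let in, which mirrors the search-versus-AI distinction discussed earlier.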

The bottom line is that web scraping is all about money. Whether money is spent bypassing restrictions to hoover up someone else's work, or money is spent to block that activity, or money is spent to get permission to extract that data and thereby reduce the overall value of the property, it's all about money. Lots and lots of money.

Those of us who toil to create the content consumed by these robots are simply caught in the crossfire.

How do you feel about the growing use of web scraping by AI companies compared to search engines? Do you think licensing deals like Reddit's are a fair solution, or do they just legitimize the loss of site traffic? Should web scraping be more tightly regulated, or is it an unavoidable part of the modern Internet? Let us know in the comments below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.