Last Updated on July 6, 2021
Web scraping, also known as content scraping or web harvesting, is the use of bots or automated programs to extract data from websites. There are various methods and techniques we can use for web scraping, but the basic principle remains the same: fetching the website and extracting data/content from it.
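To make the principle concrete, here is a minimal sketch of that fetch-and-extract loop using only the Python standard library. The sample page, the `price` class name, and the `PriceScraper` helper are illustrative; a real scraper would first download the page (e.g. with `urllib.request.urlopen()`) instead of using an inline string.

```python
from html.parser import HTMLParser

# An inline HTML string stands in for a fetched page response.
SAMPLE_PAGE = """
<html><body>
  <h2 class="product-name">Widget A</h2><span class="price">$19.99</span>
  <h2 class="product-name">Widget B</h2><span class="price">$24.50</span>
</body></html>
"""

class PriceScraper(HTMLParser):
    """Collects the text of every element whose class attribute is 'price'."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())
            self._in_price = False

scraper = PriceScraper()
scraper.feed(SAMPLE_PAGE)
print(scraper.prices)  # → ['$19.99', '$24.50']
```

Notice that the scraper depends entirely on a stable pattern in the markup (the `price` class) — which is exactly the weakness the prevention techniques below exploit.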
Web scraping by itself is not illegal; it’s how the scraper uses the content/data that may be, for example:
- Republishing your unique content: the attacker may repost your unique content elsewhere, negating its uniqueness and potentially stealing your traffic. This can also create a duplicate content issue, which may hinder your site’s SEO performance.
- Leaking confidential information: the attacker may leak your confidential information to the public or your competitor, ruining your reputation or causing you to lose your competitive advantage. Even worse, your competitor might be the one operating the web scraper bot.
- Ruining user experience: web scraper bots can heavily load your server, slowing down your page speed, which in turn may negatively affect your visitor’s user experience.
- Scalper bots: a unique type of web scraper bot can fill shopping carts, rendering products unavailable to legitimate buyers. This may ruin your reputation and may also drive your product’s price higher than it should be.
- Skewed analytics: chances are, you rely on accurate analytics data such as bounce rate, page views, user demographics, and so on. Scraper bots can distort this data, undermining your ability to make informed decisions.
Those are just some of the many negative impacts web scraping can cause, which is why it’s very important to prevent scraping attacks from malicious bots as soon as possible.
How To Prevent Scraping On Your Site
The basic principle of preventing web/content scraping is to make it as difficult as possible for bots and automated scripts to extract your data, without making it difficult for legitimate users to navigate your site or for good bots (even good web scraper bots) to access it.
This, however, can be easier said than done, and typically there’ll always be trade-offs between preventing scraping and accidentally blocking legitimate users and good bots.
Below we will discuss some effective methods for preventing scraping of a website:
Frequently update/modify your HTML code
A common type of web scraper, the HTML scraper or parser, extracts data based on patterns in your HTML code. So an effective tactic against this type of scraping is to intentionally change those HTML patterns, rendering these scrapers ineffective — or even tricking them into wasting their resources.
How to do so will vary depending on your website’s structure, but the idea is to look for HTML patterns that might be exploited by web scrapers.
While this approach is effective, it can be difficult to maintain in the long run, and it might affect your site’s caching. However, it’s still worth trying to prevent HTML scrapers from finding the desired data or content, especially if you have a collection of similar content that might produce predictable HTML patterns (i.e. a series of blog posts).
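One way to automate this pattern rotation is to derive CSS class names from a rotation key, such as the current week number or a deploy ID. The sketch below is a hypothetical illustration — the `rotated_class` helper and the key scheme are assumptions, and your stylesheet would need to be generated from the same mapping (this is the caching/maintenance trade-off mentioned above).

```python
import hashlib

def rotated_class(logical_name: str, rotation_key: str) -> str:
    """Map a stable logical name to a class name that changes with the key."""
    digest = hashlib.sha256(f"{logical_name}:{rotation_key}".encode()).hexdigest()
    return f"c{digest[:8]}"

def render_article(title: str, body: str, rotation_key: str) -> str:
    """Render the same content with rotation-dependent class names."""
    return (
        f'<div class="{rotated_class("article", rotation_key)}">'
        f'<h2 class="{rotated_class("title", rotation_key)}">{title}</h2>'
        f'<p class="{rotated_class("body", rotation_key)}">{body}</p>'
        f"</div>"
    )

# The same article rendered under two rotation keys: visible content is
# identical, but any scraper hard-coded against last week's class names breaks.
week_a = render_article("Hello", "World", rotation_key="2021-W27")
week_b = render_article("Hello", "World", rotation_key="2021-W28")
```

Within a single rotation period the mapping is deterministic, so your own templates and CSS stay consistent; only scrapers relying on stale selectors lose their anchor.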
Monitor and manage your traffic
You can check your traffic logs manually for unusual activity and symptoms of bot traffic, including:
- Many similar requests from the same IP address or a group of IP addresses
- Clients that fill out forms unusually fast
- Patterns in clicking buttons
- Suspicious mouse movement patterns (e.g., perfectly linear movements)
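The first symptom above — many similar requests from the same IP address — can be checked programmatically. Here is a minimal sketch; the log line layout, the sample IPs, and the threshold of 100 requests are illustrative assumptions, not recommended values.

```python
from collections import Counter

def suspicious_ips(log_lines, threshold=100):
    """Return the set of IPs whose request count meets the threshold.

    Assumes each log line starts with the client IP, as in common
    access-log formats.
    """
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return {ip for ip, n in counts.items() if n >= threshold}

# Synthetic example: one client issues 150 near-identical search requests
# while a normal visitor makes only a few.
logs = (["203.0.113.9 GET /search?q=widget"] * 150
        + ["198.51.100.4 GET /about"] * 3)
print(suspicious_ips(logs))  # → {'203.0.113.9'}
```

In practice you would also bucket requests by time window and group related IP ranges, since scrapers often rotate addresses within a subnet.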
Once you’ve identified activities from web scraper bots, you can either:
- Challenge with CAPTCHA. However, keep in mind that CAPTCHA may hurt your site’s user experience, and with the rise of CAPTCHA farm services, challenge-based bot management approaches are no longer very effective.
- Rate limiting: for example, allow only a specific number of searches per second from any single IP address. This will significantly slow down the scraper and might push the operator to pursue another target instead.
- If you are 100% positive about the presence of bots, you can block the traffic altogether. However, this isn’t always the best approach since sophisticated attackers might simply modify the bot to bypass your blocking policies.
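The rate-limiting option above can be sketched as a small sliding-window limiter. This is a hypothetical, in-memory illustration — the `RateLimiter` class, its limits, and the sample IP are assumptions; production setups usually enforce this at the reverse proxy or CDN layer.

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `limit` requests per `window` seconds per IP."""

    def __init__(self, limit=5, window=1.0):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Evict timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: reject (or challenge) the request
        hits.append(now)
        return True

# Five requests in half a second against a limit of 3 per second:
limiter = RateLimiter(limit=3, window=1.0)
results = [limiter.allow("203.0.113.9", now=0.1 * i) for i in range(5)]
print(results)  # → [True, True, True, False, False]
```

A rejected request can be met with an HTTP 429 response, a CAPTCHA challenge, or a silent slowdown, depending on how confident you are that the client is a bot.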
Alternatively, you can use autopilot bot management software like DataDome that will actively detect the presence of web scraper activities in real-time and mitigate their activities instantly as they are detected.
Honeypots and feeding fake data
Another effective technique is to add ‘honeypots’ to your content or HTML code to fool web scrapers.
The idea here is to redirect the scraper bot to a fake (honeypot) page and/or serve fake and useless information to the scraper bot. You can serve up randomly generated articles that look similar to your real articles, so the scrapers can’t distinguish between them, ruining the extracted data.
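A minimal sketch of the idea: a honeypot URL is linked invisibly (hidden via CSS) and disallowed in robots.txt, so no legitimate user or well-behaved bot ever requests it. Any client that does is almost certainly a scraper, gets flagged, and is served plausible-looking fake content. The path, titles, and `handle_request` helper here are all hypothetical.

```python
import random

HONEYPOT_PATH = "/articles/archive-full"  # hidden link, disallowed in robots.txt

FAKE_TITLES = ["10 Widget Trends", "Widget Care Basics", "Choosing a Widget"]

def handle_request(path, client_ip, flagged_ips):
    """Serve real content, or fake filler if the honeypot path is requested."""
    if path == HONEYPOT_PATH:
        flagged_ips.add(client_ip)          # remember this client as a bot
        title = random.choice(FAKE_TITLES)  # serve randomly generated filler
        return f"<article><h2>{title}</h2><p>Lorem ipsum...</p></article>"
    return "<p>real content</p>"

flagged = set()
handle_request(HONEYPOT_PATH, "203.0.113.9", flagged)
print("203.0.113.9" in flagged)  # → True
```

Once an IP is flagged this way, you can keep feeding it fake articles indefinitely rather than blocking it, which poisons the scraper’s dataset without alerting its operator.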
Don’t expose your dataset
Again, since the goal is to make it as difficult as possible for the web scraper to access and extract data, do not provide a way for them to get your entire dataset at once.
For example, don’t list all your blog posts/articles on a single page; instead, make them accessible only via your site’s search feature.
Also, make sure you don’t expose any unintended APIs or access points, and obfuscate your endpoints at all times.
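The search-only approach can be sketched as follows: there is no endpoint that returns the whole dataset, only a search function that requires a non-trivial query and caps the number of results per request. The dataset, query length, and result cap below are illustrative assumptions.

```python
ARTICLES = [f"Post {i}" for i in range(1, 501)]  # stand-in for your dataset

MAX_RESULTS = 10   # cap results per request
MIN_QUERY_LEN = 3  # reject trivial "list everything" queries

def search(query):
    """Return at most MAX_RESULTS matches; there is no full-listing path."""
    if len(query.strip()) < MIN_QUERY_LEN:
        raise ValueError("query too short")
    matches = [a for a in ARTICLES if query.lower() in a.lower()]
    return matches[:MAX_RESULTS]

print(len(search("post 1")))  # → 10
```

A scraper now has to guess many distinct queries to enumerate your content, and that burst of queries is exactly the traffic pattern the monitoring and rate-limiting techniques above are designed to catch.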
While there isn’t a one-size-fits-all answer to preventing scraping of a website, the four methods we have shared above are among the most effective in finding the right balance between your site’s user experience for legitimate users and preventing scraping. It’s best to use these four tips in combination while considering which works best for your current needs and requirements.