By Mike Khorev
Web scraping is the process of extracting data or content from a website. When we right-click on an image on a website and save it, we are technically performing web scraping, but scraping can also be performed by programs or bots. These automated tools can extract far more data at a much faster rate, and therefore at a lower cost.
That said, web scraping (also called screen scraping, website data extraction, or web harvesting) can be harmless and completely legal. In some cases, however, it is used for malicious purposes, burdening the website’s server and even enabling other forms of cybercriminal attack.
A web scraper program can send many more requests than a typical human user, which perpetrators can exploit to execute a DDoS (Distributed Denial of Service) attack. And while it’s legal for web scrapers to extract publicly available data, some scraper programs can also bypass a target website’s security and steal sensitive data that is supposed to be hidden (e.g., users’ financial information).
A web scraper program can be anything from very simple to highly sophisticated. We can even build our own web scraper bot, although doing so requires relatively advanced programming knowledge, and the more advanced the scraper, the harder it is to build. Various pre-built web scraper programs are also widely available, some of them free, and some quite sophisticated, with multiple advanced features.
This availability of pre-built web scrapers is why it’s essential for website owners to understand the concept of web scraping, how to identify web scraper activities, and how to block their activities when required.
Web Scraping Protection Methods
- Securing Your Website
Here are some basic web scraping prevention methods you should implement on your website:
- State your terms of use

Include a clause in your website’s terms of use that restricts scraping, for example: “You may only use or reproduce the content on this site for personal and non-commercial use”.

Doing this might not stop attackers with malicious intent, but it will stop those acting in good faith and give you a legal advantage.
- Prevent hotlinking
Hotlinking is displaying resources (images, videos, or other files) hosted on your site on other websites, consuming your server’s resources in the process. Copying links and images directly is common practice in web scraping. When you prevent hotlinking, your images displayed on other sites no longer consume your server’s resources. This won’t stop others from stealing and using your content, but at the very least it mitigates the damage.
- Use cross-site request forgery (CSRF) tokens
Implementing CSRF tokens on your website can help prevent automation bots and other automation software from making arbitrary requests on your site URL(s).
A CSRF token is essentially a unique, secret value generated by the web server and transmitted to the client when the client performs an HTTP request. When the client makes another request, the server-side application checks whether the request includes the CSRF token and rejects it if the token is missing.
To get around a CSRF token, a web scraper bot must first find the right token and bundle it with its request; only the more sophisticated scraper programs can do this.
- Monitor Your Traffic and Limit Unusual Activity
The best way to prevent web scrapers is to set up a monitoring system. Then, when your system detects unusual activity that indicates the presence of scraper bots, you can block or limit that activity.
Here are some common practices to try:
- Rate limiting:
Fairly self-explanatory: you limit users (including legitimate ones) to a certain number of actions in a given time frame. For example, you might allow only a certain number of searches per second or minute from any particular user (or IP address). Doing this will significantly slow down scraper bots’ activities.
However, if you rate limit or block traffic, you should go beyond IP address detection. Here are some indicators that can help you identify scraper bots:
- Linear mouse movements and clicks
- Very fast form submissions
- Browser type, screen resolution, timezone, and similar fingerprint signals that reveal bad bots
On a shared internet connection, you might get requests from several legitimate users with the same IP address. Checking factors beyond the IP address helps you distinguish real human users from web scrapers.
- Require account creation
Asking users to register and log in before they can access your content can be a good preventive measure against web scrapers. Still, it also affects user experience (UX) and might deter legitimate users, so use it sparingly.
Also, some sophisticated web scrapers can register and log in automatically, and even create multiple accounts. A good practice here is to require an email address for registration and verify it. You can also add a CAPTCHA test (more on this below) to prevent scraper bots from creating accounts.
- Use CAPTCHAs
CAPTCHA (“Completely Automated Public Turing test to tell Computers and Humans Apart”) is an effective measure against web scrapers and automated scripts (bots) in general.
The main idea of a CAPTCHA is a test that is easy (or very easy) for human users to solve but very difficult for bots. You can include a CAPTCHA on your sensitive pages, or show one only when your system detects a possible scraper and wants to stop the content scraping.
There are various easy ways to implement CAPTCHA on your site. For example, Google’s reCAPTCHA is a free and reliable way to add CAPTCHAs to your website. Although reCAPTCHA is fairly reliable, it is not perfect and has several weaknesses.
One thing to remember is never to include the solution to the CAPTCHA on your page in any form. Some websites have made the mistake of including the solution in the page’s HTML markup; scrapers can simply scrape it and use it to bypass the CAPTCHA.
Keep in mind that there are various ways for web scrapers to bypass CAPTCHAs. For example, there are CAPTCHA farm services where real humans are paid to solve the tests, rendering them useless. In such cases, you will need to pair the CAPTCHA with advanced bot detection software that can detect CAPTCHA farms and bots dedicated to solving CAPTCHAs and reCAPTCHAs.
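Server-side verification of a reCAPTCHA response against Google’s `siteverify` endpoint can be sketched like this. The injectable `fetch` parameter is a testing convenience of this sketch, not part of the reCAPTCHA API:

```python
import json
from urllib import parse, request

VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"

def verify_recaptcha(secret_key: str, client_response: str, fetch=None) -> bool:
    """Ask Google's siteverify endpoint whether the CAPTCHA was solved.

    `fetch` defaults to a real HTTP POST but can be swapped out in tests.
    """
    def default_fetch(url: str, data: bytes) -> bytes:
        with request.urlopen(url, data=data) as resp:
            return resp.read()

    fetch = fetch or default_fetch
    payload = parse.urlencode(
        {"secret": secret_key, "response": client_response}
    ).encode()
    # The endpoint returns JSON with a boolean "success" field
    result = json.loads(fetch(VERIFY_URL, payload))
    return bool(result.get("success"))
```

The server, never the browser, holds `secret_key`; the `client_response` value comes from the form field the reCAPTCHA widget fills in on the page.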
- Don’t expose your whole content and API endpoints
Don’t give a script or bot a way to access all your content from one page. For example, instead of listing all your blog posts on a directory page, make them accessible only via your site’s on-site search.
This way, the web scraper must search for all possible phrases to find all your articles, which will be very difficult and time-consuming even for the most sophisticated scrapers. Hopefully, the scraper will give up due to this simple measure.
However, searching for something like “the”, “and”, or other generic keywords might reveal almost all your content. You can tackle this by showing only the first 10 or 20 results.
Make sure you don’t expose your APIs, and especially your API endpoints: scrapers can reverse engineer them and call them directly from a scraper script. Instead, keep your API endpoints authenticated and otherwise hard for others to use.
- Secure Your Pages
We have already mentioned requiring logins for specific content to prevent scraping. This blocks automated bots, and even when one does manage to log in, you can accurately track its actions and ban the account when you detect scraping activity.
While requiring registration and login won’t stop content scraping 100%, it will at least give you insight and control.
Some tips you can use here:
- Change up your HTML regularly
A common practice among web scrapers is to find patterns and possible exploits in a site’s HTML markup, then use those patterns to target the site’s content or exploit vulnerabilities.
With that in mind, consider changing your site’s HTML markup frequently, or keeping your markup non-uniform and inconsistent. Doing this can discourage attackers who can’t reliably find patterns in your site’s HTML.
These changes don’t mean you need to redesign your site altogether; simply changing the IDs and classes in your HTML and CSS files regularly can be enough.
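One way to rotate IDs and classes is to generate fresh aliases at build time and apply the same mapping to both templates and stylesheets. This is a sketch; the `c-` prefix, the helper name, and the example class names are illustrative:

```python
import secrets

def rotate_class_names(class_names: list[str]) -> dict[str, str]:
    """Map each stable internal class name to a fresh random alias.

    Run this at build time and apply the mapping to both the HTML
    templates and the CSS, so selectors stay consistent within one
    deploy but change between deploys.
    """
    return {name: "c-" + secrets.token_hex(4) for name in class_names}
```

Because the mapping is regenerated on every deploy, a scraper’s selectors written against last week’s markup (e.g. `div.article-body`) stop matching.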
- Creating honey pot or trap pages
Honeypot pages are hidden pages, or hidden elements on a page (e.g., a hidden link), that an average human visitor would never click. Web scrapers tend to follow every link on a page, so they will stumble into the ‘trap’. You can, for example, disguise a link to blend in with the page’s background.
When a visitor requests this honeypot page, you can be almost certain it isn’t human. You can then monitor its activity and, if required, limit or even block all requests from that client.
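A honeypot check can be sketched as a simple server-side flag. The trap path, the link markup in the comment, and the function names are illustrative assumptions:

```python
# A link styled to be invisible to humans but followed by naive bots, e.g.:
#   <a href="/trap-page" style="display:none" rel="nofollow">archive</a>
# (path and styling are illustrative, not prescribed)

HONEYPOT_PATHS = {"/trap-page"}   # assumed trap URL(s)
_flagged_clients: set[str] = set()

def handle_request(client_id: str, path: str) -> bool:
    """Return False (block) for any client that has ever hit the trap."""
    if path in HONEYPOT_PATHS:
        _flagged_clients.add(client_id)
    return client_id not in _flagged_clients
```

Note that once a client touches the trap, every subsequent request from it is refused, which is the monitoring-then-blocking behavior described above.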
Web scrapers can eat up your website’s resources and enable other malicious activities, from stolen content to Layer 7 DDoS attacks. The fight between a website owner and a scraper is often a lengthy, continuous one: the owner must always stay a step ahead of hackers and scrapers to prevent malicious content scraping and other cybersecurity threats.
We have discussed several possible solutions for preventing web scraping, but none of them is a 100% guarantee against advanced, sophisticated scraper bots. It’s best to stay vigilant and monitor your traffic so you can identify malicious requests and throttle or block them as soon as possible.
Mike Khorev is passionate about all emerging technologies in the IT space and loves to write about all of them. He is a lifetime marketing and internet expert with over 10 years of experience in web technologies, SEO, online marketing and cybersecurity.