1. Preface#
For most webpages we want well-structured, clearly presented content that search engines can recognize accurately. But there are also cases where we don't want content to be harvested easily, such as the transaction figures on an e-commerce site or the questions on an educational site. This data is often the lifeblood of a product and must be protected effectively. That tension is where the topic of scraping and anti-scraping begins.
2. Common Anti-Scraping Strategies#
Let's be clear up front: no website in the world can achieve perfect anti-scraping.
If a page is to display normally to users without handing its content to scrapers, it must be able to tell humans apart from robots. Engineers have made many attempts at this; the following strategies, mostly implemented on the backend, are currently the most common and effective:
- User-Agent + Referer detection
- Account and cookie verification
- Captcha
- IP frequency restriction
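As a sketch of the last item, here is a minimal per-IP sliding-window rate limiter. The `WINDOW` and `LIMIT` values are illustrative assumptions, not anyone's production settings:

```python
import time
from collections import defaultdict, deque

WINDOW = 60.0   # seconds per window (assumed)
LIMIT = 100     # max requests per IP per window (assumed)

hits = defaultdict(deque)  # ip -> timestamps of recent requests

def allow(ip, now=None):
    """Sliding-window counter: reject once an IP exceeds LIMIT per WINDOW."""
    now = time.monotonic() if now is None else now
    q = hits[ip]
    # Evict timestamps that have fallen out of the window.
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) >= LIMIT:
        return False
    q.append(now)
    return True
```

A real deployment would keep these counters in something shared like Redis, but the eviction-then-count logic is the same.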
But scrapers can mimic humans quite closely, for example:
- Using Chrome headless or PhantomJS to simulate a browser environment
- Using Tesseract to recognize captchas
- Proxy IPs can be bought on Taobao
So, a 100% effective anti-scraping strategy? It doesn't exist.
Anti-scraping is really a war of attrition: a question of how much effort each side is willing to spend.
However, as frontend engineers, we can increase the difficulty a bit and design some interesting anti-scraping strategies.
3. Frontend and Anti-Scraping#
3.1 Font-face Collage#
Example: Maoyan Movies
Maoyan does not render its box-office figures as plain digits.
The page uses @font-face to define a custom character set and maps the digits through custom Unicode code points for display. In other words, short of image recognition, a scraper must also download and parse the font file before it can recover the numbers.
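The server side of this trick can be sketched as follows: each digit is assigned a random private-use code point per session, so the HTML carries gibberish and only the matching font file knows how to draw it. All names and values here are illustrative, not Maoyan's actual implementation:

```python
import random

DIGITS = "0123456789"

def make_font_map():
    """Assign each digit a random Private Use Area code point; a matching
    @font-face file (generated server-side) would draw each point as its digit."""
    points = random.sample(range(0xE000, 0xF900), 10)
    return {d: chr(p) for d, p in zip(DIGITS, points)}

def encode(text, font_map):
    """Replace digits with their obfuscated code points for the HTML."""
    return "".join(font_map.get(ch, ch) for ch in text)

font_map = make_font_map()
obfuscated = encode("box office: 1499", font_map)
```

A scraper extracting plain text gets unreadable code points; to decode, it must fetch the font and invert the glyph mapping, which is exactly the extra work this trick imposes.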
3.2 Background Collage#
Example: Meituan
Similar to the font trick, Meituan uses a background-image collage. The "digits" are actually empty elements backed by an image, and the same sprite displays a different character depending on the background offset.
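The offset arithmetic can be sketched like this, assuming a horizontal sprite where every glyph occupies a fixed-width cell. The class name and cell width are made up for the example:

```python
CELL_WIDTH = 16  # px per glyph in the hypothetical sprite image

def background_position(digit):
    """Shift the shared sprite left so only this digit's cell shows."""
    return f"{-digit * CELL_WIDTH}px 0"

def render_number(value):
    """Emit empty spans whose background offsets draw the digits."""
    return "".join(
        f'<span class="num" style="background-position:{background_position(int(d))}"></span>'
        for d in value
    )
```

Since the HTML contains no digit characters at all, a text scraper sees nothing; recovering the value means mapping each offset back to a digit, or falling back to image recognition.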
3.3 Character Interleaving#
Example: WeChat Official Account Article
In some WeChat official-account articles, strings of mysterious characters are interspersed in the text and hidden with styles. The method looks intimidating at first, but identifying and filtering out the decoys is not hard, and it could certainly be done better; still, it counts as a creative idea.
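Filtering is indeed straightforward once the pattern is known. A sketch against hypothetical markup where the junk characters are hidden with inline `display:none`:

```python
import re

def visible_text(html):
    """Drop spans hidden with display:none, then strip remaining tags."""
    html = re.sub(r'<span[^>]*display:\s*none[^>]*>.*?</span>', '', html)
    return re.sub(r'<[^>]+>', '', html)

sample = ('Re<span style="display:none">xq</span>ad '
          'm<span style="display:none">7z</span>ore')
```

A naive tag-stripping scraper would read "Rexqad m7zore"; one aware of the styling reads the real text.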
3.4 Pseudo-element Hidden#
Example: Autohome
In Autohome, important manufacturer information is placed in the content of pseudo-elements. This is another line of thinking: to scrape the page, the crawler must also parse the CSS and extract the pseudo-elements' content, which raises the bar.
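A sketch of the extra work this imposes: the DOM element is empty and the text lives only in a `::before` rule, so the scraper has to go hunting in the stylesheet. The class name and content are invented for the example:

```python
import re

CSS = '.maker::before { content: "ACME Motors"; }'
HTML = '<span class="maker"></span>'

def pseudo_content(css, class_name):
    """Recover the string a ::before rule injects for the given class."""
    m = re.search(rf'\.{class_name}::before\s*{{\s*content:\s*"([^"]*)"', css)
    return m.group(1) if m else ""
```

Stripping tags from the HTML yields an empty string; only the CSS parse recovers "ACME Motors".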
3.5 Element Position Overlay#
Example: Qunar
Qunar, which evidently loves arithmetic, renders a 4-digit ticket price with four `i` tags, then absolutely positions two `b` tags so their offsets cover the deliberately wrong `i` digits, producing the correct price visually.
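The decoding side can be sketched as replaying the CSS: each absolutely positioned `b` digit covers the `i` slot its `left` offset points at. The slot width and digit values below are invented for illustration:

```python
SLOT_WIDTH = 16  # px per digit slot (assumed)

# Base digits from the four i tags, deliberately wrong in two places.
i_digits = ["7", "2", "8", "5"]
# (left offset in px, digit) pairs taken from the absolutely positioned b tags.
b_overlays = [(0, "1"), (32, "9")]

def resolve_price(base, overlays, slot_width=SLOT_WIDTH):
    """Overlay each b digit onto the slot its left offset covers."""
    digits = list(base)
    for left, digit in overlays:
        digits[left // slot_width] = digit
    return "".join(digits)
```

A scraper that reads only the `i` tags extracts 7285; replaying the overlays yields the price the user actually sees.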
3.6 Iframe Asynchronous Loading#
Example: NetEase Cloud Music
When a NetEase Cloud Music page opens, the HTML source contains little more than a single `iframe` whose `src` is blank (`about:blank`). JavaScript then runs and asynchronously injects the entire page framework into that iframe...
3.7 Character Segmentation#
Example: Global Proxy IP
On pages that list proxy IPs, protecting the IPs themselves is also a challenge.
They split each IP's digits and dots across separate DOM nodes, then insert decoy digits in between. A scraper unaware of the trick will believe it has extracted the value successfully; one that notices the trick, however, can defeat it easily.
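Both sides can be sketched in a few lines, assuming the decoy fragments are hidden with inline `display:none` (the markup is invented):

```python
import re

HTML = ('<span>12</span><span style="display:none">88</span>'
        '<span>3.4</span><span style="display:none">9</span>'
        '<span>5.67.8</span>')

def naive_scrape(html):
    """A scraper unaware of the trick: strip tags, keep everything."""
    return re.sub(r'<[^>]+>', '', html)

def visible_scrape(html):
    """A scraper that knows the trick: drop hidden spans first."""
    html = re.sub(r'<span[^>]*display:\s*none[^>]*>[^<]*</span>', '', html)
    return re.sub(r'<[^>]+>', '', html)
```

The naive pass yields a garbled address, while filtering the hidden spans first recovers the real one.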
3.8 Character Set Replacement#
Example: Qunar Mobile
Qunar's mobile site plays a similar trick on scrapers.
The HTML source plainly reads 3211, yet the page visually displays 1233: they redefine the character set so that the glyphs for 3 and 1 are swapped, which is all it takes to produce this result.
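The effect can be modeled as a simple glyph translation: the custom font's glyph for '3' looks like '1' and vice versa, so what the source says and what the eye sees diverge. This is a sketch of the principle, not Qunar's actual font logic:

```python
# The remapped font draws '3' as '1' and '1' as '3'; other glyphs are normal.
GLYPH_SWAP = str.maketrans("31", "13")

def rendered(source_digits):
    """What the user sees once the swapped font paints the source text."""
    return source_digits.translate(GLYPH_SWAP)
```

So a scraper reading the raw HTML extracts 3211 while the user sees 1233; undoing it requires inspecting the font's glyph tables to discover the swap.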