To avoid anti-web scraping services like Datadome, we should first understand how they work. Their detection really boils down to three categories:
- IP address
- Javascript Fingerprint
- Request details
Services like Datadome use these signals to calculate a trust score for every visitor. A low score means you're likely a bot, so you'll either be asked to solve a captcha or denied access entirely. So, how do we get a high score?
IP Addresses / Proxies
For IP addresses, we want to distribute our load through proxies. There are several kinds of IP address:
- Datacenter: addresses assigned to big corporations like Google Cloud, AWS, etc.
These are awful for your bot score and should be avoided.
- Residential: addresses assigned to living spaces.
These are great for your bot score.
- Mobile: addresses assigned to mobile carriers' cell towers.
These are just as good as residential or sometimes even better.
So, to maintain a high trust score, our scraper should rotate through a pool of residential or mobile proxies.
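As a minimal sketch of what that could look like with Selenium and Chrome, here's one way to start each session on a randomly chosen proxy. The proxy endpoints below are placeholders; a real residential/mobile provider will give you its own gateway addresses and authentication scheme.

```python
import random
from selenium import webdriver

# Placeholder residential proxy endpoints - substitute your provider's gateways.
RESIDENTIAL_PROXIES = [
    "res-proxy-1.example.com:8000",
    "res-proxy-2.example.com:8000",
    "res-proxy-3.example.com:8000",
]

def new_driver_with_random_proxy():
    """Start a Chrome session routed through a randomly chosen proxy."""
    proxy = random.choice(RESIDENTIAL_PROXIES)
    options = webdriver.ChromeOptions()
    # Note: --proxy-server does not accept user:pass credentials directly;
    # authenticated proxies need IP whitelisting or a small auth extension.
    options.add_argument(f"--proxy-server=http://{proxy}")
    return webdriver.Chrome(options=options)

driver = new_driver_with_random_proxy()
driver.get("https://example.com/")
```

Rotating the proxy per browser session (rather than per request) also keeps the IP consistent with the session's cookies and fingerprint, which tends to look more natural.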
Javascript Fingerprint
This topic is far too big for a StackOverflow answer, but let's do a quick summary.
Websites can use Javascript to fingerprint the connecting client (the scraper), as Javascript leaks an enormous amount of data about it: operating system, supported fonts, visual rendering capabilities, and so on.
So, for example: if Datadome sees a bunch of Linux clients connecting through 1280x720 windows, it can deduce that this sort of setup is likely a bot and give everyone with these fingerprint details a low trust score.
If you're using Selenium to bypass Datadome, you need to patch many of these holes to get out of the low-trust zone. This can be done by patching the browser itself so that it reports fake fingerprint details such as the operating system and window size.
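As a rough example of the kind of patching involved (nowhere near a complete solution), you can hide Chrome's most obvious automation markers and inject a script that masks `navigator.webdriver` before any page code runs:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
# Drop the obvious automation markers Chrome exposes by default.
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# Use a common desktop window size instead of an unusual default.
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(options=options)

# Run a script before any page Javascript executes to mask navigator.webdriver.
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
)

driver.get("https://example.com/")
```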
For more on this, see my blog How to Avoid Web Scraping Blocking: Javascript
Request Details
Finally, even if we have loads of IP addresses and patch our browser so it stops leaking key fingerprint details, Datadome can still give us low trust scores if our connection patterns are unusual.
To get around this, our scraper should avoid obvious crawl patterns. It should visit non-target pages like the website's homepage once in a while to appear more human-like.
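Here's a minimal sketch of that idea. The URLs are placeholders, and the randomized pauses and 20% homepage-visit rate are just assumptions about what "human-like" pacing could look like, not magic numbers:

```python
import random
import time

def human_pause(low=2.0, high=8.0):
    """Sleep for a randomized interval so requests don't arrive on a fixed clock."""
    time.sleep(random.uniform(low, high))

def scrape(driver, target_urls, homepage="https://example.com/"):
    """Visit target pages with randomized pauses and occasional homepage visits."""
    for i, url in enumerate(target_urls):
        # Every now and then, wander to a non-target page like a human would.
        if i and random.random() < 0.2:
            driver.get(homepage)
            human_pause()
        driver.get(url)
        human_pause()
        # ... parse driver.page_source here ...
```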
Now that we understand how our scraper is being detected, we can start researching how to get around it. Selenium has a big community, and the keyword to look for here is "stealth". For example, selenium-stealth (and its forks) is a good starting point for patching Selenium's fingerprint leaks.
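Its basic usage looks roughly like this (the values passed to `stealth()` are common defaults; adjust them to match the persona your proxies and user agent suggest):

```python
from selenium import webdriver
from selenium_stealth import stealth  # pip install selenium-stealth

options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)

driver = webdriver.Chrome(options=options)

# Apply the library's fingerprint patches before visiting the protected site.
stealth(
    driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

driver.get("https://example.com/")
```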
Unfortunately, this area of scraping is not very transparent, as Datadome can simply collect publicly known patches and adjust its service accordingly. That means you have to figure a lot of it out yourself, or use a web scraping API that does it for you, if you want to scrape protected websites past the first few requests.
I've fitted as much as I can into this answer, so for more information see my series of blog articles on this issue: How to Scrape Without Getting Blocked