9

Incapsula is a web application delivery platform that can be used to prevent scraping.

I am working in Python with Scrapy and I found this, but it seems to be out of date and no longer works against current Incapsula. When I tested the Scrapy middleware against my target website, I got IndexErrors because the middleware was unable to extract some obfuscated parameter.

Is it possible to adapt this repo, or has Incapsula changed how it operates?

I'm also curious how it is that when I "copy as cURL" the request to my target page from Chrome dev tools, the Chrome response contains the user content, yet the curl response is an "Incapsula incident" page. This is with Chrome's cookies cleared beforehand:

curl 'https://www.radarcupon.es/tienda/fotoprix.com' \
  -H 'pragma: no-cache' -H 'dnt: 1' -H 'accept-encoding: gzip, deflate, br' \
  -H 'accept-language: en-GB,en-US;q=0.9,en;q=0.8' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/62.0.3202.94 Chrome/62.0.3202.94 Safari/537.36' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' \
  -H 'cache-control: no-cache' -H 'authority: www.radarcupon.es' \
  --compressed

I was expecting the first request from both to return something like a JavaScript challenge, which would set a cookie, but it doesn't seem to work quite like that now?

fpghost
  • It uses JavaScript, so you either need to use Splash or Selenium. I would recommend Splash if the site doesn't detect its old version of WebKit through fingerprinting (it probably will); otherwise use Selenium. Even with the right headers, it's still possible to detect bots through various browser settings, screen display, fingerprinting... So that explains why your curl won't work. Selenium will be slow but sure. – eusid Nov 30 '17 at 20:40
  • But my point is: how are they doing it from the very first request? All the server will see is the same packet. Surely they have to initially send me a response loaded with JavaScript, which would then do the fingerprinting or challenge before further content was delivered. – fpghost Dec 01 '17 at 02:41
  • Well, good luck getting someone on SO to tell you how to break a security measure. I don't know how it works. Is there some reason you can't use Selenium? Because that is the shortcut to success. – eusid Dec 01 '17 at 15:52
  • I think I probably could use Selenium for my purposes. I was also just curious about how Incapsula works. – fpghost Dec 06 '17 at 17:34
  • Have a look at [this](https://github.com/ziplokk1/incapsula-cracker-py3.git); I didn't try it though. – Evhz Apr 14 '18 at 21:08

4 Answers

6

Incapsula, like many other anti-scraping services, uses 3 types of details to identify web scrapers:

  1. IP address meta information
  2. Javascript Fingerprinting
  3. Request analysis

To get around this protection, we need to ensure that these details match those of a common web user.

IP Addresses

A natural web user usually connects from a residential or mobile IP address, whereas many production scrapers are deployed on datacenter IP addresses (Google Cloud, AWS, etc.). These three types are very different and can be distinguished by analysis of IP databases. As the names imply: datacenter means commercial IP addresses, residential means household addresses, and mobile means cell-tower-based mobile networks (3G, 4G, etc.).

So, we want to distribute our scraper network through a pool of residential or mobile proxies.
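
For example, here's a minimal sketch of routing each Scrapy request through a random proxy from a pool (the proxy URLs are placeholders; you'd substitute endpoints from a residential or mobile proxy provider):

import random

import scrapy

# Hypothetical pool of residential proxy endpoints (placeholders, not real).
PROXY_POOL = [
    "http://user:pass@residential-proxy-1.example.com:8000",
    "http://user:pass@residential-proxy-2.example.com:8000",
]

class ProxiedSpider(scrapy.Spider):
    name = "proxied"
    start_urls = ["https://www.radarcupon.es/tienda/fotoprix.com"]

    def start_requests(self):
        for url in self.start_urls:
            # Scrapy's built-in HttpProxyMiddleware honours request.meta["proxy"]
            yield scrapy.Request(url, meta={"proxy": random.choice(PROXY_POOL)})

    def parse(self, response):
        self.logger.info("fetched %s (%s)", response.url, response.status)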

Javascript Fingerprinting

Using javascript, these services can analyze the browser environment and build a fingerprint. If we are running browser automation tools (like Selenium, Playwright, or Puppeteer) as web scrapers, we need to ensure that the browser environment appears to be user-like.

This is a huge subject, but a good start would be to take a look at the puppeteer-stealth plugin, which applies patches to the browser environment to hide various details that reveal that the browser is being controlled by a script.

Note: puppeteer-stealth is incomplete, and you need to do extra work to get past Incapsula reliably.
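
To illustrate the kind of patch such plugins apply, here is a minimal sketch in Playwright for Python (my choice of tool for the example; the same idea applies to Selenium or Puppeteer). It hides navigator.webdriver, one of the most basic automation tells, and is nowhere near sufficient on its own:

# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

# Automated Chromium exposes navigator.webdriver = true by default;
# fingerprinting scripts check it, so stealth plugins mask it.
STEALTH_JS = "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.add_init_script(STEALTH_JS)  # runs before any of the page's own scripts
    page.goto("https://www.radarcupon.es/tienda/fotoprix.com")
    print(page.title())
    browser.close()

Real fingerprinting checks dozens of such properties (plugins, languages, WebGL renderer, screen metrics), which is why a single patch like this won't defeat Incapsula by itself.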

An SO answer is a bit short to cover this in full, but I wrote an extensive introduction on this subject on my blog: How to Avoid Web Scraping Blocking: Javascript

Request Analysis

Finally, the way our scraper connects plays a huge role as well. Connection patterns can be used to determine whether the client is a real user or a bot. For example, real users usually navigate the website in more chaotic patterns like going to the home page, category pages, etc.

A stealthy scraper should introduce a bit of chaos into scraping connection patterns.
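
As a sketch of what that might look like in Scrapy (the settings are real Scrapy settings; the spider itself is illustrative):

import random

import scrapy

class LessPredictableSpider(scrapy.Spider):
    """Illustrative spider: jittered timing and randomized crawl order."""
    name = "less_predictable"
    start_urls = ["https://www.radarcupon.es/"]

    custom_settings = {
        "DOWNLOAD_DELAY": 2,               # base delay between requests (seconds)
        "RANDOMIZE_DOWNLOAD_DELAY": True,  # actual wait is 0.5x-1.5x of the delay
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    }

    def parse(self, response):
        links = response.css("a::attr(href)").getall()
        random.shuffle(links)  # follow links in a less predictable order
        for href in links:
            yield response.follow(href, callback=self.parse)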

Curl is not going to cut it

As for your question about curl: since Incapsula relies on JS fingerprinting, you won't have much luck in this scenario. However, there are a few things to note that might help with other systems:

  • The HTTP/2 or HTTP/3 protocol will have a much higher success rate. Curl and many other HTTP clients default to HTTP/1.1, while the majority of real user traffic runs HTTP/2 or newer - it's a dead giveaway.
  • Header values and ordering matter too, as real browsers (Chrome, Firefox, etc.) send headers with specific values and in a specific order. If your scraper's connection differs, it's a dead giveaway. (A sketch covering both points follows this list.)
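
For instance, a hedged sketch with Python's httpx client (my choice for the example; any HTTP/2-capable client works), speaking HTTP/2 and sending Chrome-like headers in a browser-like order:

# pip install 'httpx[http2]'
import httpx

# Header names, values, and ordering roughly mimic Chrome; values are illustrative.
headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-language": "en-GB,en-US;q=0.9,en;q=0.8",
    "accept-encoding": "gzip, deflate, br",
}

with httpx.Client(http2=True, headers=headers) as client:
    resp = client.get("https://www.radarcupon.es/tienda/fotoprix.com")
    print(resp.http_version, resp.status_code)  # "HTTP/2" if the server negotiated it

This alone won't pass a JS fingerprinting challenge, but it removes the two giveaways above.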

Understanding these 3 details that differentiate bot traffic from real human traffic can help us develop stealthier scrapers. I wrote more on this subject on my blog if you'd like to learn more: How to Scrape Without Getting Blocked

Granitosaurus
0

It's difficult to give a specific answer, because Incapsula has a very detailed rules engine that can be used to block or challenge requests. Cookie detection and JavaScript support are the two most common data points used to identify suspicious traffic; user-agent strings, headers, and behavior originating from the client IP address (requests per minute, AJAX requests, etc.) can also cause Incapsula to challenge traffic. The DDoS-protection feature blocks requests aggressively if it is not configured sensibly relative to the amount of traffic a site sees.

Scott Simontis
0

There could be multiple reasons, and it's hard to pinpoint exactly which combination of rules Incapsula is applying to detect you as a bot. It could be using IP rate limiting, browser fingerprinting, header validation, TCP/IP fingerprinting, user-agent checks, etc.

But you can try:

  • Rotating IPs.

    You can easily find lists of free proxies on the internet, and you can use a solution like the scrapy-rotating-proxies middleware to configure multiple proxies in your spider and have requests rotate through them automatically (see the sketch after this list).

  • Rotating USER_AGENT.

    One way to get past this filter is to switch your USER_AGENT to a value copied from those that popular web browsers use. In some rare cases, you may need a user-agent string from a specific web browser. There are multiple Scrapy plugins that can rotate your requests through popular web browser user-agent strings, such as scrapy-random-useragent or Scrapy-UserAgents (also sketched below).

  • You can try inspecting the requests in developer tools and reverse engineering the request parameters.
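
As a sketch of the first two ideas combined (the ROTATING_PROXY_* settings and middleware paths come from the scrapy-rotating-proxies package; the user-agent middleware is a minimal hand-rolled stand-in for the plugins above, and the proxy addresses and module path are placeholders):

# settings.py
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",  # placeholder proxies
    "proxy2.example.com:8031",
]

DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
    "myproject.middlewares.RandomUserAgentMiddleware": 400,  # hypothetical path
}

# middlewares.py
import random

USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0",
]

class RandomUserAgentMiddleware:
    """Assign a random user-agent string to every outgoing request."""
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)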

Mostly in such scenarios, the objective is to avoid getting banned by crawling with best practices in mind; you can read about them here. Or you can try dedicated tools for the same, like Smart Proxy Manager or Smart Browser. I work as a Developer Advocate @Zyte.

cigien
    When linking to your own site or content (or content that you are affiliated with), you [must disclose your affiliation _in the answer_](/help/promotion) in order for it not to be considered spam. Having the same text in your username as the URL or mentioning it in your profile is not considered sufficient disclosure under Stack Exchange policy. – cigien Jun 16 '22 at 16:09
  • Thanks @cigien, this is helpful :) Can you share some readings/tips that would help someone starting out on Stack Overflow? – Neha Setia Nagpal Jun 20 '22 at 07:00
  • Taking the [tour] is helpful, which I assume you've already done. You can take a look at [help], which has a lot of useful information. [meta] is also a good resource, but there's quite a lot of information there, so don't worry about understanding all of it at once. – cigien Jun 20 '22 at 15:04
0

I ran into the same issue scraping an Incapsula-protected site, and for whatever reason, this actually worked:

try:
    results = scrape_data(url)
except Exception:
    # Simply retrying the same call once succeeded where the first attempt failed.
    results = scrape_data(url)

The site is the California SDWIS portal for water-quality data, and I'm just grabbing the data from the site's pages to validate some info. Until recently there was no issue scraping everything, but then they changed it to start throwing the errors you mentioned in your question after the first several hundred pages were scraped.

If you're wondering why such a hacky and simple solution seems to work, that's a very good question. My favorite theory is that they didn't QA-test it to reject a retry attempt, and/or retrying introduces enough randomness in how long you wait that it tricks their bot detection. Of course, it could also be a bug, or maybe something about the site requires neutering certain bot-detection features.

Basically, that's what worked for me and I have no idea why.
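
If the randomness theory is right, you can make the wait explicit. A small, hedged generalization of the same hack (assuming your scrape_data raises an exception when Incapsula blocks the request):

import random
import time

def scrape_with_retries(url, attempts=3):
    """Retry with a randomized pause so repeat requests don't arrive on a fixed clock."""
    for attempt in range(attempts):
        try:
            return scrape_data(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(random.uniform(2, 10))  # jittered wait before retrying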