
What I am trying to do is scrape a restaurant page using a URL taken from my database. The host is https://www.just-eat.co.{tenant}. From the response I extract window.__INITIAL_STATE__, which contains the JSON I need.

import json
import re

import requests

for resto in restos:
    # derive the Host header from the menu URL (www.just-eat.co.{tenant})
    host = resto['menu_url'].replace('https://', '').split('/')[0]
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Content-Type': 'application/json',
        'Host': host,
        'sec-ch-ua': '"Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93"',
        'sec-ch-ua-mobile': '?0',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
    }

    response = requests.get(url=resto['menu_url'], headers=headers)
    # extract the JSON assigned to window.__INITIAL_STATE__ from the page source
    data = re.search(r'(?<=window\.__INITIAL_STATE__=)(.*)(?=<)', response.text).group(1)
    data = json.loads(data)

Here is the problem: when I scrape a set of restaurants, the first ~5 requests return the full HTML of the page, then suddenly I get the HTML below, then suddenly the full HTML again, and so on.

<html>
    <head>
        <META NAME="robots" CONTENT="noindex,nofollow">
        <script src="/_Incapsula_Resource?SWJIYLWA=5074a7">
        </script>
    <body>
    </body>
</html>

Getting this HTML gives me an error because I access the JSON with fixed keys. Try/except is not a solution, since the restaurant URL opens fine in the browser (unless the page genuinely cannot be found). What I want is to never receive the HTML above, only the full HTML of the page containing window.__INITIAL_STATE__:

<script>window.__INITIAL_STATE__={...

Also, I am using a VPN to access the platform, since it is blocked in my country.

What am I missing here? Is it something to do with the headers? I copied them from the browser's request when accessing the restaurant URL.

Tenserflu
  • Maybe you're making requests too quickly. Add a `sleep()` with some random timeout between requests and see if you still get detected. – Andrej Kesely Sep 24 '21 at 14:10
  • @AndrejKesely Already tried that; the response is still the same, and I'm still being detected as a bot. – Tenserflu Sep 24 '21 at 14:12
  • You can try using proxies. But the result is almost always unpredictable (some proxies might've already been banned, etc.). – Andrej Kesely Sep 24 '21 at 14:14

1 Answer


Possible causes:

1. Scraping too quickly can cause the system to detect you as a bot. Add time.sleep() between requests to slow things down (see the first sketch after this list).

2. In my experience, when a site can detect that you are a bot, it often checks whether you carry the cookies it gives normal users while they browse, so take a look at the cookies it has given you and see if reusing them works. There are multiple libraries that work with requests to handle cookies; a requests.Session keeps them automatically (covered in the first sketch after this list). Reference

3. Some websites also check whether your client has JavaScript enabled; if it is disabled, you can be detected as a bot. Reference

4. Finally, some websites use Cloudflare or other bot-detection services that are very hard to bypass. Just because the "Checking your browser's IP. Powered by Cloudflare." screen doesn't show up when you enter the site doesn't mean they are not using such a service. The cfscrape and cloudscraper modules may work on some sites, though usually they do not (see the second sketch below). Reference
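
A minimal sketch combining points 1 and 2, reusing the restos loop and headers from the question; the delay values and the block-page handling are only illustrative assumptions, not tuned values:

import json
import random
import re
import time

import requests

session = requests.Session()        # a Session keeps any cookies the site sets (point 2)
session.headers.update(headers)     # same headers as in the question; 'Host' can be dropped, requests sets it

for resto in restos:
    response = session.get(resto['menu_url'])
    match = re.search(r'(?<=window\.__INITIAL_STATE__=)(.*)(?=<)', response.text)
    if match:
        data = json.loads(match.group(1))
        # ... process data ...
    else:
        # probably the Incapsula block page; back off a bit longer before continuing
        time.sleep(random.uniform(10, 20))
    time.sleep(random.uniform(2, 5))  # random pause between requests (point 1)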
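
For point 4, a hedged sketch using the cloudscraper package (pip install cloudscraper); whether it gets past Incapsula/Cloudflare on this particular site is not guaranteed:

import cloudscraper

# create_scraper() returns a requests.Session-like object that tries to solve the JS challenge
scraper = cloudscraper.create_scraper()
response = scraper.get(resto['menu_url'])
print(response.status_code, len(response.text))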

Rami M
  • 5. Try not to use requests for scraping; use a proper, remote-controlled browser instead, e.g. with python-playwright. – 576i Sep 24 '21 at 14:38
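
A short sketch of the python-playwright suggestion above (pip install playwright, then playwright install chromium); the URL comes from the question's resto dict, and reading window.__INITIAL_STATE__ via evaluate() is an assumption about this page's structure:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(resto['menu_url'], wait_until='networkidle')
    # the browser has already executed the page's scripts, so read the state object directly
    data = page.evaluate('window.__INITIAL_STATE__')
    browser.close()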