I am trying to scrape restaurants using the URLs stored in my database. The host is https://www.just-eat.co.{tenant}. From each response I extract window.__INITIAL_STATE__, which contains the JSON.
import json
import re

import requests

# restos: rows loaded from the database, each with a 'menu_url' key
for resto in restos:
    # Host part of the menu URL, sent as the Host header
    host = resto['menu_url'].replace('https://', '').split('/')[0]
    # Headers copied from the browser request
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Content-Type': 'application/json',
        'Host': host,
        'sec-ch-ua': "\"Google Chrome\";v=\"93\", \" Not;A Brand\";v=\"99\", \"Chromium\";v=\"93\"",
        'sec-ch-ua-mobile': '?0',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
    }
    response = requests.get(url=resto['menu_url'], headers=headers)
    # Capture everything between 'window.__INITIAL_STATE__=' and the last '<' on that line
    data = re.search(r'(?<=window\.__INITIAL_STATE__=)(.*)(?=<)', response.text).group(1)
    data = json.loads(data)
Here is the problem: when I scrape a set of restaurants, the first five or so requests return the full HTML of the page, then I suddenly get the HTML below instead, then the full HTML again, and so on.
<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=5074a7">
</script>
<body>
</body>
</html>
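For reference, here is a minimal sketch of how the short page above could be detected and the request retried after a pause. The marker string `_Incapsula_Resource` is taken from the HTML above; `is_challenge_page`, `fetch_with_retry`, the `fetch` callable, and the delay values are hypothetical names and numbers chosen for illustration. I have no evidence that simply waiting is enough to get the full page back, so treat this as a diagnostic aid rather than a fix.

```python
import time
from typing import Callable, Optional

# Marker taken from the short HTML above; assumed to identify that page.
CHALLENGE_MARKER = "_Incapsula_Resource"


def is_challenge_page(html: str) -> bool:
    """Return True if the HTML looks like the short Incapsula page."""
    return CHALLENGE_MARKER in html and "window.__INITIAL_STATE__" not in html


def fetch_with_retry(
    fetch: Callable[[], str],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until it returns a non-challenge page or attempts run out.

    fetch is a hypothetical zero-argument callable returning the page HTML,
    e.g. lambda: requests.get(url, headers=headers).text
    """
    for attempt in range(attempts):
        html = fetch()
        if not is_challenge_page(html):
            return html
        # Exponential backoff between attempts: base_delay, 2x, 4x, ...
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None
```

Passing `fetch` and `sleep` in as callables keeps the retry logic independent of `requests`, so it can be exercised with canned HTML strings before pointing it at the real site.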
Getting this HTML raises an error, because my code expects the JSON and accesses it with fixed keys. A try/except is not a solution: I can open the resto URL in a browser, so the page does exist (the only legitimate failure would be a page that cannot be found). What I want is to never receive the HTML above, only the full HTML of the page, which contains window.__INITIAL_STATE__:
<script>window.__INITIAL_STATE__={...
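To make the difference between the two responses explicit, here is a minimal sketch that returns the parsed state when the page contains it and `None` otherwise. It assumes the value assigned to `window.__INITIAL_STATE__` is plain JSON (which is what my `json.loads` call already relies on); `extract_initial_state` is a hypothetical helper name. It uses `json.JSONDecoder.raw_decode` instead of a greedy regex so that a `<` inside a JSON string does not cut the match short.

```python
import json
import re
from typing import Any, Optional

# Matches the assignment prefix; the JSON value is expected to start right after it.
STATE_PREFIX = re.compile(r"window\.__INITIAL_STATE__\s*=\s*")


def extract_initial_state(html: str) -> Optional[Any]:
    """Return the decoded window.__INITIAL_STATE__ value, or None if absent."""
    match = STATE_PREFIX.search(html)
    if match is None:
        # No state in the page, e.g. the short Incapsula HTML.
        return None
    try:
        # raw_decode parses one JSON value starting at the given index
        # and ignores whatever follows it (such as '</script>').
        state, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        # The prefix was found but the value is not valid JSON.
        return None
    return state
```

With this, the loop can check `if state is None` and log the URL, which at least shows which requests received the short page.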
Also, I am using a VPN to access the resto platform, since it is blocked in my country.
What am I missing here? Does it have something to do with the headers? I copied them from the request my browser sends when I open the resto URL.