I know this is a header issue because if I remove the headers from my code, the returned HTML says that I am a bot. Even with the headers added, though, I can't figure out how to get around the block. What advice can you give?

import requests
from bs4 import BeautifulSoup


# Page to begin scraping data from
url = "http://www.manta.com/mb_41_ALL_19/louisiana"

# Browser-like headers sent with the request
headers = {
    'Origin': 'http://www.manta.com',
    'Referer': 'http://www.manta.com/mb_41_ALL_19/louisiana',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.8',
    'Content-Type': 'text/html; charset=utf-8',
    'Host': None,
}

# Fetch the page and parse the returned HTML
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
print(soup)
Kamikaze_goldfish
  • Their CDN is using browser fingerprinting to block bots. It can be done with requests but it means reverse-engineering some javascript. – pguardiario Mar 29 '17 at 00:13

1 Answer

Bad news: look at what you get back in the response body:

<div id="distil_ident_block"></div>

The "distil" in that id is a sign of the "Distil Networks" anti-web-scraping service. And the site has its reasons for using it. Quote from the Manta "Terms of Service":

We give you a limited right to access and use Manta. You are not authorized to access Manta or its computers, servers and databases to scrape or “data mine” our data.

Technically, you can challenge Distil, but legally you should not.
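
If you want your script to recognize this block page instead of silently parsing it, a check along the following lines may help. This is only a minimal sketch: the helper names (looks_like_distil_block, _IdCollector) and the sample HTML strings are made up for illustration, and the only thing taken from the actual response is the distil_ident_block id shown above. A page without that marker is not guaranteed to be unblocked.

```python
from html.parser import HTMLParser


class _IdCollector(HTMLParser):
    """Collects the id attribute of every tag in a document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def looks_like_distil_block(html, marker_id="distil_ident_block"):
    """Return True if the HTML contains an element with the Distil marker id."""
    collector = _IdCollector()
    collector.feed(html)
    return marker_id in collector.ids


# Illustrative inputs only; in practice you would pass response.text
blocked_page = '<html><body><div id="distil_ident_block"></div></body></html>'
normal_page = '<html><body><div id="listing">Acme Inc.</div></body></html>'

print(looks_like_distil_block(blocked_page))  # True
print(looks_like_distil_block(normal_page))   # False
```

When the check returns True, the sensible reaction is to stop and report the block rather than retry with different headers.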

alecxe
  • Yikes that sucks! Oh well on to the next project :) Thanks – Kamikaze_goldfish Jan 04 '17 at 19:39
  • @Kamikaze_goldfish they have a way of detecting Selenium as well (http://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver). That can apparently be worked around, but they'll have more challenges for you. In any case, web-scraping would be illegal here. I'd say your best bet is to be explicit: contact the manta.com owners/maintainers and ask for a better way to get the data, and stay on the legal side of things. – alecxe Jan 04 '17 at 19:46