
I am unable to extract the content from this website. I have tried adding different headers, but I am still not able to scrape data from it.

import requests
from bs4 import BeautifulSoup

seedURL = 'https://www.owler.com/location/new-york-companies?p=2'

# headers = requests.utils.default_headers()
# headers.update({
#     'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
# })
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0'}

req_content = requests.get(seedURL, headers=headers)
data = BeautifulSoup(req_content.content,"lxml")
print(data)

This is the response that I get

<!DOCTYPE html>
<html>
<head>
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<meta content="max-age=0" http-equiv="cache-control"/>
<meta content="no-cache" http-equiv="cache-control"/>
<meta content="0" http-equiv="expires"/>
<meta content="Tue, 01 Jan 1980 1:00:00 GMT" http-equiv="expires"/>
<meta content="no-cache" http-equiv="pragma"/>
<meta content="10; url=/distil_r_captcha.html?requestId=c4aceb58-d5b5-480d-a09f-dafd9cca7cbe&amp;httpReferrer=%2Flocation%2Fnew-york-companies%3Fp%3D2" http-equiv="refresh"/>
<script type="text/javascript">
    (function(window){
        try {
            if (typeof sessionStorage !== 'undefined'){
                sessionStorage.setItem('distil_referrer', document.referrer);
            }
        } catch (e){}
    })(window);
</script>
<script defer="" src="/owlerdstl.js" type="text/javascript"></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#dqfubwvfuxfsxffus{display:none!important}</style></head>
<body>
<div id="distilIdentificationBlock"> </div>
</body>
</html>
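The markup above is a bot-detection interstitial served by Distil Networks, not the real page. Before reaching for a headless browser, it can be useful to detect when you have been served this block instead of the actual listing. A minimal sketch, assuming the "distil" markers and the `ROBOTS NOINDEX` meta tag from the blocked response shown above (other sites' block pages will use different markers):

```python
def looks_like_distil_block(html: str) -> bool:
    """Heuristic: does this HTML look like the Distil interstitial above?

    Checks for the "distil" script/element names and the
    NOINDEX, NOFOLLOW robots meta tag seen in the blocked response.
    """
    lowered = html.lower()
    return "distil" in lowered or 'content="noindex, nofollow"' in lowered
```

You could then branch on `looks_like_distil_block(req_content.text)` after the `requests.get` call to decide whether to retry or fall back to a real browser.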

1 Answer


Try this. It should let you fetch the content you are after:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("https://www.owler.com/location/new-york-companies?p=2")

for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".company-details"))):
    # find_element_by_id was removed in Selenium 4; use find_element(By.ID, ...)
    company_name = item.find_element(By.ID, "company-name-1").text
    ceo_name = item.find_element(By.ID, "ceo-name-1").text
    print(company_name, ceo_name)

driver.quit()

Partial output:

Mercer Julio A. Portalatin
Thomson Reuters James C. Smith
Bloomberg, L.P. Michael R. Bloomberg
American Express Co Kenneth I. Chenault
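If you would rather keep the parsing in BeautifulSoup, you can hand it `driver.page_source` once the wait completes. A sketch of that split, assuming the `.company-details` container and `company-name-N` / `ceo-name-N` id pattern from the answer above (adjust the selectors if the live markup differs):

```python
from bs4 import BeautifulSoup


def extract_companies(html: str):
    """Return (company, CEO) pairs from rendered listing HTML.

    Assumes each result sits in a ".company-details" block containing
    elements whose ids start with "company-name" and "ceo-name".
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for block in soup.select(".company-details"):
        name = block.select_one('[id^="company-name"]')
        ceo = block.select_one('[id^="ceo-name"]')
        if name and ceo:
            results.append((name.get_text(strip=True), ceo.get_text(strip=True)))
    return results
```

Calling `extract_companies(driver.page_source)` after the `wait.until(...)` call should yield the same pairs, and keeps the Selenium session short-lived.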