
The code below gets stuck after printing hi in the output. Can you please check what is wrong with it? Is the site secured in some way so that I need special authentication?

from bs4 import BeautifulSoup
import requests

print('hi')
rooturl = 'http://www.hoovers.com/company-information/company-search.html'
r = requests.get(rooturl)  # execution never gets past this call
print('hi1')
soup = BeautifulSoup(r.content, "html.parser")
print('hi2')
print(soup)
Shravan Yadav
  • When you say the "code got stuck", what do you mean? Is there an error? Does it just not do anything? – G. Anderson Dec 14 '18 at 15:35
  • I'm not asking what's happening in the background, I'm asking about the behavior you're seeing. Part of a good [mcve](https://stackoverflow.com/help/mcve) is accurately describing what you are experiencing and how that's different than what you expect. – G. Anderson Dec 14 '18 at 16:50
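
A side note on the hang itself: requests.get has no default timeout, so a server that never responds blocks the call indefinitely. A minimal sketch that bounds the wait instead (the 10-second value is an arbitrary choice, not from the original post):

import requests

rooturl = 'http://www.hoovers.com/company-information/company-search.html'
try:
    # Without a timeout, requests will wait on the server forever.
    r = requests.get(rooturl, timeout=10)
    print(r.status_code)
except requests.Timeout:
    print('server did not respond within 10 seconds')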

2 Answers


> Unable to read html page from beautiful soup

The reason you got this problem is that the website considers you a robot, so it won't send anything back. It may even hang the connection and let you wait forever.

If you imitate a browser's request, the server will consider that you are not a robot.

Adding headers is the simplest way to deal with this problem, but sometimes you should not pass User-Agent alone (as in this case). Remember to copy your browser's request and then remove the useless elements through testing (see the trimming sketch after the code below). If you are lazy, you can use the browser's headers wholesale, but you must not copy all of them when you want to upload files.

from bs4 import BeautifulSoup
import requests

rooturl = 'http://www.hoovers.com/company-information/company-search.html'
with requests.Session() as se:
    # Imitate a real browser's request headers so the server does not
    # treat the client as a bot and hang the connection.
    se.headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en"
    }
    resp = se.get(rooturl)

print(resp.content)
soup = BeautifulSoup(resp.content, "html.parser")
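
To make the header-trimming advice concrete, here is a rough sketch that retries the request with each header removed in turn. The success check (HTTP 200 and a non-trivial body) is an illustrative assumption, not something from the original answer:

import requests

rooturl = 'http://www.hoovers.com/company-information/company-search.html'
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en",
}

def looks_ok(resp):
    # Illustrative success check: HTTP 200 and a body that is not near-empty.
    return resp.status_code == 200 and len(resp.content) > 1000

for name in browser_headers:
    # Retry with one header dropped; if the request then fails or times out,
    # that header is one the server insists on.
    trimmed = {k: v for k, v in browser_headers.items() if k != name}
    try:
        resp = requests.get(rooturl, headers=trimmed, timeout=10)
        verdict = 'optional' if looks_ok(resp) else 'needed'
    except requests.RequestException:
        verdict = 'needed'
    print(f'{name}: {verdict}')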
KC.
  • it worked, but can you please elaborate a little bit? Because I tried the same thing when @chitown88 used user_agent() in the headers. – Shravan Yadav Dec 16 '18 at 10:43
  • Unlike most people, I start from the whole browser request and then keep removing lines through testing. The reason I believed the issue was related to the headers is that when I disabled JavaScript and reloaded the page, nothing changed. In general, if a site does not respond correctly, there are three reasons to consider: 1. JavaScript, 2. the request content, 3. some fronting service (such as Cloudflare) – KC. Dec 16 '18 at 10:54
  • And going through the request content, and seeing that the site does not use HTTPS, I judged the reason was not situation 3. So what I needed was to imitate the request content. – KC. Dec 16 '18 at 10:59
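
As a rough check of point 3 in that list, a fronting service often identifies itself in the response's Server header (Cloudflare, for example, reports "cloudflare"). A hedged sketch, assuming the browser-like headers above are enough to get any response at all:

import requests

resp = requests.get('http://www.hoovers.com/company-information/company-search.html',
                    headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
# An ordinary origin reports its own web server here instead of a CDN/WAF name.
print(resp.headers.get('Server', '(no Server header)'))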

I was having the same issue as you: it just sat there. I tried adding a user-agent, and it pulled the page relatively quickly. I don't know why that is, though.

from bs4 import BeautifulSoup
import requests

# A browser-like User-Agent alone was enough here to stop the server
# from treating the request as a bot.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

print('hi')
rooturl = 'http://www.hoovers.com/company-information/company-search.html'
r = requests.get(rooturl, headers=headers)
print('hi1')
soup = BeautifulSoup(r.content, "html.parser")
print('hi2')
print(soup)

EDIT: So odd. Now it's not working for me again: first it didn't work, then it did, now it doesn't. But there is another potential option, using Selenium.

from bs4 import BeautifulSoup
from selenium import webdriver

# Drive a real browser so the site sees an ordinary Chrome visitor.
browser = webdriver.Chrome()
browser.get('http://www.hoovers.com/company-information/company-search.html')

r = browser.page_source
print('hi1')
soup = BeautifulSoup(r, "html.parser")
print('hi2')
print(soup)

browser.close()
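
If the page does turn out to be dynamic, reading page_source immediately can race the page load. A small sketch using Selenium's explicit waits; the ten-second limit and the <body> locator are illustrative choices, not from the original answer:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('http://www.hoovers.com/company-information/company-search.html')
# Block (up to 10 s) until the <body> element exists before reading the source.
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'body')))
soup = BeautifulSoup(browser.page_source, "html.parser")
browser.quit()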
chitown88
  • Interesting, I tried to do the same but it still waits. Maybe because I first tried without it to reproduce the problem. – alecxe Dec 14 '18 at 15:36
  • They are probably detecting crawlers and blocking them – Alex W Dec 14 '18 at 15:38
  • I have used user_agent: `from user_agent import generate_user_agent; headers = {'User-Agent': generate_user_agent()}`. Still no luck – Shravan Yadav Dec 14 '18 at 16:18
  • I'm sorry. I really don't know what the issue is then. Worked for me. Hopefully someone has further insight? I'll keep searching though – chitown88 Dec 14 '18 at 18:53
  • You are almost right; and if this is not a dynamic page, you do not need Selenium. – KC. Dec 15 '18 at 07:44
  • hey, it worked after downloading chromedriver.exe and pointing your second option at its path: `browser = webdriver.Chrome("E:\GitHub\chromedriver_win32\chromedriver.exe")` – Shravan Yadav Dec 16 '18 at 10:37
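
For completeness, the driver-path variant from the comment above, written with a raw string so the backslashes in the Windows path are not read as escape sequences (the path itself is just the commenter's example location):

from selenium import webdriver

# In Selenium 3 the chromedriver path can be passed as the first argument.
browser = webdriver.Chrome(r"E:\GitHub\chromedriver_win32\chromedriver.exe")
browser.get('http://www.hoovers.com/company-information/company-search.html')
print(browser.page_source[:200])  # quick sanity check that the page loaded
browser.quit()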