
As the title states, urlopen gets stuck while opening a URL.

The Code:

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client

page_url = "https://store.hp.com/us/en/pdp/hp-laserjet-pro-m404n?jumpid=ma_weekly-deals_product-tile_printers_3_w1a52a_hp-laserjet-pro-m404"

uClient = uReq(page_url)

# parses html into a soup data structure to traverse html
# as if it were a json data type.
page_soup = soup(uClient.read(), "html.parser")

uClient.close()

print(page_soup)

The Issue: It gets stuck on uReq. However, if you replace page_url with the following link, everything works just fine.

page_url= "http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"

The Error: Timeout Error

How can I open a given URL for web scraping purposes?
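As an aside, one way to turn the silent hang into a visible error is urlopen's timeout parameter. This is a minimal sketch (the 10-second limit is an arbitrary choice, not from the original code); a stuck request then raises an error quickly instead of blocking indefinitely:

```python
import socket
from urllib.error import URLError
from urllib.request import urlopen

page_url = "https://store.hp.com/us/en/pdp/hp-laserjet-pro-m404n?jumpid=ma_weekly-deals_product-tile_printers_3_w1a52a_hp-laserjet-pro-m404"

try:
    # Fail fast instead of hanging: abort if no response within 10 seconds.
    uClient = urlopen(page_url, timeout=10)
    html = uClient.read()
    uClient.close()
except (URLError, socket.timeout) as err:
    print("request failed:", err)
```

This does not fix the underlying cause (the server ignoring the request), but it makes the failure explicit and debuggable.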

EDIT

A new error appears after adding urllib.request.Request.

1 Answer


Some websites require a User-Agent header to produce a successful request. Import urllib.request.Request and modify your code as follows:

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq, Request  # Web client

page_url = "https://store.hp.com/us/en/pdp/hp-laserjet-pro-m404n?jumpid=ma_weekly-deals_product-tile_printers_3_w1a52a_hp-laserjet-pro-m404"

uClient = uReq(Request(page_url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'
}))

# parses html into a soup data structure to traverse html
# as if it were a json data type.
page_soup = soup(uClient.read(), "html.parser")

uClient.close()

print(page_soup)

and you should be fine.

xxMrPHDxx
  • Check the edit; I tried your version and an error comes up. –  Mar 01 '20 at 18:17
  • @Roman You'll need to decode since it has Unicode characters `uClient.read().decode('utf-8')` (Note: `urlopen().read()` returns bytes) – xxMrPHDxx Mar 01 '20 at 18:21
  • At the page_soup line? page_soup = soup(uClient.read().decode('utf-8'), "html.parser") - I tried it in a lot of places and it doesn't seem to work. –  Mar 01 '20 at 18:41
  • @Roman You might want to take a look at [this](https://stackoverflow.com/questions/14630288/unicodeencodeerror-charmap-codec-cant-encode-character-maps-to-undefined) then and see if it could resolve the issue – xxMrPHDxx Mar 02 '20 at 01:07
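Putting the answer and the comment thread together, a sketch of the combined fix might look like the following. It builds the Request with the browser-like User-Agent header from the answer and shows, as a comment, the explicit UTF-8 decode suggested in the comments (the fetch itself is left commented out so the snippet stands alone):

```python
from urllib.request import urlopen, Request

page_url = "https://store.hp.com/us/en/pdp/hp-laserjet-pro-m404n?jumpid=ma_weekly-deals_product-tile_printers_3_w1a52a_hp-laserjet-pro-m404"

# Build the request with a browser-like User-Agent header so the
# server does not silently stall the connection.
req = Request(page_url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'
})

# urlopen(req) would perform the fetch; urlopen().read() returns bytes,
# so decoding explicitly avoids the Unicode issue from the comments:
# html = urlopen(req).read().decode('utf-8')
```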