
I want to write a web scraper to collect the titles of articles from the Medium.com homepage.

I am trying to write a Python script that scrapes headlines from Medium.com. I am using Python 3.7 and imported urlopen from urllib.request, but it cannot open the site and raises an

 "urllib.error.HTTPError: HTTP Error 403: Forbidden" error.
from bs4 import BeautifulSoup
from urllib.request import urlopen

webAdd = urlopen("https://medium.com/")        # the HTTPError is raised here
bsObj = BeautifulSoup(webAdd.read(), "html.parser")

Result: urllib.error.HTTPError: HTTP Error 403: Forbidden

The expected result is that it reads the site without any error.

However, the error does not occur when I use the requests module:

import requests 
from bs4 import BeautifulSoup 
url = 'https://medium.com/' 
response = requests.get(url, timeout=5)

This time it works without any error.

Why?

3 Answers


urllib is a fairly old, low-level module. For web scraping, the requests module is recommended. You can check out this answer for additional information.
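For example, here is a minimal sketch of the requests-based approach combined with BeautifulSoup to pull out headline text. The assumption that Medium's article titles appear in h2 tags is mine and may not match the site's current markup, so inspect the page and adjust the selector as needed.

import requests
from bs4 import BeautifulSoup

url = 'https://medium.com/'
response = requests.get(url, timeout=5)
response.raise_for_status()                      # fail loudly on 403, 404, etc.

soup = BeautifulSoup(response.text, 'html.parser')

# Assumption: headlines are rendered in <h2> tags; change the selector
# if Medium's markup differs.
for heading in soup.find_all('h2'):
    print(heading.get_text(strip=True))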

– Murtaza Haji

Many sites nowadays check the User-Agent header to try to deter bots. requests is the better module to use, but if you really want to use urllib, you can alter the request headers to pretend to be Firefox (or another browser) so that the request is not blocked. A quick example can be found here:

https://stackoverflow.com/a/16187955

import urllib.request

user_agent = 'Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion'

url = "http://example.com"
request = urllib.request.Request(url)
request.add_header('User-Agent', user_agent)
response = urllib.request.urlopen(request)

You will need to replace the placeholders in the user_agent string (platform, geckoversion, and so on) with real values. Hope this helps.
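As a variation on the snippet above, the headers can also be passed directly to the Request constructor instead of calling add_header; this is just an equivalent spelling of the same idea, and the User-Agent string below is illustrative rather than anything Medium specifically requires.

import urllib.request

# Illustrative User-Agent; any reasonably browser-like value can work.
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0'}

request = urllib.request.Request('https://medium.com/', headers=headers)
with urllib.request.urlopen(request) as response:
    html = response.read()

print(len(html))   # the page body is now available for parsing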

– Nick H
  • As a note, it seems like you don't need to necessarily fake out a browser's user agent. Just doing `request.add_header('User-Agent', 'my app name')` appears to work for medium. – ptd Jun 05 '19 at 15:29
  • Thanks a lot!! Would you tell me how you all learned about all this? Is there a book? This is such next-level code!!! – user7360021 Jun 05 '19 at 16:40
  • It comes with practice. Pretty soon you will encounter dynamic sites which require you to render Javascript. `Selenium` is a webdriver you could use to achieve that. – Murtaza Haji Jun 05 '19 at 21:12
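Following up on the Selenium suggestion in the comment above, here is a minimal sketch of rendering a JavaScript-heavy page in a real browser before parsing it. It assumes Selenium and a compatible Firefox/geckodriver setup are installed, and the h2 selector is an illustrative guess, not Medium's documented markup.

from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.FirefoxOptions()
options.add_argument('--headless')          # run the browser without a window

driver = webdriver.Firefox(options=options)
try:
    driver.get('https://medium.com/')
    # The browser executes the page's JavaScript, so page_source contains
    # the fully rendered HTML rather than the initial server response.
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for heading in soup.find_all('h2'):     # illustrative selector
        print(heading.get_text(strip=True))
finally:
    driver.quit()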

This worked for me:

from urllib.request import urlopen

html = urlopen(MY_URL)        # MY_URL is a placeholder for the page to fetch
contents = html.read()
print(contents)
– alex