
I get the error `urllib.error.HTTPError: HTTP Error 403: Forbidden` when scraping certain pages, and I understand that adding something like `hdr = {'User-Agent': 'Mozilla/5.0'}` to the request headers is the solution for this.

However, I can't make it work when the URLs I'm trying to scrape are in a separate source file. How/where can I add the User-Agent to the code below?

from bs4 import BeautifulSoup
import urllib.request as urllib2
import time

list_open = open("source-urls.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")

i = 0
for url in line_in_list:
    soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html.parser')
    name = soup.find(attrs={'class': "name"})
    description = soup.find(attrs={'class': "description"})
    for text in description:
        print(name.get_text(), ';', description.get_text())
#        time.sleep(5)
    i += 1
Dharman
Espen
  • Have you tried reading the `urllib` docs? Or maybe using something more user-friendly like [`requests`](http://docs.python-requests.org/en/master/)? – MattDMo Jan 07 '17 at 23:37
  • Yes, but I still can't get it work.. If I add the variable `hdr = {"User-Agent': 'Mozilla/5.0"}` and change the soup-line to `soup = BeautifulSoup(urllib2.urlopen(url, headers=hdr).read(), 'html.parser')` Python gives me an unexpected agrument on the `headers` word. Any idea? Thanks – Espen Jan 08 '17 at 00:07
  • You didn't read my comment. **1.** Read the [relevant documentation](https://docs.python.org/2/library/urllib2.html#urllib2.urlopen) before asking a question - in this case, the function has no `headers` parameter. **2.** As I said, and as [the docs say](https://docs.python.org/2/library/urllib2.html), you should be using `requests` instead. The only reason requests isn't in the std lib is because it is still under active development, and the maintainers didn't want to be dependent on the Python release schedule. Use it. Your life will be easier. – MattDMo Jan 08 '17 at 16:37
  • I read through the documentation, but didn't quite catch the point. Programming isn't what I can best, but I'm still learning. Thanks! – Espen Jan 08 '17 at 22:31
  • Looks similar to [urllib2.HTTPError: HTTP Error 403: Forbidden](https://stackoverflow.com/questions/13055208/httperror-http-error-403-forbidden/13055444#13055444) – Supreet Sethi Nov 06 '17 at 17:56

1 Answer


You can achieve the same thing using `requests`:

import requests
from bs4 import BeautifulSoup

hdrs = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}

for url in line_in_list:
    resp = requests.get(url, headers=hdrs)
    soup = BeautifulSoup(resp.content, 'html.parser')
    name = soup.find(attrs={'class': "name"})
    description = soup.find(attrs={'class': "description"})
    print(name.get_text(), ';', description.get_text())
#    time.sleep(5)

Note that the inner `for text in description:` loop from your original code was dropped, since it only repeats the same `print` once per child node of `description`.
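If you want to stick with the standard library instead, the comments above point out that `urlopen()` itself has no `headers` parameter; the header goes on a `urllib.request.Request` object, which you then pass to `urlopen()`. A minimal sketch (using `example.com` as a placeholder URL):

```python
import urllib.request

hdr = {'User-Agent': 'Mozilla/5.0'}

# Wrap the URL in a Request object carrying the header,
# then pass that Request to urlopen instead of the bare URL.
req = urllib.request.Request('https://example.com', headers=hdr)
# html = urllib.request.urlopen(req).read()

# The header is now attached to the request:
print(req.get_header('User-agent'))  # Mozilla/5.0
```

In your loop, that means replacing `urllib2.urlopen(url)` with `urllib2.urlopen(urllib2.Request(url, headers=hdr))`.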

Hope it helps!

Om Prakash