
I am trying to scrape this url = 'http://www.jmlr.org/proceedings/papers/v36/li14.pdf'. This is my code:

    import requests
    from bs4 import BeautifulSoup

    html = requests.get(url)
    htmlText = html.text
    soup = BeautifulSoup(htmlText)
    print soup  # gives garbage

However, it gives weird symbols that I think are garbage. It's an HTML file, so it shouldn't be trying to parse it as a PDF, should it?

I tried the following: How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?

    import urllib2
    from bs4 import BeautifulSoup

    request = urllib2.Request(url)
    request.add_header('Accept-Encoding', 'utf-8')  # tried with 'latin-1' too
    response = urllib2.urlopen(request)
    soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))

and this too: Python and BeautifulSoup encoding issues

    html = requests.get(url)
    htmlText = html.text
    soup = BeautifulSoup(htmlText)
    print soup.prettify('utf-8')

Both gave me garbage, i.e. the HTML tags were not parsed correctly. The last link also suggested the encoding might be different despite the meta charset being 'utf-8', so I tried the above with 'latin-1' too, but nothing seems to work.

Any suggestions on how I can scrape the given link for data? Please don't suggest downloading the file and using pdfminer on it. Feel free to ask for more information!

Abeer Khan

1 Answer


That's because the URL points to a document in PDF format, so interpreting it as HTML won't make any sense at all.
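A quick way to confirm this before reaching for an HTML parser is to check the resource's MIME type. Here is a minimal sketch using only the standard library's `mimetypes` module to guess the type from the URL's extension; the commented-out `requests.head` call shows how the server's own `Content-Type` header could be checked as well (that part assumes network access):

```python
import mimetypes

url = 'http://www.jmlr.org/proceedings/papers/v36/li14.pdf'

# Guess the MIME type from the URL's file extension -- no network needed.
mime, _ = mimetypes.guess_type(url)
print(mime)  # application/pdf, not text/html

# With network access, the server's declared type can be checked without
# downloading the whole body:
# import requests
# head = requests.head(url, allow_redirects=True)
# print(head.headers.get('Content-Type'))
```

Since the bytes are a PDF, feeding them to BeautifulSoup will only produce the "weird symbols" described in the question; the data has to be extracted with a PDF-aware tool instead.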

Benjamin Peterson