
I am trying to scrape this url = 'http://www.jmlr.org/proceedings/papers/v36/li14.pdf'. This is my code:

    import requests
    from bs4 import BeautifulSoup

    html = requests.get(url)
    htmlText = html.text
    soup = BeautifulSoup(htmlText)
    print soup  # gives garbage

However, it gives weird symbols that I think are garbage. It's an HTML file, so it shouldn't be trying to parse it as a PDF, should it?

I tried the following: How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?

    import urllib2
    from bs4 import BeautifulSoup

    request = urllib2.Request(url)
    request.add_header('Accept-Encoding', 'utf-8')  # tried with 'latin-1' too
    response = urllib2.urlopen(request)
    soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))

and this too: Python and BeautifulSoup encoding issues

    html = requests.get(url)
    htmlText = html.text
    soup = BeautifulSoup(htmlText)
    print soup.prettify('utf-8')

Both gave me garbage, i.e. the HTML tags were not parsed correctly. The last link also suggested the encoding might be different despite the meta charset being 'utf-8', so I tried the above with 'latin-1' too, but nothing seems to work.

Any suggestions on how I can scrape the given link for data? Please don't suggest downloading the file and using pdfminer on it. Feel free to ask for more information!

Abeer Khan

1 Answer


That's because the URL points to a document in PDF format, so interpreting it as HTML won't make any sense at all.
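A quick way to confirm this before reaching for an HTML parser is to check the resource's MIME type. Here is a minimal sketch using only the standard library's `mimetypes` module to guess the type from the URL's extension; the commented-out `requests.head` call shows how the server's own `Content-Type` header could be checked as well (that part assumes network access):

```python
import mimetypes

url = 'http://www.jmlr.org/proceedings/papers/v36/li14.pdf'

# Guess the MIME type from the URL's file extension -- no network needed.
mime, _ = mimetypes.guess_type(url)
print(mime)  # application/pdf, not text/html

# With network access, the server's declared type can be checked without
# downloading the whole body:
# import requests
# head = requests.head(url, allow_redirects=True)
# print(head.headers.get('Content-Type'))
```

Since the bytes are a PDF, feeding them to BeautifulSoup will only produce the "weird symbols" described in the question; the data has to be extracted with a PDF-aware tool instead.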

Benjamin Peterson