I am trying to scrape url = 'http://www.jmlr.org/proceedings/papers/v36/li14.pdf'. This is my code:
import requests
from bs4 import BeautifulSoup

html = requests.get(url)
htmlText = html.text
soup = BeautifulSoup(htmlText, 'html.parser')
print(soup)  # gives garbage
However, it prints weird symbols that I think are garbage. It's an HTML file, so it shouldn't be trying to parse it as a PDF, should it?
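In case it helps diagnose this, here is a small check I could run (an untested sketch, not part of my original attempts) to see whether the server is actually sending HTML or a PDF. PDF files always begin with the magic bytes %PDF:

```python
def looks_like_pdf(payload):
    # PDF files always begin with the magic bytes b'%PDF'
    return payload[:4] == b'%PDF'

# Untested sketch of how I'd use it (assumes requests is installed):
# import requests
# response = requests.get('http://www.jmlr.org/proceedings/papers/v36/li14.pdf')
# print(response.headers.get('Content-Type'))  # 'text/html' vs 'application/pdf'
# print(looks_like_pdf(response.content))

print(looks_like_pdf(b'%PDF-1.4 sample'))        # True
print(looks_like_pdf(b'<!DOCTYPE html><html>'))  # False
```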
I tried the following, from How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?:
import urllib2
from bs4 import BeautifulSoup

request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')  # tried with 'latin-1' too
response = urllib2.urlopen(request)
soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'), 'html.parser')
and this too, from Python and BeautifulSoup encoding issues:
html = requests.get(url)
htmlText = html.text
soup = BeautifulSoup(htmlText, 'html.parser')
print(soup.prettify('utf-8'))
Both gave me garbage, i.e. the HTML tags were not parsed correctly. The last link also suggested the actual encoding might be different despite the meta charset being 'utf-8', so I tried the above with 'latin-1' too, but nothing seems to work.
Any suggestions on how I can scrape the given link for data? Please don't suggest downloading the file and using pdfminer on it. Feel free to ask for more information!