Getting only the data out of an HTML page in Python

Asked May 07 '13 at 05:26

Active May 07 '13 at 05:26

Viewed 118 times

I'm working on a project for which I need the content of an HTML page, given its URL.

I'm doing something like this:

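import urllib.request

from bs4 import BeautifulSoup

# url is assumed to hold the address of the page to fetch
with urllib.request.urlopen(url) as con:
    a = con.read()

# Name the parser explicitly instead of letting BeautifulSoup pick one
soup = BeautifulSoup(a, "html.parser")

print(soup.get_text())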

But the problem is that I'm getting all the JavaScript and other things as well. I just need the displayed content of the webpage. Any pointers on how to go about it?


ashish g


`Im getting all the java script`: if the page contents are being loaded via AJAX, `BeautifulSoup` won't help you much. Otherwise, try reading only the `` tag. – Bibhas Debnath May 07 '13 at 05:36
If you need the JS to be run, use [PhantomJS](http://phantomjs.org/) to fetch the website. – Jakub M. May 07 '13 at 05:40
Or Selenium, to stay in Python. – nnaelle May 08 '13 at 14:41
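
For a page whose content is already in the HTML (not loaded via AJAX), one way to drop the scripts is to skip the non-displayed elements while collecting text. Below is a minimal sketch of that idea using only the standard library's `html.parser` rather than BeautifulSoup. The names `VisibleTextParser`, `visible_text` and `sample` are made up for illustration, and the set of skipped tags is an assumption about what counts as "not displayed". With BeautifulSoup, the same idea would probably be to remove the `script` and `style` elements from the soup before calling `get_text()`.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping anything inside non-displayed elements."""

    # Assumption: these elements never contribute to the displayed content.
    SKIPPED = {"head", "script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def visible_text(html):
    """Return the text of an HTML string without script/style content."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample page, standing in for the bytes read from the URL.
sample = """<html><head><title>Demo</title>
<style>p { color: red; }</style>
<script>var x = 1;</script></head>
<body><h1>Hello</h1><p>Some <b>text</b>.</p>
<script>console.log("inline");</script></body></html>"""

print(visible_text(sample))
```

In the question's code, the bytes in `a` would need to be decoded first (for example `a.decode("utf-8")`, assuming the page is UTF-8) before being passed to `visible_text`. This does not run any JavaScript, so content loaded via AJAX would still be missing.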


0 Answers