I am reading text from html files and doing some analysis. These .html files are news articles.
Code:
html = open(filepath,'r').read()
raw = nltk.clean_html(html)
raw.unidecode(item.decode('utf8'))
Now I just want the article content and not the rest of the text like advertisements, headings etc. How can I do so relatively accurately in python?
I know some tools like Jsoup(a java api) and bolier but I want to do so in python. I could find some techniques using bs4 but there limited to one type of page. And I have news pages from numerous sources. Also, there is dearth of any sample code example present.
I am looking for something exactly like this http://www.psl.cs.columbia.edu/wp-content/uploads/2011/03/3463-WWWJ.pdf in python.
EDIT: To better understand, please write a sample code to extract the content of the following link http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general