
I am reading text from html files and doing some analysis. These .html files are news articles.

Code:

 import nltk
 from unidecode import unidecode

 html = open(filepath, 'r').read()
 raw = nltk.clean_html(html)
 raw = unidecode(raw.decode('utf8'))

Now I just want the article content and not the rest of the text like advertisements, headings, etc. How can I do so relatively accurately in Python?

I know of some tools like Jsoup (a Java API) and boilerpipe, but I want to do this in Python. I could find some techniques using bs4, but they're limited to one type of page, and I have news pages from numerous sources. Also, there is a dearth of sample code examples.

I am looking for something exactly like this http://www.psl.cs.columbia.edu/wp-content/uploads/2011/03/3463-WWWJ.pdf in python.

EDIT: To better understand, please write a sample code to extract the content of the following link http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general

Sachin D
Abhishek Bhatia

6 Answers


Newspaper is becoming increasingly popular. I've only used it superficially, but it looks good. Note that it's Python 3 only.

The quickstart only shows loading from a URL, but you can load from an HTML string with:

import newspaper

# load the HTML into a string from your file first
html = open(filepath).read()

article = newspaper.Article('')  # a string is required as the `url` argument but is not used
article.set_html(html)
article.parse()  # populates article.text, article.title, etc.
Harry
    This is a good approach but note that you have to provide a URL to Article() so better to use `Article(url='http://example.com/test-url')`. The article won't need to be downloaded as you then use `article.set_html(html)` to set the html locally. – Alan Buxton Sep 28 '17 at 21:21
    You're absolutely right, the API has changed since I wrote this and requires a URL string as a positional argument, it's come up in this issue https://github.com/codelucas/newspaper/issues/291. You can instantiate with a blank string (`article = newspaper.Article('')`) without any problems, that may be clearer than setting a real URL if it won't be downloaded and parsed. – Harry Oct 02 '17 at 19:17
  • @AlanBuxton I have tried that, but I'm getting an error when I try to parse it (`article.parse()`), and if I do not parse it I cannot extract the text and title from the article. How can I work around that? – taga Dec 04 '20 at 17:14
  • @AlanBuxton, how did you solve above issue ? please – tursunWali Jan 27 '21 at 01:32
  • Hi @tursunWali I have always found that newspaper works very well. Can you give some more info on the issue you are having? – Alan Buxton Jan 28 '21 at 06:16
  • Alan Buxton, I meant this "I have tried that, but im getting the error when i try to parse it (article.parse()), and if i do not parse it I can not extract text and title from article. How can I override that? – taga ". I am using newsletter3k, works with Python3. I want to work with news text I stored locally. How to do it, I wonder. – tursunWali Jan 30 '21 at 22:50
  • @tursunWali `input_html` argument of `download` method could help in your case ... Example (`body` is the html content as a string): `article = Article("https://www.example.com"); article.download(input_html=body); article.parse()` This is the code where you see how `input_html` is processed: https://github.com/codelucas/newspaper/blob/master/newspaper/article.py#L189-L200 – Marek Nov 26 '21 at 21:33

There are libraries for this in Python too :)

Since you mentioned Java, there's a Python wrapper for boilerpipe that allows you to directly use it inside a python script: https://github.com/misja/python-boilerpipe

If you want to use purely Python libraries, there are two options:

https://github.com/buriy/python-readability

and

https://github.com/grangier/python-goose

Of the two, I prefer Goose; however, be aware that recent versions of it sometimes fail to extract text for some reason (my recommendation is to use version 1.0.22 for now).

EDIT: here's a sample code using Goose:

from goose import Goose
from requests import get

response = get('http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
text = article.cleaned_text
oxymor0n
  • This is a really nice library to know about, which also uses NLTK. For those who want to install it: on PyPI it goes as goose-extractor, because the name Goose belongs to another, unrelated tool. – Roman Susi May 20 '15 at 18:31
  • Can you check the code you provided above. It doesn't seem to work. text is empty. – Abhishek Bhatia May 20 '15 at 18:41
  • Use version 1.0.22 of Goose. As I've said, the new versions have some annoying bugs that prevent them from extracting some content, NYTimes being one of them :( – oxymor0n May 20 '15 at 18:42
  • Thanks! This works very well. But at times I get an error : IOError: Couldn't open file C:\Python27\lib\site-packages\goose_extractor-1.0.22-py2.7.egg\goose\resources\text\stopwords-ut.txt . I couldn't find any reference to it online. Can you please help. – Abhishek Bhatia May 20 '15 at 19:26
  • This is another annoying bug of Goose. They fixed it in recent versions, but those are the versions that couldn't extract from NYTimes :( I have a fork of python-goose that does both, which you can access at https://github.com/agolo/python-goose/ – oxymor0n May 20 '15 at 19:31
  • Actually, use this fork instead: it's a more up-to-date version than mine https://github.com/robmcdan/python-goose – oxymor0n May 20 '15 at 19:40
  • I haven't used git much. Just to clarify I should reinstall Goose using this now instead `git clone https://github.com/robmcdan/python-goose`? – Abhishek Bhatia May 20 '15 at 19:43
  • Uninstall your current Goose, and then do this in the terminal: `pip install git+git://github.com/robmcdan/python-goose.git` – oxymor0n May 20 '15 at 19:45
  • I am using Python 2.7.9 and have tried `git clone https://github.com/agolo/python-goose`, but the above code doesn't work. It doesn't print anything with `print text.encode('ascii','ignore')`. Can you please check. – Abhishek Bhatia May 20 '15 at 22:37
  • Try the new code, which uses `requests` to fetch the HTML instead of relying on Goose's internal function – oxymor0n May 20 '15 at 22:46
  • Thanks again! Please check it doesn't work for this http://tribune.com.pk/story/773657/pti-multan-administration-trade-blame-after-8-people-killed-in-qasim-bagh-stampede/ – Abhishek Bhatia May 20 '15 at 22:57
  • Oh sorry! I was mistaken. At times the text I get is devoid of any " " (spaces), as you will notice in the above link. In my case I especially want to use such text. Is there any way to fix it? – Abhishek Bhatia May 20 '15 at 23:01
  • I don't really get what you are saying. The output that I get looks fine. – oxymor0n May 20 '15 at 23:09
  • I wanted to ask you about the encoding of the final text. Do I need perform some preprocessing on it? – Abhishek Bhatia May 20 '15 at 23:10
  • It seems at times I have use .encode('ascii','ignore') but this removes " " in the text sometimes. – Abhishek Bhatia May 20 '15 at 23:11
  • this is a whole other discussion re: how to handle unicode in python 2, which is not the scope of this thread. I suggest you read up on that first (a good resource is http://nedbatchelder.com/text/unipain.html), and make new questions if you still can't make it work. – oxymor0n May 20 '15 at 23:13
  • and yes, some of the spaces in the output (which is unicode) are non-breaking space characters, which will get removed if you use `.encode('ascii','ignore')` – oxymor0n May 20 '15 at 23:16
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/78365/discussion-on-answer-by-oxymor0n-extract-news-article-content-from-stored-html). – Taryn May 21 '15 at 01:13
  • @oxymor0n Can you please help me with this question http://stackoverflow.com/questions/30381944/read-article-content-using-goose-retrieving-nothing – Abhishek Bhatia May 21 '15 at 23:59

Try something like this by visiting the page directly:

##Import modules
from bs4 import BeautifulSoup
import urllib2


##Grab the page
url = 'http://www.example.com'
req = urllib2.Request(url)
page = urllib2.urlopen(req)
content = page.read()
page.close()

##Prepare
soup = BeautifulSoup(content, 'html.parser')

##Parse (a table, for example)
for link in soup.find_all("table", {"class": "myClass"}):
    pass  # ...do something with each table...

If you want to load a file, just replace the part where you grab the page with the file instead. Find out more here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
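For instance, the URL fetch above can be swapped for an HTML string read from disk; the rest of the parsing is unchanged (the inline HTML and the `story` class below are made up for illustration):

```python
from bs4 import BeautifulSoup

# the HTML can come from a file just as easily as from a URL,
# e.g. html = open('article.html').read()
html = '<html><body><p class="story">First.</p><p class="story">Second.</p></body></html>'

soup = BeautifulSoup(html, 'html.parser')
texts = [p.get_text() for p in soup.find_all('p', {'class': 'story'})]
print(texts)  # ['First.', 'Second.']
```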

datasci

There are many ways to organize HTML scraping in Python. As said in other answers, the #1 tool is BeautifulSoup, but there are others.

There is no universal way of finding the content of an article. HTML5 has an `article` tag hinting at the main text, and it may be possible to tune scraping for pages from specific publishing systems, but there is no general way to accurately guess the text's location. (Theoretically, a machine could deduce the page structure by looking at several structurally identical articles with different content, but that is probably out of scope here.)
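To illustrate the `<article>` hint with nothing but the standard library (Python 3; a sketch that only works on pages that actually use the tag, which many do not):

```python
from html.parser import HTMLParser

class ArticleTextParser(HTMLParser):
    """Collects text that appears inside <article> tags."""
    def __init__(self):
        super().__init__()
        self.depth = 0    # how many nested <article> tags we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'article':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'article' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

parser = ArticleTextParser()
parser.feed('<html><body><nav>Menu</nav>'
            '<article><h1>Title</h1><p>Body text.</p></article>'
            '</body></html>')
print(' '.join(parser.chunks))  # Title Body text.
```

Text outside the `<article>` element (the `<nav>` menu here) is dropped, which is the whole point of the hint.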

Also Web scraping with Python may be relevant.

Pyquery example for NYT:

from pyquery import PyQuery as pq
url = 'http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general'
d = pq(url=url)
text = d('.story-content').text()
Roman Susi
  • What about packages like the one @oxymor0n posted? What do you think about their accuracy? – Abhishek Bhatia May 20 '15 at 23:07
  • @AbhishekBhatia the answer and goose-extractor was new to me, so it may work and seems to be nearest to your specs, but hard to say without testing. I guess, his answer is the best one here. Please, accept it if nothing better comes. – Roman Susi May 21 '15 at 03:41

I can highly recommend using Trafilatura. Super easy to implement and it's fast!

import trafilatura
url = 'https://www.example.com'
downloaded = trafilatura.fetch_url(url)
article_content = trafilatura.extract(downloaded)

Which gives:

'This domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\nMore information...'

You can also give it the HTML directly, like this:

trafilatura_text = trafilatura.extract(html, include_comments=False)

If you're interested in more fields, like authors / publication date, you can use bare_extraction:

import trafilatura
url = 'https://www.example.com'
downloaded = trafilatura.fetch_url(url)
trafilatura.bare_extraction(downloaded, include_links=True)

Which will give you:

{'title': 'Example Domain',
 'author': None,
 'url': None,
 'hostname': None,
 'description': None,
 'sitename': None,
 'date': None,
 'categories': [],
 'tags': [],
 'fingerprint': None,
 'id': None,
 'license': None,
 'body': None,
 'comments': '',
 'commentsbody': None,
 'raw_text': None,
 'text': 'This domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\nMore information...'}
Muriel
  • your answer is the only one that works, all the other fancy tools like scrapy or whatever just don't do the job. – Gary Allen Apr 15 '23 at 13:19

You can use htmllib or HTMLParser to parse your HTML file:

from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag
    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag
    def handle_data(self, data):
        print "Encountered some data  :", data

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

Sample code taken from the HTMLParser documentation page.

YoungerDryas