3

I'm building a Python application that fetches news articles from RSS feeds. As part of the project, I decided to use boilerpipe to extract just the article content from the HTML page on which each article appears.

Although boilerpipe was originally written for Java, there is also a Python wrapper for it. You can see its page on GitHub here: https://github.com/misja/python-boilerpipe

The problem is that I get an exception when trying to import it using:

from boilerpipe.extract import Extractor

The error I get is:

Traceback (most recent call last):
  File "", line 1, in
  File "build\bdist.win32\egg\boilerpipe\extract\__init__.py", line 12, in
  File "C:\Python26\lib\site-packages\jpype\_jclass.py", line 54, in JClass
    raise _RUNTIMEEXCEPTION.PYEXC("Class %s not found" % name)
jpype._jexception.ExceptionPyRaisable: java.lang.Exception: Class de.l3s.boilerpipe.sax.HTMLHighlighter not found

What might be causing this problem and how can I fix it?

hippietrail
user1106610
  • you could parse the feed using pure Python [feedparser module](http://packages.python.org/feedparser/introduction.html) – jfs Feb 19 '12 at 19:48
  • @J.F.Sebastian Thanks. I'm actually using feedparser already which I use to actually get the articles (well, the urls to them). Once I get an article, I then want to extract just the article content from its page (excluding sidebars, menus and other random text). Based on my research, boilerpipe seems to be the best way forward for this. Unfortunately, I'm having the problem which I mentioned above with importing it into python. – user1106610 Feb 19 '12 at 19:58

5 Answers

4

This worked for me on Mac OS X 10.8.5 with Python 2.7.9:

pip install JPype1    # to install https://pypi.python.org/pypi/JPype1
pip install charade
git clone https://github.com/misja/python-boilerpipe.git
cd python-boilerpipe
sudo python setup.py install

Then you should be able to run the following in the Python console:

>>> from boilerpipe.extract import Extractor
>>> extractor = Extractor(extractor='ArticleExtractor', url="http://en.wikipedia.org/wiki/Main_Page")
>>> print extractor.getText()
asmaier
1

You are missing the boilerpipe Java package, which you can download here: http://code.google.com/p/boilerpipe/downloads/list

You have only installed the Python boilerpipe wrapper.
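One way to confirm whether a particular JAR really contains the class named in the error is to open it as a zip archive and look for the class entry. This is only a minimal sketch, assuming Python 3; the helper name `jar_has_class` is hypothetical and not part of boilerpipe or JPype:

```python
import zipfile


def jar_has_class(jar_path, class_name):
    """Return True if the JAR at jar_path contains the fully qualified class_name."""
    # A JAR is a zip archive; a class a.b.C is stored as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For example, `jar_has_class(path_to_jar, "de.l3s.boilerpipe.sax.HTMLHighlighter")` should return `True` for a complete boilerpipe JAR, where `path_to_jar` is wherever the JAR ended up on your machine. If it returns `False`, or no such JAR exists at all, the wrapper has nothing to load the class from.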

Mutant
1

The following worked best for me:

git clone https://github.com/misja/python-boilerpipe.git
cd python-boilerpipe
sudo python setup.py install

You may have to:

  • install JPype (sudo apt-get install python-jpype on Ubuntu)
  • install charade (sudo pip install charade)

But you won't have to install the boilerpipe Java JARs yourself, since the setup script loads them for you.

I tried installing python-boilerpipe from pip, but had no luck: I could run the boilerpipe Java code successfully, but kept getting this same error.
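To see which of these pieces are still missing, you can ask Python whether each module can be found before attempting the import that fails. A rough sketch, assuming Python 3 and assuming the packages import as `jpype`, `charade` and `boilerpipe`; the helper name `missing_modules` is made up for illustration:

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that Python cannot locate for import."""
    # find_spec returns None when no top-level module of that name is found.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Lists whichever of the three modules are not installed for this interpreter.
print(missing_modules(["jpype", "charade", "boilerpipe"]))
```

An empty list only means the Python side is in place; it says nothing about whether the Java classes can be loaded.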

phillipwei
  • Note that Ubuntu's jpype apt package only installs it for Python 2. To install for Python3, I had to use this (go to source at bottom): https://pypi.python.org/pypi/JPype1-py3 – sudo Jan 27 '17 at 22:22
0

I had the same issue. I looked at the setup details provided by the author of Mining the web. Here is the link to the setup script on his GitHub page for boilerpipe:

https://github.com/misja/python-boilerpipe/blob/master/setup.py

Taposh DuttaRoy
0

The class HTMLHighlighter wasn't found. Did you set your JAVA_HOME? The documentation states:

Be sure to have set JAVA_HOME properly since jpype depends on this setting.
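A quick way to check this from inside Python is to look at the environment the interpreter actually sees. The following is a sketch only, assuming Python 3; `check_java_home` is a hypothetical helper, and a passing result does not guarantee that JPype will find the boilerpipe classes:

```python
import os


def check_java_home(env=None):
    """Return (ok, message) describing whether JAVA_HOME looks usable."""
    # Allow a custom mapping to be passed in; default to the real environment.
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"
    if not os.path.isdir(java_home):
        return False, "JAVA_HOME does not point to a directory: " + java_home
    return True, "JAVA_HOME = " + java_home


print(check_java_home())
```

If this reports a valid directory and the import still fails, the problem is more likely a missing boilerpipe JAR than the Java installation itself.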

beerbajay
  • Thanks for the answer. I'm sure I have done that correctly. I created a JAVA_HOME system variable and set its value to `C:\Program Files\Java\jdk1.7.0`, which I'm pretty sure is what it should be. – user1106610 Feb 19 '12 at 19:19
  • Just to clarify; I mean that I had already done the above but it still doesn't work despite that. – user1106610 Feb 19 '12 at 19:29
  • I've spent hours trying to get this to work now. Still no luck though. – user1106610 Feb 19 '12 at 21:10
  • Hmm... I don't know much about Windows environment variables, but are you sure your Python context can see the `JAVA_HOME`? Try: `print os.environ['JAVA_HOME']` – beerbajay Feb 20 '12 at 09:15
  • Thanks, tried that. It shows me `C:\Program Files\Java\jdk1.7.0` so I guess it can see it. I'm not really sure what the problem is. If you wished to use this package, what steps would you go through after downloading it? Maybe there's something I haven't done... – user1106610 Feb 20 '12 at 09:24
  • I would download [boilerpipe](https://code.google.com/p/boilerpipe/downloads/detail?name=boilerpipe-1.2.0-bin.tar.gz), extract it to a temporary directory, copy the `boilerpipe-1.2.0.jar` file to somewhere on my classpath. If I wasn't using java a lot, this might be `$JAVA_HOME/lib/` – beerbajay Feb 20 '12 at 15:53