Questions tagged [boilerpipe]

The boilerpipe library for Java provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

77 questions
16
votes
4 answers

Accessing JVM from python

>>> import boilerpipe Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\Anaconda\lib\site-packages\boilerpipe\__init__.py", line 10, in <module> jpype.startJVM(jpype.getDefaultJVMPath(), "-Djava.class.path=%s" %…
Abhishek Bhatia
12
votes
1 answer

python-boilerpipe hangs with multiprocessing

I am trying to run boilerpipe with Python multiprocessing. I am doing this to parse RSS feeds from multiple sources. The problem is that it hangs in one of the threads after processing some links. The whole flow works if I remove the pool and run it in a…
dpatro
6
votes
4 answers

Is there a boilerpipe port for .net?

Does anybody know of a .NET port of the boilerpipe library?
aogan
5
votes
2 answers

how to extract main text from html using Tika

I just want to know how I can extract main text and plain text from HTML using Tika. Maybe one possible solution is to use BoilerPipeContentHandler, but do you have some sample/demo code to show it? Thanks very much in advance.
user2651995
5
votes
1 answer

Boilerpipe - How do I output JSON?

I am using boilerpipe and it seems great, but I want to output JSON. I am using the Java version and testing in NetBeans as follows: final URL url = new…
Wadester
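
For the JSON question above, one possible approach is to take the plain text that the extractor returns and wrap it in a JSON object yourself. The sketch below is a minimal, dependency-free illustration and not boilerpipe functionality: the class name `ExtractedTextJson`, the methods `escape` and `toJson`, and the field names `url` and `text` are all hypothetical. A real project would more likely hand the string to a JSON library.

```java
public class ExtractedTextJson {

    // Escape a string so it can be embedded in a JSON string literal.
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \\uXXXX form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Build a small JSON object; "url" and "text" are illustrative field names.
    public static String toJson(String url, String text) {
        return "{\"url\":\"" + escape(url) + "\",\"text\":\"" + escape(text) + "\"}";
    }

    public static void main(String[] args) {
        // In practice, text would be the String returned by the extractor.
        String text = "First line.\nShe said \"hello\".";
        System.out.println(toJson("http://example.com/article", text));
    }
}
```

Escaping by hand keeps the sketch self-contained; the trade-off is that any richer structure (arrays, nested objects) quickly becomes easier to manage with a dedicated JSON library.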
4
votes
1 answer

Apache Tika how to extract html body with out header and footer content

I am looking to extract the entire body content of an HTML page except the header and footer; however, I am getting the exception org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared. Below is the code that I have created as mentioned at…
Trinadh Gupta
4
votes
0 answers

How to summarize the main content of an article in a webpage?

I am trying to write an article summarizer for HTML pages. So far I have used boilerpipe and classifier4J. //url can be any url in String public String getArticleSummaryFromUrl() { private Document doc = Jsoup.connect(url).get();; String…
Pritam Banerjee
3
votes
5 answers

Trouble importing boilerpipe in python

I'm building an application using python which involves getting news articles from RSS feeds. As part of my project, I have decided to use boilerpipe in order to extract just the article content from the html page on which the article…
user1106610
3
votes
1 answer

boilerpipe web API

I would like to host my own version of the boilerpipe web API (http://code.google.com/p/boilerpipe/). The appspot site is http://boilerpipe-web.appspot.com/ I would like to self host it. Can someone give me directions on how to use the Boilerpipe…
Kiran
3
votes
1 answer

How to extract the main content from a webpage?

I am trying to write a summary of the content of a web page. For that I need to remove all the irrelevant text and data from a webpage. I have used boilerpipe, but the text extraction is not good. The results are here, where you can see a lot of…
Pritam Banerjee
3
votes
1 answer

ClassNotFoundException: org.apache.xerces.parsers.AbstractSAXParser when using boilerpipe

I am very new to boilerpipe and I am trying out the following basic code: package contentExtraction; import java.net.URL; import de.l3s.boilerpipe.extractors.ArticleExtractor; public class ContentExtractor { public static void main(String[]…
psr
3
votes
0 answers

How to avoid server error 401 (and 403) while using boilerpipe?

I use BoilerPipe for Java to extract some articles from the internet. It works on a lot of sites, but on several sites I get an HTTP 401 error, even though I don't need any authentication in my web browser. Here's an example of a site which returns…
Malik
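
For the 401/403 question above, one possible cause (an assumption, not something the question confirms) is that some servers reject requests that lack a browser-like `User-Agent` header. The sketch below shows one way to prepare such a request with the standard library, so that the HTML could be fetched separately and then passed to the extractor. The class name `BrowserLikeRequest`, the method `prepare`, and the header value are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;

public class BrowserLikeRequest {

    // Hypothetical header value; any realistic browser string could be used.
    public static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleFetcher/1.0)";

    // Prepare (but do not yet send) a request that carries a User-Agent header.
    public static URLConnection prepare(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setConnectTimeout(10_000); // milliseconds
        conn.setReadTimeout(10_000);    // milliseconds
        return conn;
    }

    public static void main(String[] args) throws IOException {
        URLConnection conn = prepare(URI.create(args[0]).toURL());
        // Calling conn.getInputStream() would perform the request; the HTML read
        // from it could then be handed to the extractor as a String or Reader.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

Whether this resolves a particular site's 401 or 403 depends on why the server refuses the request, so it is worth checking the response headers before assuming the `User-Agent` is the cause.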
3
votes
2 answers

html text extraction for php

There are a bunch of HTML text extraction tools out there, mostly for Java or Python. The one I come across most often is boilerpipe. There are a few APIs here and there, and some seem to work pretty well. Does anyone know of anything in PHP that…
Bill
2
votes
1 answer

how to use boilerpipe with a local html file?

I have an html file on my local disk and would like to extract text from it using BoilerPipe. The "getText" method from the class ExtractorBase accepts a reader, so I wrote: FileReader fr = new…
seinecle
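
For the local-file question above, a minimal sketch of opening an HTML file as a `Reader` with an explicit charset, which is the kind of object the question says `getText` accepts. `FileReader` has historically used the platform default encoding, so the sketch uses `Files.newBufferedReader` instead. The class name `LocalHtmlReader`, the helper `open`, and the choice of UTF-8 are assumptions for illustration. The boilerpipe call appears only in a comment because it needs the library on the classpath.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LocalHtmlReader {

    // Open the file as a character stream; UTF-8 is an assumption about the file.
    public static Reader open(Path htmlFile) throws IOException {
        return Files.newBufferedReader(htmlFile, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path htmlFile = Paths.get(args[0]);
        try (Reader reader = open(htmlFile)) {
            // With boilerpipe on the classpath, the reader would be passed on here, e.g.
            // String text = ArticleExtractor.INSTANCE.getText(reader);
            // Without it, this sketch just counts the characters read.
            char[] buf = new char[4096];
            int total = 0;
            for (int n; (n = reader.read(buf)) != -1; ) {
                total += n;
            }
            System.out.println(total + " characters read");
        }
    }
}
```

If the file's encoding is not UTF-8, the charset passed to `Files.newBufferedReader` would need to match it, otherwise non-ASCII characters in the extracted text may come out garbled.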
2
votes
1 answer

has boilerpipe any restriction at all?

I want to use boilerpipe to scrape all articles (news) of a site for data mining purposes. The boilerpipe demo page notes: "Due to heavy use of this free service in the past, the number of requests per user is limited." Has boilerpipe…
afruzan