Questions tagged [htmlcleaner]

HtmlCleaner is an open-source HTML parser written in Java.

HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to tags, attributes and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those that most web browsers use to create the Document Object Model. However, the user may provide a custom tag and rule set for tag filtering and balancing.
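
To make the idea of tag balancing concrete, here is a minimal sketch in plain Java. It is not HtmlCleaner's actual algorithm or API: the class `TagBalancerSketch` and its `balance` method are hypothetical names, and the approach (a stack of open tags driven by a simple regular expression) is a deliberately simplified stand-in. It ignores comments, void elements such as `<br>` written without a slash, and any browser-like reordering rules.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the HtmlCleaner library.
public class TagBalancerSketch {
    // Matches simple start/end tags such as <b>, </b> or <br/>; attributes are kept verbatim.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>");

    /** Closes unclosed tags and drops stray end tags in a small HTML fragment. */
    public static String balance(String html) {
        Deque<String> open = new ArrayDeque<>(); // currently open tags, innermost first
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start()); // copy the text between tags
            last = m.end();
            String name = m.group(2).toLowerCase();
            boolean isEnd = !m.group(1).isEmpty();
            boolean selfClosing = m.group(3).endsWith("/");
            if (!isEnd) {
                out.append('<').append(name).append(m.group(3)).append('>');
                if (!selfClosing) {
                    open.push(name);
                }
            } else if (open.contains(name)) {
                // Close everything opened after the matching start tag, then the tag itself.
                String top;
                do {
                    top = open.pop();
                    out.append("</").append(top).append('>');
                } while (!top.equals(name));
            }
            // Otherwise: a stray end tag with no matching start tag is dropped.
        }
        out.append(html.substring(last));
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>'); // close whatever is still open
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(balance("<div><b>Hello <i>world</b></span>"));
    }
}
```

For the sample input in `main`, the unclosed `<i>` and `<div>` are closed and the stray `</span>` is dropped. A real cleaner applies far richer, configurable rules per tag.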

HtmlCleaner can be used in Java code, as a command-line tool or as an Ant task. It is designed to be small, independent (no runtime dependencies except JRE 1.5+), fast and flexible (its behavior is configurable through a number of parameters). Although the main motive was to prepare ordinary HTML for XML processing with XPath, XQuery and XSLT, structured data produced by HtmlCleaner may be consumed and handled in many other ways.
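
Once markup is well-formed XML, standard XML tooling can query it. The sketch below assumes a string that has already been cleaned (the sample input is hand-written for illustration, not actual HtmlCleaner output) and uses only the JDK's DOM parser and `javax.xml.xpath`. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; uses JDK classes, not the HtmlCleaner API.
public class CleanedXmlXPathSketch {

    /** Returns the href value of every <a> element in a well-formed XML document. */
    public static List<String> linkTargets(String wellFormedXml) {
        try {
            // Parse the (already well-formed) markup into a W3C DOM document.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXml)));

            // Select all href attributes of <a> elements, in document order.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                result.add(hrefs.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            // Ill-formed input ends up here, which is why cleaning has to come first.
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String cleaned = "<html><body><a href=\"/a\">A</a><p><a href=\"/b\">B</a></p></body></html>";
        System.out.println(linkTargets(cleaned));
    }
}
```

Feeding the same method raw, unbalanced HTML would typically throw, which illustrates why a cleaning step is needed before XPath, XQuery or XSLT processing.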

Features:

  • HtmlCleaner parses input HTML and generates a tree structure suitable for programmatic manipulation.
  • Serializers are responsible for outputting the DOM structure to XML, HTML, DOM or JDOM.
  • The parsing phase relies on tag descriptions, which can be customized by the user.
  • HtmlCleaner's behavior can be configured through a number of parameters.
  • HtmlCleaner is thread-safe, meaning that a single instance can clean multiple HTML sources at the same time.
  • HtmlCleaner can be used from Java code, from the command line or as an Ant task.
  • HtmlCleaner requires JRE 1.5+.

Official Website: http://htmlcleaner.sourceforge.net/

96 questions
7 votes, 3 answers

Getting cleaned HTML in text from HtmlCleaner

I want to see the cleaned HTML that we get from HtmlCleaner. I see there is a method called serialize on TagNode, but I don't know how to use it. Does anybody have any sample code for it? Thanks, Nayn
Nayn
7 votes, 2 answers

web scraping java beginner

I am new to Java and would like to become really good at web scraping and parsing data. Are there any sites related to web scraping that would help me understand how APIs like htmlcleaner, web-harvest and htmlparser work? I'm still not proficient…
scorpy
6 votes, 2 answers

Tidy HTML output with JavaScript

I have a large chunk of HTML. In order for it to fit a certain container, I crop the HTML (not just the text) at, let’s say, 200 characters. Obviously, some of the tags will remain unclosed in this case. Is there a way, except writing the cleaner…
spliter
5 votes, 7 answers

How to add matching start tag in HTML

I have HTML content which looks like:
Hello world
New day
I would like to parse this HTML snippet and add a starting div tag before Hello. What approach could I follow? I tried to use HtmlCleaner but it didn't…
Thunderhashy
4 votes, 3 answers

XPath expression: Getting elements even if they don't exist

I have this XPath expression that I'm putting into HtmlCleaner: //table[@class='StandardTable']/tbody/tr[position()>1]/td[2]/a/img Now, my issue is that it changes, and sometimes the /a/img element is not present. So I would like an expression…
Nacht
4 votes, 1 answer

Remove MS Word "HTML" using PHP

Possible Duplicate: "What is the best free way to clean up Word HTML?", "PHP to clean-up pasted Microsoft input". I allow clients to enter notes in a rich text editor, and have only recently upgraded to ckEditor 3x, which strips MS word classes,…
a coder
4 votes, 2 answers

HTMLCLEANER handle Spanish characters

I am using the HtmlCleaner library to parse/convert HTML files in Java. It seems that it is not able to handle Spanish characters like 'ÁáÉéÍíÑñÓóÚúÜü'. Is there any property which I can set in HtmlCleaner for handling this, or any other solution?…
choop
3 votes, 1 answer

How to get a value of element with HTMLcleaner

I am trying to get the value of the elements "a" and "span" using HtmlCleaner.

Tron 2001

here is the…
TT_from_KZ
3 votes, 4 answers

What library to use for building HTML documents?

Could anybody please recommend libraries that do the opposite of what these libraries do: HtmlCleaner, TagSoup, HtmlParser, HtmlUnit, jSoup, jTidy, nekoHtml, WebHarvest or Jericho? I need to build HTML pages, build the DOM model from…
lisak
3 votes, 1 answer

How can I remove content from