Questions tagged [jericho-html-parser]

Jericho HTML Parser is a Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

It is an open source library released under both the Eclipse Public License (EPL) and GNU Lesser General Public License (LGPL). You are therefore free to use it in commercial applications subject to the terms detailed in either one of these licence documents.

Features:

  • The presence of badly formatted HTML does not interfere with the parsing of the rest of the document, which makes the library ideal for use with "real-world" HTML that chokes other parsers.
  • ASP, JSP, PSP, PHP and Mason server tags are explicitly recognised by the parser. This means that normal HTML is still parsed properly even if there are server tags inside it, which is common, for example, when dynamically setting element attributes.
  • A stream based parsing option using the StreamedSource class, which allows memory efficient processing of large files using an event iterator. This is essentially a StAX alternative with the ability to process HTML and non-validating XML, as well as several other features not available in other streaming parsers.
  • In its standard form it is neither an event nor tree based parser, but rather uses a combination of simple text search, efficient tag recognition and a tag position cache. The text of the whole source document is first loaded into memory, and then only the relevant segments searched for the relevant characters of each search operation.
  • Compared to a tree based parser such as DOM, the memory and resource requirements can be far better if only small sections of the document need to be parsed or modified. Incorrect or badly formatted HTML can easily be ignored, unlike tree based parsers which must identify every node in the document from top to bottom.
  • Compared to an event based parser such as SAX, the interface is on a much higher level and more intuitive, and a tree representation of the document element hierarchy is easily created if required.
  • The begin and end positions in the source document of all parsed segments are accessible, allowing modification of only selected segments of the document without having to reconstruct the entire document from a tree.
  • The row and column number of each position in the source document are easily accessible.
  • Provides a simple but comprehensive interface for the analysis and manipulation of HTML form controls, including the extraction and population of initial values, and conversion to read-only or data display modes. Analysis of the form controls also allows data received from the form to be stored and presented in an appropriate manner.
  • Custom tag types can be easily defined and registered for recognition by the parser.
  • Built-in functionality to extract all text from HTML markup, suitable for feeding into a text search engine such as Apache Lucene.
  • Built-in functionality to render HTML markup with simple text formatting.
  • Built-in functionality to format HTML source code that indents elements according to their depth in the document element hierarchy.
  • Built-in functionality to compact HTML source code by removing all unnecessary white space.
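
The position-based approach described in the features above (text search over the in-memory source, begin and end positions for each segment, and row and column lookup) can be illustrated with a minimal sketch in plain Java. This is not Jericho's implementation and does not use its API; the class `SegmentEditSketch` and its methods are hypothetical names introduced only for illustration, and the regular expression is a deliberately naive stand-in for real tag recognition. In Jericho itself, the corresponding work is done by classes such as `Source`, `Segment` and `OutputDocument`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java illustration of position-based editing. Hypothetical; not Jericho's API. */
final class SegmentEditSketch {

    /**
     * Returns {begin, end} (end exclusive) of the content of the first
     * element with the given name, or null if no such element is found.
     * Assumption: a naive regex is good enough for this illustration.
     */
    static int[] findElementContent(String source, String name) {
        String quoted = Pattern.quote(name);
        Pattern p = Pattern.compile(
                "<" + quoted + "\\b[^>]*>(.*?)</" + quoted + "\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(source);
        return m.find() ? new int[] { m.start(1), m.end(1) } : null;
    }

    /**
     * Replaces only the located span. Everything outside the span,
     * including invalid markup, is copied through unchanged.
     */
    static String replaceElementContent(String source, String name, String replacement) {
        int[] span = findElementContent(source, name);
        if (span == null) {
            return source;
        }
        return new StringBuilder(source).replace(span[0], span[1], replacement).toString();
    }

    /** Converts a character position into a 1-based {row, column} pair. */
    static int[] rowColumn(String source, int pos) {
        int row = 1;
        int col = 1;
        for (int i = 0; i < pos; i++) {
            if (source.charAt(i) == '\n') {
                row++;
                col = 1;
            } else {
                col++;
            }
        }
        return new int[] { row, col };
    }

    public static void main(String[] args) {
        // The <b> element is never closed; the edit below does not care.
        String src = "<html>\n<head><title>Old</title></head>\n"
                + "<body><p>unclosed <b>markup</body>\n</html>";
        int[] span = findElementContent(src, "title");
        int[] rc = rowColumn(src, span[0]);
        System.out.println("title content starts at row " + rc[0] + ", column " + rc[1]);
        System.out.println(replaceElementContent(src, "title", "New"));
    }
}
```

Because the edit touches only the located character range, the unclosed `<b>` in the body is reproduced verbatim, which is the property the feature list contrasts with tree-based parsers that must model every node before writing the document back out.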

Official Website: http://jericho.htmlparser.net/

51 questions
6 votes, 1 answer

jTidy and TagSoup documentation

I'm looking for documentation (official documentation, if possible) for the TagSoup and jTidy libraries. I want to use these libraries to manipulate html "tagsoup" files that include xml tags with different namespaces mixed between html (html, xhtml…
angelcervera
5 votes, 1 answer

JSP and HTML parser for JAVA

I have been using Jsoup for parsing my HTML files and so far it does a great job. However, it's not able to parse any server tags ( <% ... %> ). I decided to extend it but I cannot find an easy way to extend its Parser and all those private/package…
Karl Cheng
4 votes, 3 answers

Pretty print ("indentation-only") HTML documents in Java (no JTidy)

We're generating HTML files with Apache's Velocity generic template engine. The generated HTML is kind of ugly and lacks correct indentation. In my case I've got the HTML stored in a String which I want to manipulate so that it looks…
Martin
3 votes, 1 answer

How do I look for a custom start tag using Jericho in Java?

As the title says, I'm trying to match a non-standard StartTagType in the form of How would I do this with Jericho? Edit: I have created the following custom StartTagType: PrimoResultStartTagType primoSTT = new…
Karan
2 votes, 1 answer

How to parse between two comments with Jericho?

I would like to be able to parse any and all text between two comment tags using Jericho. For example, abc 123 would return abc 123. Is that at all possible?
atrox_
2 votes, 2 answers

How to get text & Other tags between specific tags using Jericho HTML parser?

I have an HTML file which contains a specific tag, e.g. and the end tag is . Now I want to get everything between those tags. I am using Jericho HTML parser in Java to parse the HTML. Is it possible to get the text &…
insomiac
2 votes, 3 answers

How do I convert a Windows-1251 text to something readable?

I have a string, which is returned by the Jericho HTML parser and contains some Russian text. According to source.getEncoding() and the header of the respective HTML file, the encoding is Windows-1251. How can I convert this string to something…
Glory to Russia
2 votes, 2 answers

Text Extraction from HTML using Java including source line number and code

The question of how to extract text from HTML using Java has been viewed and duplicated a zillion times: "Text Extraction from HTML Java". Thanks to the answers found on Stack Overflow, my current state of affairs is that I am using JSoup