TagSoup is a SAX-compliant parser written in Java that parses HTML as it is found in the wild.
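Because TagSoup exposes the standard SAX `XMLReader` interface, code written against that interface should not need to know which parser sits behind it. The following is a minimal sketch of that idea, not TagSoup's own sample code: the class `SaxLinkCollector` and its methods are hypothetical names, and the JDK's built-in reader stands in for TagSoup so the sketch runs without extra libraries. With TagSoup on the classpath, `new org.ccil.cowan.tagsoup.Parser()` would presumably be passed in instead, which is what lets malformed HTML through.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of TagSoup.
final class SaxLinkCollector {

    /** Collects the href value of every <a> element the reader reports. */
    static List<String> collectLinks(XMLReader reader, String markup) throws Exception {
        final List<String> links = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // qName is checked because localName can be empty when the
                // reader is not namespace-aware.
                if ("a".equalsIgnoreCase(qName)) {
                    String href = atts.getValue("href");
                    if (href != null) {
                        links.add(href);
                    }
                }
            }
        });
        reader.parse(new InputSource(new StringReader(markup)));
        return links;
    }

    /** Assumption: the JDK reader stands in for TagSoup's Parser here. */
    static XMLReader standInReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        String markup = "<html><body><a href=\"https://example.com/a\">A</a>"
                + "<a href=\"/b\">B</a></body></html>";
        // Prints [https://example.com/a, /b]
        System.out.println(collectLinks(standInReader(), markup));
    }
}
```

The stand-in reader only accepts well-formed input; the handler code is the part that would carry over unchanged.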
Questions tagged [tag-soup]
38 questions
13
votes
5 answers
How to get an attribute from an XMLReader
I have some HTML that I'm converting to a Spanned using Html.fromHtml(...), and I have a custom tag that I'm using in it:
So I've implemented a TagHandler to handle this custom tag, like so:
public void handleTag( boolean…

Jason Robinson
- 31,005
- 19
- 77
- 131
6
votes
1 answer
jTidy and TagSoup documentation
I'm looking for documentation (official documentation, if possible) for the TagSoup and jTidy libraries.
I want to use these libraries to manipulate HTML "tag soup" files that include XML tags with different namespaces mixed into the HTML (html, xhtml…

angelcervera
- 3,699
- 1
- 40
- 68
5
votes
3 answers
TagSoup fails to parse HTML document from a StringReader (Java)
I have this function:
private Node getDOM(String str) throws SearchEngineException {
DOMResult result = new DOMResult();
try {
XMLReader reader = new Parser();
…

zajcev
- 303
- 2
- 7
4
votes
1 answer
XPath Expression returns nothing for //element, but //* returns a count
I'm using XOM with the following sample data:
Element root = cleanDoc.getRootElement();
//find all the bold elements, as those mark institution and clinic.
Nodes nodes = root.query("//*");

Stefan Kendall
- 66,414
- 68
- 253
- 406
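The excerpt above does not show the cause, but one frequent reason for this symptom is namespaces: TagSoup reports elements in the XHTML namespace by default, and in XPath 1.0 an unprefixed name such as `b` only matches elements in no namespace. The sketch below reproduces the effect with the JDK's JAXP classes rather than XOM, on hand-written XHTML; the class `NamespacedXPathDemo` and the sample markup are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses JAXP, not XOM as in the question.
final class NamespacedXPathDemo {

    /** Parses well-formed XML into a namespace-aware DOM document. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Returns how many nodes the XPath expression selects in the document. */
    static int count(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Sample data: every element lives in the XHTML namespace.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<b>Clinic</b><b>Institution</b></body></html>";
        Document doc = parse(xhtml);

        System.out.println(count(doc, "//*"));                    // prints 4
        System.out.println(count(doc, "//b"));                    // prints 0
        System.out.println(count(doc, "//*[local-name()='b']"));  // prints 2
    }
}
```

Matching on `local-name()` sidesteps the namespace; binding a prefix to `http://www.w3.org/1999/xhtml` and querying with that prefix is the other usual option.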
4
votes
1 answer
Using a SAX parser when I need a DocumentBuilder
XMLBeam is a nice XML to POJO unmarshaler (via XPath), but it only allows you to configure a DocumentBuilder or DocumentBuilderFactory.
TagSoup is a nice SAX parser that lets you parse nasty HTML documents as though they were XML.
I would like to…

Neil McGuigan
- 46,580
- 12
- 123
- 152
4
votes
2 answers
Extract URL from href-tag in Groovy
I need to parse a malformed HTML page and extract certain URLs from it as any kind of Collection.
I don't really care what kind of Collection, I just need to be able to iterate over it.
Let's say we have a structure like this:
…

Jakunar
- 156
- 1
- 2
- 7
3
votes
1 answer
Wrap a tag around plain html text
I have this structure in my HTML document:
"You began the evening well, Charlotte," said Mrs. Bennet with civil self-command to Miss Lucas. "You were Mr. Bingley's first choice."
But I need my "plain…
Richard
- 14,427
- 9
- 57
- 85
3
votes
1 answer
TagSoup and XPath
I'm trying to use TagSoup with XPath (JAXP). I know how to obtain a SAX parser (or XMLReader) from TagSoup, but I couldn't find out how to create a DocumentBuilder that uses that SAX parser. How do I do that?
Thank you.
EDIT: Sorry for being so general…

IgorY
- 535
- 4
- 7
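One commonly used route for the situation described above avoids DocumentBuilder altogether: an identity `Transformer` copies the SAX events from any `XMLReader` into a `DOMResult`, and the resulting DOM can then be queried with JAXP XPath. The sketch below is a hedged illustration, not an answer taken from the thread. `SaxToDomXPath` and its methods are hypothetical names, and the JDK's own reader stands in for TagSoup so the code runs without extra libraries; with TagSoup available, `new org.ccil.cowan.tagsoup.Parser()` would presumably be passed as the reader.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper for illustration; not part of TagSoup or JAXP.
final class SaxToDomXPath {

    /** Builds a DOM tree from the SAX events that the given reader emits. */
    static Node toDom(XMLReader reader, String markup) throws Exception {
        // A Transformer created without a stylesheet performs an identity copy.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(markup)));
        identity.transform(source, result);
        return result.getNode();
    }

    /** Returns how many nodes the XPath expression selects under the node. */
    static int count(Node dom, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate(expression, dom, XPathConstants.NODESET);
        return hits.getLength();
    }

    /** Assumption: the JDK reader stands in for TagSoup's Parser here. */
    static XMLReader standInReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        String markup = "<html><body><p>one</p><p>two</p></body></html>";
        Node dom = toDom(standInReader(), markup);
        System.out.println(count(dom, "//p")); // prints 2
    }
}
```

The sample markup carries no namespace, so the unprefixed `//p` matches; output produced by a reader that places elements in a namespace would need a namespace-aware expression instead.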
3
votes
3 answers
Strange behavior with TagSoup and Groovy's XmlSlurper
Let's say I want to parse the phone number from an XML string like this:
str = """
123 New York, NY 10019
…
(212) 212-0001
user308808
3
votes
1 answer
Point TagSoup Parser to use HTML5 version
I want to configure TagSoup to follow the HTML5 standard.
I am using the TagSoup Parser, which adheres to HTML4, which doesn't allow a…

Anish Somani
- 43
- 7
3
votes
0 answers
parsing HTML5 with Enlive/Tagsoup/JSoup
HTML5 allows tags to appear in the body, but Enlive does not seem to support this:
(deftest test-enlive
(testing "enlive"
(let [html-as-string "
the…

George Armhold
- 30,824
- 50
- 153
- 232
3
votes
1 answer
How to use JAXB with HTML?
I would like to unmarshal some nasty HTML to a Java object using JAXB. (I'm on Java 7).
TagSoup is a SAX-compliant XML parser that can handle nasty HTML.
How can I set up JAXB to use TagSoup for unmarshalling HTML?
I tried setting…

Neil McGuigan
- 46,580
- 12
- 123
- 152
3
votes
1 answer
Parsing XML in Groovy with namespace and entities
Parsing XML in Groovy should be a piece of cake, but I always run into problems.
I would like to parse a string like this:
This is a test with some formattings.
And this has a…

rdmueller
- 10,742
- 10
- 69
- 126
2
votes
1 answer
JTidy StringIndexOutOfBoundsException in JMeter
I want to retrieve content from a webpage using JMeter.
The data I'm looking for is inside a JavaScript block:
(...)