Questions tagged [tag-soup]

TagSoup is a SAX-compliant parser written in Java that parses HTML as it is found in the wild.
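
Several questions below ask how to get from a SAX parser to a DOM tree (for XPath, XOM, or a DocumentBuilder-based library). A minimal sketch of one common approach, an identity transform from a `SAXSource` into a `DOMResult`, is shown here. It uses only JDK classes, with the JDK's own `XMLReader` standing in for TagSoup so the example runs without extra jars. The class name `SaxToDom` and method `toDom` are hypothetical; with TagSoup on the classpath, the assumption is that `new org.ccil.cowan.tagsoup.Parser()` would be passed as the reader instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxToDom {
    // Hypothetical helper: builds a DOM Document from any SAX XMLReader
    // by running an identity transform from a SAXSource into a DOMResult.
    public static Document toDom(XMLReader reader, String markup) throws Exception {
        DOMResult result = new DOMResult();
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(markup)));
        // newTransformer() with no stylesheet copies the input unchanged.
        TransformerFactory.newInstance().newTransformer().transform(source, result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader from the JDK; it only accepts well-formed XML.
        // Assumption: with TagSoup available, replace this with
        // new org.ccil.cowan.tagsoup.Parser() to accept malformed HTML.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        Document doc = toDom(reader, "<html><body><p>Hello</p></body></html>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Because the helper depends only on the `XMLReader` interface, the same code path should work for any SAX-compliant parser; the resulting `Document` can then be handed to `javax.xml.xpath` or similar DOM-based APIs.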

38 questions
13
votes
5 answers

How to get an attribute from an XMLReader

I have some HTML that I'm converting to a Spanned using Html.fromHtml(...), and I have a custom tag that I'm using in it: So I've implemented a TagHandler to handle this custom tag, like so: public void handleTag( boolean…
Jason Robinson
  • 31,005
  • 19
  • 77
  • 131
6
votes
1 answer

jTidy and TagSoup documentation

I'm looking for documentation (official documentation, if possible) for the TagSoup and jTidy libraries. I want to use these libraries to manipulate HTML "tagsoup" files that include XML tags with different namespaces mixed in with HTML (html, xhtml…
angelcervera
  • 3,699
  • 1
  • 40
  • 68
5
votes
3 answers

TagSoup fails to parse HTML document from a StringReader (Java)

I have this function: private Node getDOM(String str) throws SearchEngineException { DOMResult result = new DOMResult(); try { XMLReader reader = new Parser(); …
zajcev
  • 303
  • 2
  • 7
4
votes
1 answer

XPath Expression returns nothing for //element, but //* returns a count

I'm using XOM with the following sample data: Element root = cleanDoc.getRootElement(); //find all the bold elements, as those mark institution and clinic. Nodes nodes = root.query("//*");
Stefan Kendall
  • 66,414
  • 68
  • 253
  • 406
4
votes
1 answer

Using a SAX parser when I need a DocumentBuilder

XMLBeam is a nice XML to POJO unmarshaler (via XPath), but it only allows you to configure a DocumentBuilder or DocumentBuilderFactory. TagSoup is a nice SAX parser that lets you parse nasty HTML documents as though they were XML. I would like to…
Neil McGuigan
  • 46,580
  • 12
  • 123
  • 152
4
votes
2 answers

Extract URL from href-tag in groovy

I need to parse a malformed HTML-page and extract certain URLs from it as any kind of Collection. I don't really care what kind of Collection, I just need to be able to iterate over it. Let's say we have a structure like this: …
Jakunar
  • 156
  • 1
  • 2
  • 7
3
votes
1 answer

Wrap a tag around plain html text

I have this structure in my html document:

"You began the evening well, Charlotte," said Mrs. Bennet with civil self–command to Miss Lucas. "You were Mr. Bingley's first choice."

But I need my "plain…
Richard
  • 14,427
  • 9
  • 57
  • 85
3
votes
1 answer

TagSoup and XPath

I'm trying to use TagSoup with XPath (JAXP). I know how to obtain a SAX parser (or XMLReader) from TagSoup, but I couldn't find out how to create a DocumentBuilder that uses that SAX parser. How do I do that? Thank you. EDIT: Sorry for being so general…
IgorY
  • 535
  • 4
  • 7
3
votes
3 answers

Strange behavior with tagsoup and Groovy's XmlSlurper

Let's say I want to parse the phone number from an XML string like this: str = """
123 New York, NY 10019
(212) 212-0001
user308808
3
votes
1 answer

Point TagSoup Parser to use HTML5 version

I want TagSoup to follow the HTML5 standard. I am using the TagSoup Parser, which adheres to HTML4, which doesn't allow a
3
votes
0 answers

parsing HTML5 with Enlive/Tagsoup/JSoup

HTML5 allows tags to appear in the body, but Enlive does not seem to support this: (deftest test-enlive (testing "enlive" (let [html-as-string "
the…
George Armhold
  • 30,824
  • 50
  • 153
  • 232
3
votes
1 answer

How to use JAXB with HTML?

I would like to unmarshal some nasty HTML to a Java object using JAXB. (I'm on Java 7.) TagSoup is a SAX-compliant XML parser that can handle nasty HTML. How can I set up JAXB to use TagSoup for unmarshalling HTML? I tried setting…
Neil McGuigan
  • 46,580
  • 12
  • 123
  • 152
3
votes
1 answer

Parsing XML in Groovy with namespace and entities

Parsing XML in Groovy should be a piece of cake, but I always run into problems. I would like to parse a string like this:

This is a test with some formattings.
And this has a…

rdmueller
  • 10,742
  • 10
  • 69
  • 126
2
votes
1 answer

JTidy StringIndexOutOfBoundsException in JMeter

I want to retrieve content from a webpage using JMeter. The data I'm looking for is inside a JavaScript block: (...)