2

I'm trying to figure out a way to extract specific information (name, description, id, etc.) from an HTML file, leaving the unwanted information behind, and store it in an XML file.

I thought of using XSLT since it can transform XML to HTML, but it doesn't seem to work the other way around.

I honestly don't know what other language I should try for this. I know basic Java and JavaScript, but I'm not sure whether they can do it. I'm kind of lost on getting started.

I'm open to any advice or help, and willing to learn a new language too, as I'm just doing this for fun.

Tom

4 Answers

3

There are a number of Java libraries for handling HTML input that isn't well-formed (according to XML). These libraries also have built-in methods for querying or manipulating the document, but it's important to realize that once you've parsed the document it's usually pretty easy to treat it as though it were XML in the first place (using the standard Java XML interfaces). In other words, you only need these libraries to parse the malformed input; the other utilities they provide are mostly superfluous.

Here's an example that parses HTML using HtmlCleaner and then converts the result into a standard org.w3c.dom.Document:

TagNode tagNode = new HtmlCleaner().clean("<html><div><p>test");
DomSerializer ser = new DomSerializer(new CleanerProperties());
org.w3c.dom.Document doc = ser.createDOM(tagNode);

In Jsoup, simply parse the input and serialize it into a string:

String text = Jsoup.parse("<html><div><p>test").outerHtml();

And convert that string into a W3C Document using one of the methods described here:
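One such method, sketched here with only the JDK's built-in parser, is to feed the string to a DocumentBuilder. This assumes the serialized string is well-formed XML; for Jsoup that may mean switching its output settings to XML syntax first, since the default HTML output can leave void tags such as `<br>` unclosed. The class name `StringToDom` and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class StringToDom {
    // Parses a string that is assumed to be well-formed XML/XHTML
    // into a standard W3C Document.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the cleaned-up output of an HTML parser (assumption).
        String text = "<html><head></head><body><div><p>test</p></div></body></html>";
        Document doc = parse(text);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

If the parser throws a SAXParseException, the string is most likely still HTML rather than XML and needs another cleaning pass.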

You can now use the standard JAXP interfaces to transform this document:

TransformerFactory tFact = TransformerFactory.newInstance();
Transformer transformer = tFact.newTransformer();
Source source = new DOMSource(doc);
Result result = new StreamResult(System.out);
transformer.transform(source, result);

Note: Provide some XSLT source to tFact.newTransformer() to do something more useful than the identity transform.
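As a rough sketch of what such an XSLT source might look like for the question's use case, the example below pulls an id, a name and a description out of a page and writes them as XML. The page layout (a `div` with class `item` containing an `h2` and a `p`), the output element names and the class name `ExtractWithXslt` are all assumptions made for illustration. It reads the input from a string so that it runs on its own; with the `doc` built above you would pass `new DOMSource(doc)` as the source instead.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ExtractWithXslt {
    // Hypothetical stylesheet: turns every <div class="item"> into an <item>
    // element and ignores everything else on the page.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<items><xsl:apply-templates select=\"//div[@class='item']\"/></items>"
      + "</xsl:template>"
      + "<xsl:template match='div'>"
      + "<item id='{@id}'>"
      + "<name><xsl:value-of select='h2'/></name>"
      + "<description><xsl:value-of select='p'/></description>"
      + "</item>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to well-formed (X)HTML and returns the XML result.
    static String transform(String xhtml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xhtml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page; the second div is the "unwanted" content.
        String page = "<html><body>"
            + "<div class='item' id='42'><h2>Widget</h2><p>A small widget.</p></div>"
            + "<div class='ad'>Buy now</div>"
            + "</body></html>";
        System.out.println(transform(page));
    }
}
```

Note that the XPath expressions match un-namespaced element names; if the cleaned document places its elements in the XHTML namespace, the stylesheet would need a matching namespace prefix.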

Wayne
2

I would use HTMLAgilityPack or Chris Lovett's SGMLReader.

Or, simply HTML Tidy.

Dimitre Novatchev
0

Ideally, you can treat your HTML as XML. If you're lucky, it will already be XHTML, and you can process it as XML directly. If not, use something like http://nekohtml.sourceforge.net/ (an HTML tag balancer, etc.) to turn the HTML into something XML-compliant so that you can use XSLT.

I have a specific example and some notes around doing this on my personal blog at http://blogger.ziesemer.com/2008/03/scraping-suns-bug-database.html.
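As a small illustration of the "treat it as XML" idea (not taken from the linked post), the sketch below parses a well-formed page with the JDK's XML parser and reads one value with XPath. The sample markup, the XPath expression and the class name `XhtmlAsXml` are assumptions; real XHTML with a namespace declaration or a DOCTYPE would need extra parser configuration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XhtmlAsXml {
    // Parses well-formed markup as XML and returns the string value
    // of the given XPath expression.
    static String evaluate(String xhtml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Made-up, already well-formed page with no namespace or DOCTYPE.
        String page = "<html><body><div id='42'><h2>Widget</h2></div></body></html>";
        System.out.println(evaluate(page, "//div[@id='42']/h2"));
    }
}
```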

ziesemer
0
  • TagSoup
  • JSoup
  • Beautiful Soup
bmargulies