-1

Is there a library to parse HTML into a String object, either in Java, C# or any other programming language out there?

This is my situation:

I have some documentation that came from a WYSIWYG editor and contains some basic HTML tags such as <p>, <br> and others, like this:

<p>This &nbsp;</p><font>etc</font><br>
<span> and this, etc.

When exported to some other tool, it gets converted to plain text with the tags left in, making it unreadable. What I'm doing right now is: "Select all" > "Save as x.html" > "Open in browser" > "Select all" > "Paste".

I know this could be automated with a program.

Is there a library to do this? That is, to "render" simple HTML? Preferably to a string, so I can put it into my clipboard. Removing the HTML tags is not enough, because I would get one very long line without carriage returns.

OscarRyz
  • Sounds like you're not rendering so much as wanting to strip the HTML, is that right? – NotMe Oct 23 '12 at 15:33
  • @ChrisLively That's correct. I didn't know what word to pick. "render" was the closest I could think of because I want to keep the \n line breaks that the tags produce.
    I updated the question to clarify that.
    – OscarRyz Oct 23 '12 at 15:35
  • `c#` + `java` + `html` = `language-agnostic` in the sense that they cover most of the programmers on the planet? – kizzx2 Oct 23 '12 at 16:04

4 Answers

1

In Java you can use http://docs.oracle.com/javase/1.5.0/docs/api/javax/swing/text/html/parser/DocumentParser.html

You can provide a ParserCallback that handles the text and ignores the tags.

Ashwinee K Jha
1

For Python, you can extend this excellent function with entity reference handling to do what you seem to need:

from HTMLParser import HTMLParser
from htmlentitydefs import name2codepoint

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def handle_entityref(self, name):
        self.fed.append(unichr(name2codepoint[name]))
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

print strip_tags('<html>olle&lt;</br>')
 olle<
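
The function above drops every tag, so line breaks are lost. A minimal sketch of the same idea for Python 3 (where the module is html.parser and entities are decoded by the parser itself) that also emits a newline for line-breaking tags could look like this. The names TextExtractor, html_to_text and LINE_END_TAGS are hypothetical, and the choice of which tags end a line is an assumption to adjust for your documents:

```python
from html.parser import HTMLParser

# Closing tags that should end a line; this set is an assumption, extend as needed.
LINE_END_TAGS = {"p", "div", "li", "tr"}


class TextExtractor(HTMLParser):
    """Collects text content and inserts newlines for line-breaking tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode &nbsp;, &lt;, etc.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> and <br/> both arrive here (the latter via handle_startendtag).
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in LINE_END_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # Treat non-breaking spaces as plain spaces, trim lines, drop empty ones.
    text = "".join(parser.parts).replace("\xa0", " ")
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


print(html_to_text("<p>This &nbsp;</p><font>etc</font><br><span> and this, etc."))
```

Note that this sketch treats newlines already present in the HTML source as line breaks, whereas a browser would collapse them into spaces.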
Joachim Isaksson
0

If I get it right, you want to remove all HTML tags. That's easy with C#:

var plainText = Regex.Replace(htmlString.Replace("<br>", Environment.NewLine).Replace("&nbsp;", " "), @"<[^>]*>", String.Empty);
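
For comparison, here is a rough sketch of the same regex idea in Python 3. It is not a translation of the line above: it additionally matches <br/>, <br /> and closing </p> tags and decodes all entities rather than only &nbsp;. The function name strip_tags_regex is hypothetical, and regexes like this are only reasonable for simple, well-behaved markup such as WYSIWYG output:

```python
import html
import re


def strip_tags_regex(markup):
    # Turn <br>, <br/>, <br /> and closing </p> into newlines (case-insensitive).
    text = re.sub(r"<br\s*/?>|</p\s*>", "\n", markup, flags=re.IGNORECASE)
    # Drop every remaining tag.
    text = re.sub(r"<[^>]*>", "", text)
    # Decode entities only after the tags are gone, so that an escaped
    # "&lt;b&gt;" is not mistaken for a tag; then normalise &nbsp;.
    return html.unescape(text).replace("\xa0", " ")


print(strip_tags_regex("<p>one&nbsp;two</p>three<BR />four"))
```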
Stefan P.
0

There are two ways to go about it:

  1. Write arbitrarily sophisticated parsers to clean the data. This is what has been suggested by the other answers. If your inputs are not very complex, this is often the quick win.

  2. However, if you have very complicated inputs and want "high fidelity", you can use a "real" browser.

A very easy option is to use PhantomJS. Here is an example that uses innerText to extract text from a web page:

var page = require('webpage').create();
console.log('The default user agent is ' + page.settings.userAgent);
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.httpuseragent.org', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var ua = page.evaluate(function () {
            return document.getElementById('myagent').innerText;
        });
        console.log(ua);
    }
    phantom.exit();
});

There are also options like the WebBrowser class (MSIE) or GeckoFX. I suspect the learning curve would be steep for either of those, though.

kizzx2