"Render" simple html into String

Question

Is there a library to parse HTML into a String object, either in Java, C# or any other programming language out there?

This is my situation:

I have some documentation which came from a WYSIWYG editor and contains some basic HTML tags such as <p>, <br> and others, like this:

<p>This &nbsp;</p><font>etc</font><br>
<span> and this, etc.

When exported to some other tool it gets converted to plain text, which makes it unreadable. What I'm doing right now is: "Select all" > "Save as x.html" > "Open in browser" > "Select all" > "Paste".

I know this could be automated with a program.

Is there a library to do this? That is, to "render" simple HTML? Preferably to a string so I can put it into my clipboard. Removing the HTML tags is not enough, because I would get one very long line without carriage returns.

Sounds like you're not rendering so much as wanting to strip the HTML, is that right? – NotMe, Oct 23 '12 at 15:33
@ChrisLively That's correct. I didn't know what word to pick. "Render" was the closest I could think of because I want to keep \n from <br> etc. I updated the question to clarify that. – OscarRyz, Oct 23 '12 at 15:35
`c#` + `java` + `html` = `language-agnostic` in the sense that they cover most of the programmers on the planet? — kizzx2, Oct 23 '12 at 16:04

score 1 · Answer 1 · answered Oct 23 '12 at 15:51

In Java you can use javax.swing.text.html.parser.DocumentParser: http://docs.oracle.com/javase/1.5.0/docs/api/javax/swing/text/html/parser/DocumentParser.html

You can provide an HTMLEditorKit.ParserCallback that handles the text and ignores the tags.

Ashwinee K Jha


score 1 · Answer 2 · edited May 23 '17 at 12:03

For Python, you can extend this excellent function with entity-reference handling to do what you seem to need (the code below is Python 2):

from HTMLParser import HTMLParser
from htmlentitydefs import name2codepoint

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def handle_entityref(self, name):
        self.fed.append(unichr(name2codepoint[name]))
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

print strip_tags('<html>olle&lt;</br>')
# prints: olle<

It has some issues with non-ascii strings... reading the docs... — OscarRyz, Oct 23 '12 at 16:28
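Since the question asks to keep line breaks rather than just strip tags, here is a minimal Python 3 sketch of the same parser idea that emits a newline for <br> and for the end of block-level elements. The names TextExtractor, BLOCK_TAGS and html_to_text are made up for illustration, and the set of tags treated as line breaks is an assumption to adjust for your editor's output.

```python
import re
from html.parser import HTMLParser

# Closing tags that should end a line; an assumed set, adjust as needed.
BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3"}


class TextExtractor(HTMLParser):
    """Collects text and adds a newline at <br> and after block elements."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; and &lt;
        # before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Collapse runs of whitespace (including source newlines and
        # non-breaking spaces) into a single space, roughly as a browser would.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    lines = [line.strip() for line in "".join(parser.parts).splitlines()]
    return "\n".join(lines).strip()


if __name__ == "__main__":
    sample = "<p>This &nbsp;</p><font>etc</font><br>\n<span> and this, etc."
    print(html_to_text(sample))
```

Because Python 3 strings are Unicode and the parser decodes entities itself, no separate handle_entityref override is needed in this version.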

score 0 · Answer 3 · Stefan P. · Oct 23 '12 at 15:52

If I understand correctly, you want to remove all HTML tags. That's easy with C#:

var plainText = Regex.Replace(htmlString.Replace("<br>", Environment.NewLine).Replace("&nbsp;", " "), @"<[^>]*>", String.Empty);
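For comparison, the same regex idea can be sketched in Python (this is not the answer's C#, and strip_tags_regex is a hypothetical name). It decodes every named and numeric entity with the standard library's html.unescape instead of replacing &nbsp; by hand. Like any regex approach it is naive: it assumes simple markup and will misbehave on a '>' inside an attribute value, on comments, or on script blocks.

```python
import html
import re


def strip_tags_regex(source):
    # Turn <br>, <br/> and closing </p> tags into newlines before stripping.
    text = re.sub(r"<br\s*/?>|</p\s*>", "\n", source, flags=re.IGNORECASE)
    # Remove every remaining tag (naive: assumes no '>' inside attributes).
    text = re.sub(r"<[^>]*>", "", text)
    # Decode entities only after the tags are gone, so that an escaped
    # "&lt;b&gt;" in the text is not mistaken for a tag and deleted.
    text = html.unescape(text)
    # Replace non-breaking spaces with ordinary spaces.
    return text.replace("\xa0", " ")


if __name__ == "__main__":
    print(strip_tags_regex("<p>a &amp; b</p>x&lt;y<br>z&nbsp;!"))
```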

edited Oct 23 '12 at 15:52

answered Oct 23 '12 at 15:36

Stefan P.


Well I want to keep the carriage returns so I can read it. Otherwise what I have is a huge continuous string – OscarRyz Oct 23 '12 at 15:39
"That's easy", he says. Right now it might _seem_ like it. (; – Grant Thomas Oct 23 '12 at 15:45
@StefanP. What about `&nbsp;`? Easy? – L.B Oct 23 '12 at 15:47
@StefanP. and other codes such as `&amp;` `&lt;` `&gt;` etc. – L.B Oct 23 '12 at 15:55

score 0 · Answer 4 · answered Oct 23 '12 at 16:15

There are two ways to go about it:

1. Write arbitrarily sophisticated parsers to clean the data. This is what the other answers suggest. If your inputs are not very complicated, this is often the quick win.
2. However, if you have very complicated inputs and want "high fidelity", you can use a "real" browser.

A very easy option is to use PhantomJS. Here is an example that uses innerText to extract text from a web page:

var page = require('webpage').create();
console.log('The default user agent is ' + page.settings.userAgent);
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.httpuseragent.org', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var ua = page.evaluate(function () {
            return document.getElementById('myagent').innerText;
        });
        console.log(ua);
    }
    phantom.exit();
});

There are also options like the WebBrowser class (MSIE) or GeckoFX. I suspect the learning curve for either would be very steep, though.
