Convert Rendered HTML to text

Question

Is it possible to convert a rendered HTML page into plain text, or even formatted text?

For example, the following HTML page/code:

<html>
<head></head>
<body>
<p>This is the first paragraph</p>
<ol>
<li>This is a list item</li>
<li>And another</li>
</ol>
<p>This is the second paragraph</p>
</body>
</html>

Would be converted into the following string value:

"

This is the first paragraph

This is a list item

And another

This is the second paragraph

"

If so, how could I do that? Can I use a built-in object like the WebBrowser to access the rendered content?

Edit:

Solution: There does not seem to be any built-in way of converting rendered HTML into plain text. You have to use a third-party tool or build your own. For the third-party tool solution, look at the first link in the comments below.

Extra Information:

For my problem, I am basically converting an RTF document into HTML. I am using a library to do so, which can be found here: Writing your own RTF Converter

However, this library does not take into account indented lists... for example, using this converter, this RTF content:

Some Text

More Text

a. Sub Text

Becomes, in the HTML converted version, this:

Some Text

More Text:

Sub Text

In an effort to fix this problem (since the author of the library doesn't seem interested in fixing it), I decided to perform my own replacements after the content has been converted. To do this, I need to compare the original RTF text with the rendered HTML text, to see whether the bullet numbering matches. That is why I wanted an easy way of getting the rendered HTML content into a string: I could then parse out the list items as needed and compare their headers to the RTF headers.

It seems I will have to manually parse out any OL and UL tags from the converted HTML, and assign a value myself to each LI entry within them, in order to check that result against the RTF version.
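
A minimal sketch of that manual pass, for illustration only. It is written in Python rather than C# to keep it short; the same stack-of-counters idea should carry over to any tolerant HTML parser. The names ListLabeller and html_to_labelled_lines are hypothetical, and the sketch assumes plain decimal numbering for OL and a "*" bullet for UL. It ignores the type attribute and any CSS list styling, so it only approximates what a browser would render.

```python
from html.parser import HTMLParser

# Tags that end the current line of text (assumption: a small, non-exhaustive set).
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "h4", "h5", "h6", "tr"}


class ListLabeller(HTMLParser):
    """Collects visible text line by line and prefixes each LI with a marker."""

    def __init__(self):
        super().__init__()
        self.lists = []    # stack of [tag, next_number], one entry per open OL/UL
        self.lines = []    # finished output lines
        self.current = []  # text fragments of the line being built
        self.prefix = ""   # marker waiting to be attached to the next text

    def _flush(self):
        # Collapse runs of whitespace, as a browser does for normal flow text.
        text = " ".join("".join(self.current).split())
        if text:
            self.lines.append(self.prefix + text)
            self.prefix = ""
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            self._flush()
            start = dict(attrs).get("start")
            first = int(start) if start and start.isdigit() else 1
            self.lists.append([tag, first])
        elif tag == "li":
            self._flush()
            if self.lists:
                kind, number = self.lists[-1]
                self.lists[-1][1] += 1
                marker = f"{number}." if kind == "ol" else "*"
                indent = "  " * (len(self.lists) - 1)
                self.prefix = f"{indent}{marker} "
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("ol", "ul"):
            self._flush()
            self.prefix = ""
            if self.lists:  # tolerate a stray closing tag
                self.lists.pop()
        elif tag == "li":
            self._flush()
            self.prefix = ""  # an empty item must not leak its marker
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self.current.append(data)


def html_to_labelled_lines(html):
    """Return the text lines of `html`, with list items carrying their markers."""
    parser = ListLabeller()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.lines


sample = (
    "<p>This is the first paragraph</p>"
    "<ol><li>This is a list item</li><li>And another</li></ol>"
    "<p>This is the second paragraph</p>"
)
for line in html_to_labelled_lines(sample):
    print(line)
# This is the first paragraph
# 1. This is a list item
# 2. And another
# This is the second paragraph
```

The labelled lines can then be compared one by one against the list headers extracted from the RTF. Text inside script, style, or title elements is not filtered out here, so a real document may need that handled first.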

Thanks to all who contributed to this answer.

Yes, I'm definitely not looking for a tag stripper, but for the rendered output instead. I've looked at the linked solution, but it involves getting some third-party, unsafe tool to run on my internal network, which is not allowed here. Are there any built-in alternatives? – MaxOvrdrv, Oct 28 '14 at 13:01
As far as I know, there are no built-in alternatives. Write your own, or take one from someone else and check their code. – ReeCube, Oct 28 '14 at 13:04
Ah ok... that is sad. I will edit my question to give more details about my problem, and will probably end up making my own specific parser and renderer-to-text method... – MaxOvrdrv, Oct 28 '14 at 13:06

score 0 · Answer 1 · answered Oct 28 '14 at 13:24


Using jQuery,

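// Strips tags: parse the markup into a detached <div>, then read back only its text.
function htmlStripTags(value) {
    return $("<div/>").html(value).text();
}

// Decodes entities: a <textarea> treats its content as text, so tags survive.
function htmlDecode(value) {
    return $("<textarea/>").html(value).text();
}

// Encodes entities: set the plain text, then read it back as escaped HTML.
function htmlEncode(value) {
    return $("<textarea/>").text(value).html();
}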

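jQuery will create an in-memory "<div/>" element. Setting its HTML and reading back its text strips out the HTML tags, leaving only the text. NOTE: Using a "<textarea/>" will preserve the HTML tags, because a textarea treats its content as plain text and only decodes the entities.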

answered Oct 28 '14 at 13:24

Jason Williams


Look at the tags. This is a C# question, not a jQuery one. And I'm looking for a rendered version, not a tag-stripped version of the source. – MaxOvrdrv Oct 28 '14 at 13:34
Sorry. Your question said you wanted to "convert a rendered HTML page". It sounds like you want a solution to manipulate pre-rendered HTML. The most important question is: can you count on the HTML structure (is the structure something your server generated)? Or might it be mucked up by all sorts of HTML garbage, as if you were scraping website data from external sources? – Jason Williams Oct 28 '14 at 14:51
Hmmm... that's a bit of a loaded question for me. It is possible to have malformed HTML in there, since I didn't build the library that converts the RTF to HTML. To give a structural lowdown: the user goes into Word and creates a document, then imports the document into another app. That app converts it to RTF for itself with its own codes. Then, through that app's API, I get the RawRTF back out, which I then convert to HTML to display within yet another app we have. A spider web of apps and conversions really, but I have no say in the matter. So yes, it's possible, and that means I can't use RegExp. – MaxOvrdrv Oct 28 '14 at 17:16
I'd bet you could write a C# WebDriver application using Internet Explorer Developer Channel --> render the HTML in the browser --> use WebDriver to copy the contents of the page to the clipboard --> and dump the clipboard as text back to your application. This solution would require that the server have IE Developer Channel and the WebDriver dll installed. – Jason Williams Oct 28 '14 at 20:39
