Is it possible to convert a rendered HTML page into plain text, or even formatted text?
For example, the following HTML page/code:
<html>
<head></head>
<body>
<p>This is the first paragraph</p>
<ol>
<li>This is a list item</li>
<li>And another</li>
</ol>
<p>This is the second paragraph</p>
</body>
</html>
Would be converted into the following string value:
"
This is the first paragraph
- This is a list item
- And another
This is the second paragraph
"
If so, how could I do that? Can I use a built-in object like the WebBrowser to access the rendered content?
Edit:
Solution: There does not seem to be any built-in way of converting rendered HTML into plain text. You have to use a third-party tool to do it for you, or build your own. For the third-party tool solution, look at the first link in the comments below.
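For the "build your own" route, here is a minimal sketch in Python using only the standard library's html.parser module. It is an illustration rather than a full solution: it walks the markup instead of asking a browser for the rendered result, it only treats p, li and br as line breaks, it prefixes every list item with "- " as in the example above, and it does not skip script or style content. The names TextExtractor and html_to_text are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, one output line per paragraph or list item."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished output lines
        self.current = []  # text fragments of the line being built
        self.prefix = ""   # "- " while inside an <li>

    def _flush(self):
        # Collapse runs of whitespace and emit the line if anything is left.
        text = " ".join("".join(self.current).split())
        if text:
            self.lines.append(self.prefix + text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "li", "br"):
            self._flush()
        if tag == "li":
            self.prefix = "- "

    def handle_endtag(self, tag):
        if tag in ("p", "li"):
            self._flush()
        if tag == "li":
            self.prefix = ""

    def handle_data(self, data):
        self.current.append(data)


def html_to_text(html):
    """Return the text of `html`, one line per block (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text that was never closed by a tag
    return "\n".join(parser.lines)


if __name__ == "__main__":
    sample = (
        "<html><head></head><body>"
        "<p>This is the first paragraph</p>"
        "<ol><li>This is a list item</li><li>And another</li></ol>"
        "<p>This is the second paragraph</p>"
        "</body></html>"
    )
    print(html_to_text(sample))
```

Anything that depends on CSS or scripts (hidden elements, generated content) would still need a real rendering engine or a third-party tool.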
Extra Information:
For my problem, I am basically converting an RTF document into HTML. I am using a library to do so, which can be found here: Writing your own RTF Converter
However, this library does not take into account indented lists... for example, using this converter, this RTF content:
- Some Text
More Text
a. Sub Text
Becomes, in the HTML converted version, this:
- Some Text
- More Text:
- Sub Text
In an effort to fix this problem (since the author of the library doesn't seem interested in fixing it), I decided to perform my own replacements after the content has been converted. To do this, I need to compare the original RTF text with the rendered HTML text, in order to see whether the bullets and numbering match. That is why I wanted an easy way of getting rendered HTML content into a string: I could then parse out the list items as needed and compare their headers to the RTF headers.
It seems I will have to manually parse out any OL and UL tags from the converted HTML, and assign a value myself to each LI entry within, in order to check that result against the RTF version.
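The manual OL/UL pass could look roughly like the sketch below, again in Python with the standard html.parser module. It keeps a stack of open lists, counts the LI entries in each one, and records the nesting depth and the label a browser would typically show. It assumes well-formed markup with explicit closing tags, and it only understands "-" for UL, decimal numbers for OL, and lowercase letters for OL with type="a"; the start and value attributes and CSS list styles are ignored. ListItemLabeler and label_list_items are hypothetical names, not part of the RTF converter library.

```python
from html.parser import HTMLParser


class ListItemLabeler(HTMLParser):
    """Records [depth, label, text] for every <li>, in document order."""

    def __init__(self):
        super().__init__()
        self.lists = []       # open lists: [tag, type attribute, item counter]
        self.open_items = []  # indexes into self.items for open <li> elements
        self.items = []       # [depth, label, text] per list item

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            self.lists.append([tag, dict(attrs).get("type", "1"), 0])
        elif tag == "li" and self.lists:
            current = self.lists[-1]
            current[2] += 1  # next item number within the innermost list
            self.items.append([len(self.lists), self._label(current), ""])
            self.open_items.append(len(self.items) - 1)

    def handle_endtag(self, tag):
        if tag in ("ol", "ul") and self.lists:
            self.lists.pop()
        elif tag == "li" and self.open_items:
            self.open_items.pop()

    def handle_data(self, data):
        # Text belongs to the innermost <li> that is still open.
        if self.open_items:
            self.items[self.open_items[-1]][2] += data

    @staticmethod
    def _label(current):
        tag, kind, count = current
        if tag == "ul":
            return "-"
        if kind == "a" and count <= 26:
            return chr(ord("a") + count - 1) + "."
        return f"{count}."  # assumption: everything else is decimal


def label_list_items(html):
    """Return (depth, label, text) tuples for each list item in `html`."""
    parser = ListItemLabeler()
    parser.feed(html)
    parser.close()
    return [(d, label, " ".join(text.split())) for d, label, text in parser.items]


if __name__ == "__main__":
    converted = (
        "<ul><li>Some Text</li>"
        "<li>More Text<ol type=\"a\"><li>Sub Text</li></ol></li></ul>"
    )
    for depth, label, text in label_list_items(converted):
        print("  " * (depth - 1) + label, text)
```

The resulting (depth, label, text) tuples can then be compared entry by entry against the list headers extracted from the RTF source.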
Thanks to all who contributed to this answer.