
I am trying to use MSHTML to get plain text from the HTML of a website. It appears to be working and produces fairly clean plain text (suggestions for a better HTML-to-plain-text solution are welcome too).

Everything works fine except that I frequently get a "Windows Security Warning" popup asking whether I want to allow the website to put cookies on my computer (I have seen this warning before when using IE). It also periodically opens Google Chrome to the Google sign-in page, which is very odd. Is there some way to disable all script and external resource loading? I only want to get the plain text and don't need the page to actually execute.

Here's my code:

using mshtml; // COM interop: add a reference to "Microsoft HTML Object Library"

HTMLDocument htmldoc = new HTMLDocument();
IHTMLDocument2 htmldoc2 = (IHTMLDocument2)htmldoc;
htmldoc2.write(new object[] { currentCode }); // parse the raw HTML string
// Flatten line breaks to spaces; the original .Replace('\"', '"') was a no-op (both literals are the same char)
currentCode = sanitizeText(htmldoc2.body.outerText.Replace('\n', ' ').Replace('\r', ' '), false, false);
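
One idea I've come across but haven't been able to verify: switching the document into design mode before writing. MSHTML reportedly does not execute scripts while designMode is "On", though I'm not sure whether that also blocks external resource loading:

// Unverified: MSHTML is reported not to run scripts while the document
// is in design mode, so this may make the write() call side-effect free.
HTMLDocument htmldoc = new HTMLDocument();
IHTMLDocument2 htmldoc2 = (IHTMLDocument2)htmldoc;
htmldoc2.designMode = "On"; // switch to design (edit) mode before writing
htmldoc2.write(new object[] { currentCode });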
  • If you use such paranoid security settings and are willing to disable scripts altogether, why not use HtmlAgilityPack, like many other web-scraper solutions in C#? – Alexei Levenkov Mar 28 '14 at 20:43
  • "Paranoid security settings"? They are set to the default :) I haven't changed my security settings. And yes, I have tried to use HtmlAgilityPack but I found that it doesn't produce very clean plain-text (at least with the code I was using). I would be happy to give HtmlAgilityPack another try though. Could you point me in the right direction to find code that would convert HTML to plain-text using HtmlAgilityPack? – abagshaw Mar 28 '14 at 20:51
  • See http://stackoverflow.com/questions/731649/how-can-i-convert-html-to-text-in-c. The HTML Agility Pack has an HTML-to-Text sample (or it did). Also, the OP employed a different method that involved using the lynx.exe text-mode browser. – Jim Mischel Mar 28 '14 at 21:50
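
Following up on the HtmlAgilityPack suggestion in the comments, here is a rough sketch of the kind of HTML-to-text conversion being discussed (an illustrative helper, not the code from the linked answer). HtmlAgilityPack only parses, so no scripts run and no external resources are fetched:

using System;
using System.Linq;
using HtmlAgilityPack; // Install-Package HtmlAgilityPack

static string HtmlToPlainText(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html); // parse only: no scripts run, nothing is downloaded

    // InnerText would include <script>/<style> contents, so remove those nodes first
    foreach (var node in doc.DocumentNode.Descendants()
             .Where(n => n.Name == "script" || n.Name == "style").ToList())
    {
        node.Remove();
    }

    // Decode entities (&amp;, &#39;, ...) and collapse runs of whitespace
    string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    return string.Join(" ", text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries));
}

With something along those lines, currentCode = HtmlToPlainText(currentCode) could replace the MSHTML block entirely.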

0 Answers