I am trying to use MSHTML to extract plain text from the HTML of a website. It appears to work and gives fairly clean plain text (suggestions for a better HTML-to-plain-text solution are welcome too).
Everything works except that I frequently get a "Windows Security Warning" popup asking whether I want to allow the website to put cookies on my computer (I have seen this warning before when using IE). It also periodically opens Google Chrome to the Google sign-in page, which is very odd. Is there some way to disable all script execution and external resource loading? I only want the plain text and don't need the page to actually execute.
Here's my code:
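// Requires a reference to the Microsoft.mshtml interop assembly (using mshtml;)
HTMLDocument htmldoc = new HTMLDocument();
IHTMLDocument2 htmldoc2 = (IHTMLDocument2)htmldoc;
htmldoc2.write(new object[] { currentCode });
htmldoc2.close(); // close the document stream opened by write() so parsing completes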
currentCode = sanitizeText(htmldoc2.body.outerText.Replace('\n', ' ').Replace('\r', ' ').Replace('\"', '"'), false, false);