I would like to download a .html page with scanned text images just as I can download it via:

browser -> right click -> Save Page As ... using C#.

I have tried 3 different methods:
1. and 2. from here: How can I download HTML source in C#
3. from here: Get HTML code from website in C#

I have tried saving the file as suggested here:
Creating a file (.htm) in C# or using
System.IO.File.WriteAllText(@"C:\xy.html", htmlSourceString);

My problem is that when I open the downloaded file, the text on the images has been automatically extracted into HTML paragraphs, and the images are lost.

How can I disable this transformation?

UPDATE
Thank you for your reply! Now I understand that I have to download the images individually.

But I'm still curious: why is this transformation happening?
I have made a picture to demonstrate exactly what I'm talking about (see the linked picture).

jeti
    Images have never been an actual "part" of the HTML file - if you want to download them, you'll need to download the image individually. You can see all the things your browser loads separately on any page in the Network tab of your developer tools (F12 in any browser) – Katana314 Dec 23 '15 at 16:46

1 Answer

After saving the HTML, you will have to parse it. The HTML Agility Pack (http://www.codeplex.com/htmlagilitypack) is a good parser for this; I've used it myself many times.
With the parser, find all the <img> nodes and read their src attributes. Each src contains either an absolute or a relative URL. An absolute URL is easy: you can use it directly to download the image. A relative URL must first be resolved against the URL of the page it came from, which makes it absolute. Once every URL is absolute, you can download the images one by one.
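
The steps above could look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced. The class name `PageImageDownloader`, its method names, and the `page.html` file name are made up for illustration. It also ignores any `<base href>` element and resolves relative URLs against the page URL only.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper class, not part of any library.
public static class PageImageDownloader
{
    // Returns the absolute http(s) URL of every <img src="..."> in the HTML.
    public static List<Uri> GetImageUrls(string html, Uri pageUrl)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<Uri>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var imgNodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (imgNodes == null)
            return result;

        foreach (HtmlNode node in imgNodes)
        {
            string src = node.GetAttributeValue("src", "");
            if (string.IsNullOrWhiteSpace(src))
                continue;

            // Resolves relative URLs against the page URL; absolute URLs pass through.
            if (Uri.TryCreate(pageUrl, src, out Uri absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                result.Add(absolute); // skips data: URIs and other schemes
            }
        }
        return result;
    }

    // Saves the page source and each referenced image into targetDir.
    public static async Task DownloadPageWithImagesAsync(Uri pageUrl, string targetDir)
    {
        Directory.CreateDirectory(targetDir);
        using (var http = new HttpClient())
        {
            string html = await http.GetStringAsync(pageUrl);
            File.WriteAllText(Path.Combine(targetDir, "page.html"), html);

            int index = 0;
            foreach (Uri imageUrl in GetImageUrls(html, pageUrl))
            {
                byte[] bytes = await http.GetByteArrayAsync(imageUrl);
                string name = Path.GetFileName(imageUrl.LocalPath);
                if (string.IsNullOrEmpty(name))
                    name = "image";
                // Prefix with a counter so images with the same file name do not collide.
                File.WriteAllBytes(Path.Combine(targetDir, index++ + "_" + name), bytes);
            }
        }
    }
}
```

Note that the saved `page.html` still points at the original image URLs. If you want it to display the local copies, you would additionally have to rewrite each src attribute before saving.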

Mr Lister
woutervs