2

I have written a web-scraping program that visits a list of pages and writes all the HTML to a file. The problem is that when I pull a block of text, some of the characters get written as '�'. How do I get those characters into my text file intact? Here is my code:

string baseUri = String.Format("http://www.rogersmushrooms.com/gallery/loadimage.asp?did={0}&blockName={1}", id.ToString(), name.Trim());

// our third request is for the actual webpage after the login.
HttpWebRequest request =
(HttpWebRequest)WebRequest.Create(baseUri);
request.Method = "GET";
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)";
//get the response object, so that we may get the session cookie.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader reader = new StreamReader(response.GetResponseStream());

// and read the response
string page = reader.ReadToEnd();

StreamWriter SW;
string filename = string.Format("{0}.txt", id.ToString());
SW = File.AppendText("C:\\Share\\" + filename);

SW.Write(page);

reader.Close();
response.Close();
Dale Marshall
Encoding issues. Check out this Stack Overflow answer: http://stackoverflow.com/questions/2700638/characters-in-string-changed-after-downloading-html-from-the-internet/2700707#2700707 – Mikael Svenson Jun 14 '10 at 20:14

3 Answers

2

You're saving a page named loadimage to a text file. Are you sure that's really all text?

Either way, you can save yourself a lot of code by using System.Net.WebClient.DownloadFile().
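A minimal sketch of the WebClient approach, assuming the goal is one file per page and that id is an int (the question does not show its type). DownloadFile writes the response bytes to disk unchanged, so no text decoding happens and no character can be swapped for '�' along the way. The names PageSaver, BuildPath and SavePage are hypothetical, chosen only for illustration.

```csharp
using System.IO;
using System.Net;

public static class PageSaver
{
    // Hypothetical helper: builds "<folder>/<id>.txt", mirroring the question's file naming.
    public static string BuildPath(string folder, int id)
    {
        return Path.Combine(folder, id + ".txt");
    }

    // Hypothetical helper: saves the raw response bytes for one page.
    // Note: DownloadFile overwrites an existing file; it does not append like File.AppendText.
    public static void SavePage(string url, string folder, int id)
    {
        using (var client = new WebClient())
        {
            client.DownloadFile(url, BuildPath(folder, id));
        }
    }
}
```

With the question's variables this would be called as PageSaver.SavePage(baseUri, "C:\\Share", id). Because the bytes are stored as-is, whatever opens the file later still has to know the page's encoding.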

Joel Coehoorn
1

You need to specify your encoding in this line:

StreamReader reader = new StreamReader(response.GetResponseStream());

Also note that

File.AppendText("C:\\Share\\" + filename);

always writes UTF-8.
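To see why the encoding passed to the reader matters, here is a small sketch that needs no network access. It assumes the page is served as ISO-8859-1, which the question does not confirm. StreamReader decodes as UTF-8 when no encoding is given, and bytes that are not valid UTF-8 come out as U+FFFD, the '�' character. The names EncodingDemo, Decode and ResolveEncoding are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingDemo
{
    // Decodes raw bytes through a StreamReader using the given encoding.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes), encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Hypothetical helper: maps a charset name (for example the value of
    // HttpWebResponse.CharacterSet) to an Encoding, falling back to UTF-8
    // when the name is missing or not recognised.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrEmpty(charset))
        {
            return Encoding.UTF8;
        }
        try
        {
            return Encoding.GetEncoding(charset);
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    public static void Main()
    {
        // "café!" encoded as ISO-8859-1: 'é' is the single byte 0xE9.
        byte[] latin1Bytes = { 0x63, 0x61, 0x66, 0xE9, 0x21 };

        // Wrong encoding: 0xE9 followed by '!' is not valid UTF-8.
        string wrong = Decode(latin1Bytes, Encoding.UTF8);

        // Matching encoding: the text survives.
        string right = Decode(latin1Bytes, ResolveEncoding("ISO-8859-1"));

        Console.WriteLine(wrong.IndexOf('\uFFFD') >= 0);
        Console.WriteLine(right == "caf\u00E9!");
    }
}
```

In the question's code, the same idea would look like new StreamReader(response.GetResponseStream(), EncodingDemo.ResolveEncoding(response.CharacterSet)). That only helps if the server reports its charset correctly in the Content-Type header.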

Gregoire
0

Specify a Unicode encoding explicitly, like so:

new StreamReader(response.GetResponseStream(), Encoding.UTF8)

Do the same for the StreamWriter.

Antony