
I wrote a program that crawls a website, extracts data, and writes it to an Excel sheet. The program is written in C# using Microsoft Visual Studio 2010.

Most of the time, I have no problem getting content from the website, parsing it, and storing the data in Excel.

However, once in a while I run into an error saying that there are illegal characters (such as ▶) that prevent writing to the Excel file, and the program crashes. I also checked the website manually and found other suspicious characters such as Ú.

I tried to do a .Replace() but the code can't seem to find those characters.

string htmlContent = getResponse(url); //get full html from given url
string newHtml = htmlContent.Replace("▶", "?").Replace("Ú", "?");
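One way to see why `.Replace()` misses a character is to print the code point of everything outside printable ASCII: what renders as ▶ on screen may actually be a control character such as U+0010, which would not match the literal in the source file. The following is only a minimal sketch; `CharInspector` and `DescribeSuspiciousChars` are hypothetical names, not part of the original program.

```csharp
using System;
using System.Collections.Generic;

public static class CharInspector
{
    // Returns "U+XXXX at index N" for every char outside printable ASCII
    // (space through tilde), ignoring tab, line feed, and carriage return.
    public static List<string> DescribeSuspiciousChars(string text)
    {
        var found = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            bool isCommonWhitespace = c == '\t' || c == '\n' || c == '\r';
            bool isPrintableAscii = c >= ' ' && c <= '~';
            if (!isCommonWhitespace && !isPrintableAscii)
            {
                // Format the UTF-16 code unit as four hex digits.
                found.Add(string.Format("U+{0:X4} at index {1}", (int)c, i));
            }
        }
        return found;
    }
}
```

Calling `CharInspector.DescribeSuspiciousChars(htmlContent)` and logging the result should show the real code points, which can then be used as the target of a replace (for example `"\u0010"`).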

So my question is: is there a way to strip all characters of this kind from an HTML string (the HTML of the web page)? Below is the error message I got.

Update: I tried Anthony's and woz's solutions, and neither worked.

[Image: screenshot of the error message]

sora0419
  • Excel allows those characters. – Joel Coehoorn Dec 11 '13 at 19:19
  • @JoelCoehoorn I put those characters directly into an Excel sheet and it had no problem. From code, though, it fails and the program breaks. I backtracked to the string that causes the issue, and the only suspicious characters in it are the ones in my example. – sora0419 Dec 11 '13 at 19:27

3 Answers


See System.Text.Encoding.Convert

Example usage:

var htmlText = getResponse(url); // the text you're trying to convert (getResponse is the helper from the question)

var convertedText = System.Text.Encoding.ASCII.GetString(
    System.Text.Encoding.Convert(
        System.Text.Encoding.Unicode,
        System.Text.Encoding.ASCII,
        System.Text.Encoding.Unicode.GetBytes(htmlText)));

I tested this with the string ▶Hello World and it gave me ?Hello World.

Anthony
  • Looks like the best answer. – drankin2112 Dec 11 '13 at 19:43
  • @drankin2112 It's my understanding that he wants to strip the Unicode characters, at which point the `htmlText` is already Unicode, though I'm not particularly knowledgeable about string encodings out in the wild, so I may very well be mistaken. – Anthony Dec 11 '13 at 19:46
  • You're not mistaken, I spoke too soon. I guess I should investigate my answer before posting :) – drankin2112 Dec 11 '13 at 19:47
  • @Anthony Thanks for replying. I tried your method and it's still giving me the error. Please see my update. – sora0419 Dec 11 '13 at 21:50
  • Not at my computer, but looks like you need to pass the string through an XML serializer. It may be worthwhile to see responses to this question: http://stackoverflow.com/questions/157646/best-way-to-encode-text-data-for-xml – Anthony Dec 12 '13 at 02:18

You could try stripping all non-ASCII characters.

string htmlContent = getResponse(url);
string newHtml = Regex.Replace(htmlContent, @"[^\u0000-\u007F]", "?");
woz
  • Thanks for replying. I tried your method and it's still giving me the error. Please see my update. – sora0419 Dec 11 '13 at 21:50
  • Looks like it's falling over on ASCII character 0x10, so you probably need to add a separate replace for that specific character (\u0010) – barrowc Dec 12 '13 at 05:28

Thank you for the replies and the help.

After a couple more hours of googling, I found the solution to my question. The problem was that I had to "sanitize" my HTML string.

http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/

The article above was the helpful one I found; it also provides a code example.
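For readers who cannot open the link, here is a rough sketch of what such sanitizing can look like. It is not the article's code. It assumes the error came from characters that XML 1.0 does not allow (the article title suggests an XML "invalid character" error), and the names `XmlTextSanitizer` and `RemoveInvalidXmlChars` are made up for illustration.

```csharp
using System;
using System.Text;

public static class XmlTextSanitizer
{
    // True if the UTF-16 code unit falls in the XML 1.0 Char ranges
    // #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD.
    // Characters above #xFFFF are handled as surrogate pairs below.
    private static bool IsLegalXmlChar(char c)
    {
        return c == '\t' || c == '\n' || c == '\r'
            || (c >= '\u0020' && c <= '\uD7FF')
            || (c >= '\uE000' && c <= '\uFFFD');
    }

    // Returns a copy of text with every character that XML 1.0 forbids removed.
    public static string RemoveInvalidXmlChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                // A well-formed surrogate pair encodes #x10000-#x10FFFF: keep both units.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (IsLegalXmlChar(c))
            {
                sb.Append(c);
            }
            // Anything else (control characters, lone surrogates) is dropped.
        }
        return sb.ToString();
    }
}
```

With this sketch, `string newHtml = XmlTextSanitizer.RemoveInvalidXmlChars(getResponse(url));` would drop a control character such as U+0010 while leaving legal non-ASCII letters like Ú intact.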

sora0419