
I have a problem with a WebRequest in C#. The page I am requesting is a Google page.

The HTTP response header states

text/html; charset=ISO-8859-1

The page's own markup states

<meta http-equiv=content-type content="text/html; charset=utf-8">

And finally, I only get the expected result, both in the debugger and in my regular expression, when I use Encoding.Default, which resolves to System.Text.SBCSCodePageEncoding.

Now what do I do? Do you have any hints as to how this could happen or how I could solve it?

The actual encoding of the page seems to be UTF-8. At least Firefox displays it correctly as UTF-8, not as Windows-whatever and not as Latin-1.

The URL is this

The problem is the € sign as well as all German umlauts.

Thanks in advance for your help with this problem, which is driving me seriously crazy!

Update: when I output the string via

// Create a writer and open the file (StreamWriter writes UTF-8 by default)
using (TextWriter tw = new StreamWriter("test.txt"))
{
    // Write the downloaded HTML to the file
    tw.WriteLine(html);
} // Disposing the writer flushes and closes the file

it all works fine.

So it seems the problem is that neither the debugger nor the regular expression handles the encoding correctly.

How do I tell C# to handle the regex as UTF-8?
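.NET strings are UTF-16 internally, so a regex has no encoding of its own; what matters is which encoding turns the response bytes into a string. The following is a minimal sketch of that idea, not a confirmed fix for this page. It assumes the body really is UTF-8 despite the header; `DecodeSketch`, `Decode`, and the sample text are hypothetical, and the byte array stands in for what something like `WebClient.DownloadData` would return.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class DecodeSketch
{
    // Hypothetical helper: decode raw response bytes with an explicitly chosen encoding
    // instead of trusting the charset announced in the HTTP header.
    public static string Decode(byte[] raw, Encoding encoding)
    {
        return encoding.GetString(raw);
    }

    public static void Main()
    {
        // Stand-in for the response body bytes (assumption: the server sends UTF-8).
        // \u20AC is the euro sign, \u00FC is u-umlaut.
        byte[] raw = Encoding.UTF8.GetBytes("Preis: 12,50 \u20AC f\u00FCr M\u00FCller");

        // Decoding as the header claims (ISO-8859-1) versus as the bytes really are (UTF-8).
        string wrong = Decode(raw, Encoding.GetEncoding("ISO-8859-1"));
        string right = Decode(raw, Encoding.UTF8);

        // The regex works on the decoded string; it never sees bytes.
        Regex price = new Regex(@"(\d+,\d+) \u20AC");
        Console.WriteLine(price.IsMatch(wrong));               // False
        Console.WriteLine(price.IsMatch(right));               // True
        Console.WriteLine(price.Match(right).Groups[1].Value); // 12,50
    }
}
```

If the pattern only matches after decoding with one particular encoding, that is a strong hint about what the bytes actually are, independent of what the header or the meta tag claims.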

abatishchev
Scoox
  • Have you looked into using the `GetBytes()` methods on the relevant encoding classes to convert your string from one encoding to another? – RobV Feb 01 '11 at 13:09
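The `GetBytes()` idea from the comment above can be sketched as follows, under the assumption that the string was produced by decoding UTF-8 bytes as ISO-8859-1. `ReencodeSketch` and `RepairLatin1AsUtf8` are hypothetical names, and the approach only works when the wrong decode was lossless, which holds for ISO-8859-1 but may not for other code pages.

```csharp
using System;
using System.Text;

public static class ReencodeSketch
{
    // Hypothetical helper: undo a wrong ISO-8859-1 decode of bytes that were really UTF-8.
    public static string RepairLatin1AsUtf8(string garbled)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        // ISO-8859-1 maps every byte to exactly one char, so this recovers the original bytes.
        byte[] original = latin1.GetBytes(garbled);
        // Decode those bytes again, this time as UTF-8.
        return Encoding.UTF8.GetString(original);
    }

    public static void Main()
    {
        // Simulate the problem: UTF-8 bytes for the euro sign and three umlauts,
        // decoded with the wrong encoding.
        string expected = "\u20AC \u00E4\u00F6\u00FC";
        byte[] utf8 = Encoding.UTF8.GetBytes(expected);
        string garbled = Encoding.GetEncoding("ISO-8859-1").GetString(utf8);

        string repaired = RepairLatin1AsUtf8(garbled);
        Console.WriteLine(garbled == expected);  // False
        Console.WriteLine(repaired == expected); // True
    }
}
```

Decoding the raw bytes correctly in the first place is usually cleaner than repairing a string afterwards; the round trip is mainly useful when only the already-decoded string is available.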

1 Answer


Rather than parsing HTML, why not use the Google Query API?

BTW, before parsing HTML using regexes, read this ;-)

EDIT: In answer to your comment:

  1. The API works for Google Desktop as well.
  2. Is this encoding issue specific to the Google page?
  3. In addition to the problem you have now, who knows what problem you'll run into later, when in production, due to subtle changes in the HTML of these pages, or in the header sent back by the Web server. A web page is supposed to be human eye-friendly, not computer friendly. The only thing you can expect to be friendly is the appearance and rendered contents of the page, not the underlying HTML structure. As opposed to an API, which is supposed to be computer-friendly.
Community
Serge Wautier
  • 1) That's for Google Desktop 2) I need to parse other pages as well 3) It works perfectly fine except for the encoding issues. – Scoox Feb 01 '11 at 11:45
  • @Scoox: Here is the [correct](http://code.google.com/p/google-api-for-dotnet/) link – abatishchev Feb 01 '11 at 13:05
  • Dear Serge, I understand your comment. However, in this case the regex really fits my needs. There are only ~15 pages that need to be parsed, and keeping these regexes up to date is quite feasible. The HTML structure could change just as well. And there are no APIs for the other websites. Therefore HTML parsing, whether via regex or XSS-like selectors, is the only possible solution. E.g. afaik there is no API for Google Product Search (which is used here). Thanks anyway, you are right in general. – Scoox Feb 01 '11 at 13:10