I'm having a problem with encoding in C#.

I'm downloading an XML file encoded in Windows-1250. When I save it to a file, special characters like Š and Đ are replaced with ?, even though the file is saved using the Windows-1250 encoding.

this is an example of my code (simplified):

var res = Encoding.GetEncoding("Windows-1250").GetBytes(client.DownloadString("http://link/file.xml"));
var result = Encoding.GetEncoding("Windows-1250").GetString(res);
File.AppendAllText("file.xml", result);

The XML file is in fact encoded using Windows-1250, and it reads just fine when I download it using the browser.

Does anyone know what's going on here?

Panagiotis Kanavos
Fakku
  • The problem is caused by the code itself. There are no special characters as .NET strings are Unicode. `DownloadString()` returns a *Unicode* string. There's no need to force any conversion to Windows-1250. In the end, `result` is *still* a `string`, still Unicode. – Panagiotis Kanavos Feb 28 '19 at 15:05
  • probable solution https://stackoverflow.com/questions/5568033/convert-a-strings-character-encoding-from-windows-1252-to-utf-8 – Gianmarco Varriale Feb 28 '19 at 15:06
  • The proof is your own question. StackOverflow is an ASP.NET web app running on top of SQL Server, storing questions in Unicode (nvarchar) fields. If any kind of conversion was needed you wouldn't be able to type `Š` and `Đ` – Panagiotis Kanavos Feb 28 '19 at 15:06
  • Long story short, just remove all your code and write `var xml=client.DownloadString(..); File.AppendAllText("file.xml",xml);` – Panagiotis Kanavos Feb 28 '19 at 15:07
  • How about using `client.DownloadFile("http://link/file.xml", "file.xml");`? – Renatas M. Feb 28 '19 at 15:09
  • Another thing `even tho the file is saved correctly using the windows-1250 encoding.` is wrong. The default encoding used by `File.AppendAllText` is UTF8 [as the source code shows](https://referencesource.microsoft.com/#mscorlib/system/io/file.cs,1141) – Panagiotis Kanavos Feb 28 '19 at 15:13
  • @Fakku even if the server file was encoded in Windows-1250, `DownloadString` would still return the correct string *provided the web server returned the correct encoding in the `Content-Type` header*. – Panagiotis Kanavos Feb 28 '19 at 15:17
  • @PanagiotisKanavos using your approach and removing explicit encoding functions the file is saved as windows-1250 but all special characters are replaced with "ďż˝" which is not what i need. the problem i have is that the web server which serves the xml file doesn't set the correct encoding in header, thus i have to manually specify in my code. – Fakku Feb 28 '19 at 23:59
  • @Reniuz it's the same thing, i don't actually even have to save anything to a file, i'm just using AppendAllText for debug purposes, what my code does is it deserializes the xml and the situation is the same, encoding problems. – Fakku Feb 28 '19 at 23:59
  • @Fakku if the *server's* encoding is wrong the text returned by DownloadString is *already* mangled. You'll have to use `DownloadData` and `Encoding.GetString()` on the buffer result, or `OpenRead` and a StreamReader with the Windows-1250 encoding. – Panagiotis Kanavos Mar 01 '19 at 07:44
  • @Fakku another option is to set the `WebClient.Encoding` property to the encoding you expect. `DownloadString` will try to guess the encoding if there's no `charset` parameter and fall back to the encoding specified in `WebClient.Encoding`. [The default value](https://referencesource.microsoft.com/#system/net/System/Net/webclient.cs,44) is .... `Encoding.Default`, the encoding that corresponds to *your account's locale*. – Panagiotis Kanavos Mar 01 '19 at 07:54
  • @PanagiotisKanavos that was causing the problem, i had a custom webclient class which was setting the webclient encoding to utf8. i've set it to 1250 and now it's working. – Fakku Mar 01 '19 at 11:18
  • @Fakku UTF8 is the de facto standard. The server should either include a `charset` parameter or use UTF8, not a local codepage. I'd bet it's an old system from the very early 00s if not late 90s, back when people didn't use UTF8 because ... there were too few users and not enough Unicode fonts. Web page encoding was a common problem back then until everyone settled on UTF8. Back then people often had to select the correct web page codepage from the browser's menu to read a page – Panagiotis Kanavos Mar 01 '19 at 11:25
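
The two workarounds suggested in the comments above can be sketched roughly as follows. This is only a minimal sketch, not the asker's actual code: the class and method names (`Windows1250Download`, `Decode`, `DownloadViaBytes`, `DownloadViaEncodingProperty`) are hypothetical, and the `Encoding.RegisterProvider` call assumes a runtime such as .NET Core or .NET 5+, where code-page encodings are not registered by default. On a .NET Framework version where Windows-1250 is already available, that line can probably be dropped.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper class, for illustration only.
static class Windows1250Download
{
    static Windows1250Download()
    {
        // Assumption: required on .NET Core / .NET 5+ so that
        // Encoding.GetEncoding("windows-1250") can be resolved.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Decodes raw response bytes as Windows-1250, regardless of what
    // the server declared (or failed to declare) in Content-Type.
    public static string Decode(byte[] raw)
    {
        return Encoding.GetEncoding("windows-1250").GetString(raw);
    }

    // Option 1: download the bytes and decode them explicitly.
    public static string DownloadViaBytes(string url)
    {
        using (var client = new WebClient())
        {
            return Decode(client.DownloadData(url));
        }
    }

    // Option 2: tell WebClient which encoding to fall back to when
    // the response carries no charset.
    public static string DownloadViaEncodingProperty(string url)
    {
        using (var client = new WebClient())
        {
            client.Encoding = Encoding.GetEncoding("windows-1250");
            return client.DownloadString(url);
        }
    }
}
```

Either method would be called with the real URL, for example `Windows1250Download.DownloadViaBytes("http://link/file.xml")`. Option 1 ignores the server's header entirely, which may be the safer choice when the header is known to be missing or wrong.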

1 Answer

The problem could come from two different places, one at the beginning and one at the end of your snippet. And as has been pointed out, the encoding and decoding you do in your code is redundant, because the origin (what DownloadString returns) and the target (the variable result) are both C# Unicode strings.
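
To make this concrete, here is a small sketch (hypothetical names `RoundTripDemo`, `RoundTrip`, `DecodeAsUtf8`; the `RegisterProvider` call is an assumption for .NET Core / .NET 5+). It suggests that the `GetBytes`/`GetString` pair leaves a correctly decoded string unchanged, while a string that was already mis-decoded as UTF-8 comes out with `?` in place of the damaged characters.

```csharp
using System;
using System.Text;

// Hypothetical demo class, for illustration only.
static class RoundTripDemo
{
    static RoundTripDemo()
    {
        // Assumption: needed on .NET Core / .NET 5+ for code-page encodings.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Mirrors the GetBytes/GetString pair from the question.
    public static string RoundTrip(string text)
    {
        Encoding cp1250 = Encoding.GetEncoding("windows-1250");
        return cp1250.GetString(cp1250.GetBytes(text));
    }

    // Simulates a download where Windows-1250 bytes are wrongly
    // decoded as UTF-8: invalid sequences become U+FFFD.
    public static string DecodeAsUtf8(byte[] raw)
    {
        return Encoding.UTF8.GetString(raw);
    }
}
```

With this sketch, `RoundTripDemo.RoundTrip("ŠĐ")` returns the same string it was given. The bytes `0x8A 0xD0` (Š and Đ in Windows-1250) are not valid UTF-8, so `DecodeAsUtf8` yields U+FFFD replacement characters, and since U+FFFD has no Windows-1250 representation, passing that result through `RoundTrip` turns them into `?`.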

Source 1: DownloadString

DownloadString could not properly decode the Windows-1250 encoded string, because either the server did not send the correct charset in the Content-Type header, or DownloadString doesn't even support this (unlikely, but I'm not familiar with DownloadString).

Source 2: File.AppendAllText

The string was downloaded correctly, then encoded in memory to Windows-1250, then decoded to a Unicode string again, and everything worked well. But then File.AppendAllText wrote it in a different, default encoding. AppendAllText has an overload with a third parameter that specifies the encoding. You should set this to Windows-1250 to actually write a file in Windows-1250 encoding.
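
A minimal sketch of that overload, assuming a hypothetical wrapper named `Windows1250File.Append` (the `RegisterProvider` call is again an assumption for .NET Core / .NET 5+):

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper class, for illustration only.
static class Windows1250File
{
    static Windows1250File()
    {
        // Assumption: needed on .NET Core / .NET 5+ for code-page encodings.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Appends text to the file, encoding it as Windows-1250 instead of
    // the UTF-8 default used by the two-argument overload.
    public static void Append(string path, string text)
    {
        File.AppendAllText(path, text, Encoding.GetEncoding("windows-1250"));
    }
}
```

Note that this only helps if `text` was decoded correctly in the first place. Characters that were already mangled during download stay mangled.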

Also, make sure that whatever editor you use to open the file uses the same encoding - this is often not very easy to guarantee, so I'd suggest you open it in a "developer-friendly" editor that lets you specify the encoding when opening a text file. (Vim, Emacs, Notepad++, Visual Studio, ...).

Daniel Albuschat
  • The problem was *caused* by the attempt to force *use* Windows-1250. Additional code will only make things worse. Hard-coding the encoding in `AppendAllText` is equivalent to what the OP already did – Panagiotis Kanavos Feb 28 '19 at 15:08
  • Actually it's not the same. `result` is a C# string, which does not itself carry encoding information, so whenever you want to write it in a specific encoding, you have to specify it explicitly. I had the impression that the OP had to handle encoding in an explicit way, after the lines he posted. But you are right - when using DownloadFile and AppendAllText without all the Encoding mumbo jumbo in between, the result should be identical (as long as the file content can be represented in the Encoding he chose) – Daniel Albuschat Feb 28 '19 at 15:48
  • @DanielAlbuschat i tried with 3rd parameter but the situation is the same. as i specified in my reply above my code isn't actually saving to a file, i'm deserializing it but i'm using AppendAllText for debugging purposes. the deserialized values have the same encoding problems which makes me think the DownloadString is the problem – Fakku Feb 28 '19 at 23:59
  • @Fakku You can use Fiddler (https://www.telerik.com/fiddler) to sniff into the request and check whether the Content-Type includes the proper charset. – Daniel Albuschat Mar 01 '19 at 06:55
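
As an alternative to a network sniffer, the declared charset can probably also be checked from code. The sketch below is an assumption-laden illustration, not something from the thread: `CharsetCheck`, `CharsetOf` and `DeclaredCharset` are hypothetical names, and it relies on `System.Net.Mime.ContentType` to parse the header value.

```csharp
using System;
using System.Net;
using System.Net.Mime;

// Hypothetical helper class, for illustration only.
static class CharsetCheck
{
    // Returns the charset parameter of a Content-Type header value,
    // or null when the header is missing, malformed, or has no charset.
    public static string CharsetOf(string contentTypeHeader)
    {
        if (string.IsNullOrWhiteSpace(contentTypeHeader))
        {
            return null;
        }
        try
        {
            return new ContentType(contentTypeHeader).CharSet;
        }
        catch (FormatException)
        {
            return null;
        }
    }

    // Performs a request and reports the charset the server declared.
    public static string DeclaredCharset(string url)
    {
        using (var client = new WebClient())
        {
            // ResponseHeaders is populated once a request has completed.
            client.DownloadData(url);
            return CharsetOf(client.ResponseHeaders[HttpResponseHeader.ContentType]);
        }
    }
}
```

If `DeclaredCharset` returns null for the XML URL, the server is not sending a charset, which would match the behaviour described in this thread.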