1

My question is simple: are strings in .NET encoding-agnostic?

I ask because when I ingest an XML file that I know contains some windows-1252 characters (e.g., smart quotes), the debugger shows the string holding my XML with each single smart quote rendered as a triangle with a question mark in it. This makes me wonder whether .NET assumes that the string holding my XML is UTF-8 and therefore cannot resolve the character.

If so, this is a problem: once the string has been converted, my web service that is meant to scrub the Windows smart quotes from my text will fail, because it doesn't recognize the triangle/question-mark thingy.

Please help.

Isaiah Nelson
  • Are the smart quotes string data or are they delimiting attribute values (e.g., ``)? If the latter, then your XML is not compliant; you'll need to replace the characters with straight quotes. – phoog Jan 20 '12 at 21:35
  • Furthermore, this question doesn't really solve your problem - what you really want to do is simply change the character encoding that the library is using for your .xml file. Have a look at this: http://stackoverflow.com/questions/961699/how-to-change-character-encoding-of-xmlreader – Michael Ratanapintha Jan 20 '12 at 21:37
  • Some helpful reading: http://stackoverflow.com/questions/1025332/determine-a-strings-encoding-in-c-sharp and http://www.yoda.arachsys.com/csharp/strings.html – Matt Smith Jan 20 '12 at 21:37

5 Answers

5

Strings are always UTF-16. Any incoming or outgoing data must be converted to/from that encoding.
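
A minimal sketch of that conversion step, written in Python only because the idea is language-neutral; it is not the .NET API. It assumes the input contains the byte 0x93, which is the left double smart quote in windows-1252. The variable names are illustrative.

```python
# One byte as it might appear in a windows-1252 file:
# 0x93 is the left "smart" double quote in that code page.
raw = b"\x93"

# Decoding with the matching code page yields the intended character, U+201C.
as_cp1252 = raw.decode("cp1252")

# Decoding the same byte as UTF-8 cannot succeed: 0x93 is a continuation
# byte and is invalid on its own. With errors="replace" the decoder
# substitutes U+FFFD, the replacement character, which many fonts draw
# as a question mark inside a diamond.
as_utf8 = raw.decode("utf-8", errors="replace")

print(hex(ord(as_cp1252)))
print(hex(ord(as_utf8)))
```

Once the replacement character is in the string, the original byte is gone, so the decoding has to be right at the point where the bytes are read.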

If you use a proper XML reading library, it will most likely handle it for you, as long as the XML has the appropriate XML prolog (but Windows-1252 support is not required for compliance with the XML specification).
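
As a rough illustration of the difference the prolog makes, the sketch below uses Python's standard `xml.etree.ElementTree` parser rather than a .NET one, so the exact failure mode is an assumption that may differ from what LINQ to XML does. The sample document is made up.

```python
import xml.etree.ElementTree as ET

# Element content with smart quotes stored as windows-1252 bytes (0x93, 0x94).
body = b"<note>\x93quoted\x94</note>"

# With an encoding declaration, the parser decodes the bytes accordingly.
declared = b'<?xml version="1.0" encoding="windows-1252"?>' + body
text_with_declaration = ET.fromstring(declared).text

# Without a declaration, the parser assumes UTF-8 and rejects the stray bytes.
try:
    ET.fromstring(body)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False

print(repr(text_with_declaration), parsed_without_declaration)
```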

Michael Madsen
  • 1
    For additional documentation see System.Char http://msdn.microsoft.com/en-us/library/system.char.aspx – Paul Keister Jan 20 '12 at 21:36
  • @michaelmadsen I am using linq to xml to extract the items I am looking for into a collection of strings essentially. Hence why I am concerned about the automatic conversion. – Isaiah Nelson Jan 20 '12 at 22:45
  • @michael madsen what is a xml reading library? I am using linq to xml is that what you are talking about? – Isaiah Nelson Jan 23 '12 at 18:39
  • @fullNelson: An XML reading library is - well, a library for reading XML. If you're using LINQ to XML, I'd expect it to handle encoding conversions for you, assuming the XML specifies it correctly. – Michael Madsen Jan 23 '12 at 19:47
  • @michael madsen The XML my clients are delivering does not have an encoding declaration. I am fairly certain they are using windows 1252 codepage, but it may be that they are also copying and pasting their xml together. Since this is the case then I am not sure there is much I can do to detect their encodings. – Isaiah Nelson Jan 23 '12 at 19:51
  • @fullNelson: If you *know* the encoding, you can read the file as plain text and do the conversion manually - but you *really should* get the source to do things right; whatever that may take - most likely, the XML is being generated by some code, so that code should be fixed. This is going to be an issue every time someone has to use these XML files, because no conforming XML reader can be expected to understand them. – Michael Madsen Jan 23 '12 at 19:57
  • @michael madsen Can you elaborate on what you mean by "doing the conversion manually?". I will still need to read the xml with linq ultimately but if I have to "pre-process" the file as you say then I definitely will, just let me know what you have in mind. – Isaiah Nelson Jan 23 '12 at 20:04
  • @fullNelson: Load the file as plain text, manipulate it as appropriate, and then load the manipulated XML (no idea if LINQ to XML can load from a string or if you need to write it to a file). I can't be more specific than that, since I don't know what the exact issue with your XML is. – Michael Madsen Jan 23 '12 at 22:04
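
The "load as plain text, manipulate, then parse" approach from the comments above could look roughly like this. It is a Python sketch, not LINQ to XML, and it assumes the input really is windows-1252; `SMART_TO_PLAIN` and `load_cp1252_xml` are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from curly quotes to their plain ASCII equivalents.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
})

def load_cp1252_xml(raw: bytes) -> ET.Element:
    """Decode bytes as windows-1252, scrub smart quotes, then parse the text."""
    text = raw.decode("cp1252")            # bytes -> string, explicit encoding
    text = text.translate(SMART_TO_PLAIN)  # manipulate the text before parsing
    return ET.fromstring(text)             # parse from the string, not the file

root = load_cp1252_xml(b"<note>\x93it\x92s fine\x94</note>")
print(root.text)
```

Scrubbing before parsing is only safe here because the quotes sit in element text; a double quote substituted inside a double-quoted attribute value would break the markup, so in that case the replacement is better done on the parsed values.
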
2

.NET uses UTF-16 for all strings in memory (surrogate pairs are used where needed).

When loading a text file, it either defaults to interpreting the file as UTF-8 or uses whatever encoding you tell it to use.

Since you don't show any source code, I can only speculate about how you read/load the XML and whether the XML has the proper charset in its prolog. Depending on the method, .NET will default to UTF-8 and represent the result as UTF-16 in memory.

Please provide more details if the above didn't help...

Yahia
  • I am reading the xml with linq and then passing it to a string. Our clients are not using any indication of what encoding they are using but they are using a windows code sheet of some type (1252 or 1254). – Isaiah Nelson Jan 20 '12 at 22:50
  • @fullNelson If the XML doesn't declare its encoding, it will be interpreted as UTF-8. If it is something else, then the XML is **non-conforming** and thus any standards-compliant XML reader won't read it correctly... the XML standard says that the XML MUST contain an indication of the used encoding/charset (UTF-8 as default being an exception)! – Yahia Jan 21 '12 at 07:27
  • @-yahia My client isn't sending the xml with an encoding declaration. But since I know what they are trying to send, can I prepend the xml with a declaration to fix this issue? – Isaiah Nelson Jan 23 '12 at 18:20
  • @fullNelson yes, that is a bit hacky but by treating the file as a binary file and rewriting it to include the encoding declaration it should fix it... – Yahia Jan 23 '12 at 18:29
  • @-yahia Can you explain what you mean by treating the file as binary. Are you saying serialize to binary before it is read by the xml parser? I am using Linq to XML by the way. Let me know if that changes anything. – Isaiah Nelson Jan 23 '12 at 18:38
  • 1
    @fullNelson open it via FileStream and don't use any of text-specific methods (like ReadLine etc.). just read it as a byte stream, make the needed modifications to the header and save the file with a new name... then you just use that new file with your Linq2XML. – Yahia Jan 23 '12 at 18:52
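
The byte-level rewrite described in the comments above could be sketched as follows. This is Python rather than a .NET `FileStream`, and it assumes the sender really uses windows-1252; `add_declaration` and `rewrite_file` are hypothetical helpers, and byte order marks are not handled.

```python
from pathlib import Path

# The declaration is pure ASCII, so it can be written as raw bytes.
DECLARATION = b'<?xml version="1.0" encoding="windows-1252"?>\n'

def add_declaration(raw: bytes) -> bytes:
    """Prepend an encoding declaration unless the document already has one."""
    if raw.lstrip().startswith(b"<?xml"):
        return raw
    return DECLARATION + raw

def rewrite_file(src: Path, dst: Path) -> None:
    """Read the source as bytes (no text decoding) and save a patched copy."""
    dst.write_bytes(add_declaration(src.read_bytes()))

patched = add_declaration(b"<note>\x93hi\x94</note>")
print(patched)
```

The payload bytes are never decoded or altered; only the declaration is added, and the XML reader does the decoding afterwards.
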
1

No, strings in .NET are stored as sequences of 16-bit UTF-16 code units. Code points that do not fit in 16 bits are represented by surrogate pairs.

Do not confuse this in-memory representation with the storage representation, which depends entirely on the chosen encoding scheme.
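
To make the distinction concrete, here is a small Python sketch (illustrative only, not .NET code) that stores the same one-character string under three encodings. The choice of encodings is just an example.

```python
# One character in memory: U+201C, the left double quotation mark.
quote = "\u201c"

# The same character has a different byte representation in each encoding.
stored = {name: quote.encode(name) for name in ("cp1252", "utf-8", "utf-16-le")}

for name, data in stored.items():
    print(name, data.hex())
```

The string itself never changes; only the bytes written to or read from storage do.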

leppie
1

The string class is (mostly) encoding-agnostic. Your error comes from the process of decoding bytes into a string, which is not working for your input. You need to tell the decoder to use your specific encoding.

Why are strings only mostly agnostic? Because they encode Unicode characters as sequences of 16-bit values. A 16-bit value has only 64k possible values, while a Unicode character can have about 1 million different values, so an encoding step is needed there as well. This is done with surrogates. The string class is essentially UTF-16.
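
The surrogate arithmetic can be sketched as below. The formula is the standard UTF-16 one; the function name `to_surrogates` is made up for this example, and Python is used purely for illustration.

```python
from typing import Tuple

def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a high and a low surrogate."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)

# Cross-check against the built-in UTF-16 encoder (big-endian, no BOM).
encoded = "\U0001F600".encode("utf-16-be")
print(hex(high), hex(low), encoded.hex())
```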

usr
  • Since the windows-1252 code list states that its double smart quote is 132, this won't map to utf8 at all then because utf8's codepoint 132 isn't the same. So changing it to UTF16 still doesn't solve the problem. – Isaiah Nelson Jan 20 '12 at 22:53
  • I did not suggest that. I suggested to configure your XML reading library to use the windows-1252 encoding. It will then use this encoding to decode the input bytes to .NET strings (which are UTF-16). I repeat: your problem has nothing to do with the string class. – usr Jan 21 '12 at 10:53
  • Where can I find more information on the xml reading library. I have never heard of it. – Isaiah Nelson Jan 23 '12 at 15:21
  • I assumed you were using one because you were talking about XML. I must have misunderstood you. – usr Jan 23 '12 at 19:25
0

No. From MSDN:

A string is a sequential collection of Unicode characters . . .

Wyatt Barnett
  • That is incorrect. There is nothing sequential with the presence of surrogate characters. – leppie Jan 20 '12 at 21:36
  • @leppie the presence of surrogates doesn't stop a string from being a sequential collection of Unicode characters. It means that some of the Unicode characters are encoded by more than one `char`, but the string is still sequential. – phoog Jan 20 '12 at 21:42
  • @leppie -- moreover, take the quibble up with microsoft, that is a direct quote from the documentation . . – Wyatt Barnett Jan 20 '12 at 21:44
  • @WyattBarnett: MSDN is notoriously wrong, especially when you quote the .NET 1.1 documentation. – leppie Jan 20 '12 at 21:51
  • @phoog: It is not sequential, as that would imply string indexing is an O(1) operation (for any string). – leppie Jan 20 '12 at 21:53
  • @leppie : that line does *not* change between .NET versions. String didn't much change between .NET versions either to boot. – Wyatt Barnett Jan 20 '12 at 21:55
  • @WyattBarnett: Still does not mean the documentation is strictly correct compared to what the code does. – leppie Jan 20 '12 at 21:57
  • @leppie Sequential and (efficiently) indexable are not the same. Consider the linked list, which is an ordered sequence but not O(1) indexable. Besides, string indexing *is* O(1), although the indexer returns a `char`, not a Unicode character. – phoog Jan 20 '12 at 21:58
  • @phoog: Ordinal indexing, yes, but only then. Try writing some multi-cultural code, and you will find ordinal indexing is useless in most cases. E.g., it cannot be used for any string casing/comparison operations. – leppie Jan 20 '12 at 22:01
  • @leppie of course, but that doesn't stop the string from being sequential. – phoog Jan 20 '12 at 22:05
  • @phoog: As sequential as a linked list, sure :) – leppie Jan 20 '12 at 22:06
  • @phoog: The fact is that the MSDN description is vague and does not really describe how Unicode codepoints are represented. – leppie Jan 20 '12 at 22:08
  • @leppie Agreed 100%. I wouldn't be surprised if nobody on my team knows the first thing about surrogates. – phoog Jan 20 '12 at 22:12
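
The distinction argued over in these comments, between indexing 16-bit code units and indexing Unicode characters, can be sketched as follows. Python strings index by code point, so the example builds the UTF-16 code unit list explicitly to mimic what a 16-bit `char` indexer would see; `utf16_units` is a hypothetical helper.

```python
import struct

def utf16_units(text: str) -> list:
    """Return the 16-bit UTF-16 code units of a string as integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

text = "a\U0001F600b"          # three code points, one outside the 16-bit range
units = utf16_units(text)

print(len(text), len(units))   # code points vs. code units
print(hex(units[1]))           # a high surrogate, not a whole character
```

Indexing the code unit list is constant-time, but position 1 holds only half of the second character, which is the sense in which the indexer returns a `char` rather than a Unicode character.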