
I'm working on a C# project in which some data contains characters that are not recognised by the encoding. They are displayed like this:

"Some text � with special � symbols in it".

I have no control over the encoding process, and the data comes from files of various origins and various formats. I want to be able to flag data that contains such characters as erroneous or incomplete. Right now I am able to detect them this way:

if (myString.Contains("�"))
{
    // Do stuff
}

While it does work, it doesn't feel quite right to use the weird symbol directly in the Contains call. Isn't there a cleaner way to do this?
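For reference, the character in question is U+FFFD, so one arguably cleaner option (a sketch, with an illustrative helper name) is to refer to it by its escape sequence instead of pasting the glyph into the source:

```csharp
using System;

public class ReplacementCheck
{
    // '\uFFFD' is the Unicode replacement character written as an escape,
    // so no raw "weird symbol" needs to appear in the source file.
    public static bool HasReplacementChar(string s) => s.IndexOf('\uFFFD') >= 0;

    static void Main()
    {
        Console.WriteLine(HasReplacementChar("Some text \uFFFD with symbols")); // True
        Console.WriteLine(HasReplacementChar("Clean text"));                    // False
    }
}
```

IndexOf with a char works on all .NET versions; the Contains(char) overload only exists on newer runtimes.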

EDIT:

After checking back with the team responsible for reading the files, this is how they do it:

var sr = new StreamReader(filePath, true);
var content = sr.ReadToEnd();

Passing true as the second parameter of StreamReader is supposed to detect the encoding from the file's BOM and use it to read the content. It doesn't always work, though, as some files don't carry that information, which is why their data is read incorrectly.

We've made some tests, and using StreamReader(filePath, Encoding.Default) instead appears to work for most if not all of the files we had issues with. As expected, files that were working before no longer work, because they do not use the default encoding.

So the best solution for us would be to do the following: read the file trying to detect its encoding, then if it wasn't successful read it again with the default encoding.

The problem remains the same though: how do we check, after trying to detect the file's encoding, if data has been read incorrectly ?
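The two-pass idea described above could be sketched like this (the method name is illustrative, and checking for U+FFFD is used as the "read incorrectly" signal):

```csharp
using System;
using System.IO;
using System.Text;

public class TwoPassReader
{
    // Illustrative sketch: first try BOM detection (falling back to UTF-8),
    // then, if the decoded text contains U+FFFD, re-read with the system
    // default encoding.
    public static string ReadWithFallback(string filePath)
    {
        string content;
        using (var sr = new StreamReader(filePath, Encoding.UTF8,
                                         detectEncodingFromByteOrderMarks: true))
            content = sr.ReadToEnd();

        if (content.IndexOf('\uFFFD') >= 0)
        {
            using (var sr = new StreamReader(filePath, Encoding.Default))
                content = sr.ReadToEnd();
        }
        return content;
    }
}
```

Note the caveat from the comments: a file in the wrong codepage may decode without any U+FFFD yet still contain wrong characters, so this check flags only the detectable failures.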

jjj
Hal
  • That's *not* a special symbol. That's the [Unicode Replacement Character](http://www.fileformat.info/info/unicode/char/fffd/index.htm). This means that you tried to convert ASCII text using the wrong codepage. Any characters that didn't have a match in the codepage were replaced with `�`. That's bad. The data is lost. If you *actually* saw special characters it would mean that the data was mapped to the wrong character. You could recover it by converting from text to bytes using the inappropriate codepage, then from bytes to Unicode using the correct one – Panagiotis Kanavos Feb 09 '17 at 17:00
  • If you know the list of characters you want to allow, or a list of characters that you want to disallow, you could use regex character classes. Using `�` is, indeed, not a very good idea, because it indicates an error at some earlier stage. – Sergey Kalinichenko Feb 09 '17 at 17:05
  • Sometimes the replacement is a simple question mark (`?`) or a square. The result is the same - characters that didn't match were replaced and lost – Panagiotis Kanavos Feb 09 '17 at 17:05
  • What is special for you? Non ASCII? If yes look for example at http://stackoverflow.com/questions/1999566/string-filter-detect-non-ascii-signs or http://stackoverflow.com/questions/1522884/c-sharp-ensure-string-contains-only-ascii – Matthias247 Feb 09 '17 at 17:05
  • @dasblinkenlight this doesn't help here. This isn't a special character. This is the result of an incorrect codepage conversion. If for example you tried to parse `This text αυτό το κείμενο` with a french codepage, you'd probably get the same results – Panagiotis Kanavos Feb 09 '17 at 17:06
  • @Hal where does the data come from? How did you read it? Do you know the codepages of the source files? If not, it may not be possible to load them correctly. It's trivial to pass the locale eg to File.ReadAllText, `File.ReadAllText(file, Encoding.GetEncoding(codePage));`. – Panagiotis Kanavos Feb 09 '17 at 17:09
  • @Hal please post the code you use to read the files, the encoding you tried and the expected codepages. There is no point trying to load a Greek or Romanian file with an English codepage. It's not the file that is erroneous or incomplete, it's the file loading process. You'll have to use the correct encoding for each one. – Panagiotis Kanavos Feb 09 '17 at 17:15
  • Even if you don't know the actual codepage you could try to determine it, eg by trying all encodings and discarding any that returned even a single `�`, multiple `?` characters or squares. The rest are harder though - one would be the correct one and the others would have some wrong replacements. Finding the correct encoding may require visual inspection of the results. If you know the source of the file you may be able to discard results with unexpected characters. Some countries may have multiple codepages, which can make things a bit harder – Panagiotis Kanavos Feb 09 '17 at 17:18
  • Can you add details on anything else you have considered to resolve this problem. Any other approach ? – Versatile Feb 09 '17 at 17:24
  • if you are reading a simple text file have you tried to set the encoding to the OS's default encoding like: `string stringFromFile = File.ReadAllText(fileLocation, ASCIIEncoding.Default);` this should display the character properly. – JohnG Feb 09 '17 at 17:24
  • @JohnG the OP's problem is that there are multiple sources, not just a single file. – Panagiotis Kanavos Feb 09 '17 at 17:26
  • @Panagiotis Kanavos ... you are correct and I did not read the question close enough. Thank you. – JohnG Feb 09 '17 at 17:27
  • @JohnG moreover, the system codepage is `Encoding.Default`. ASCIIEncoding is a specific encoding. This *could* work though, if there is a misunderstanding about the codepages. In a single company, it's quite likely that most computers will use the country's locale except a few developer machines or servers will be set to US for convenience – Panagiotis Kanavos Feb 09 '17 at 17:29
  • @Panagiotis Kanavos so is there a better way to check if a string contains the Unicode Replacement Character ? – Hal Feb 10 '17 at 09:16
  • You can check the first four bytes of a file for a Byte Order Mark. You can look up the bytes involved, or you can get them with `GetPreamble`, eg `Encoding.UTF8.GetPreamble()`, `Encoding.Unicode.GetPreamble()`. If there isn't one, you can try with Encoding.Default. A more convenient option though would be to use different incoming folders for Unicode and ASCII files – Panagiotis Kanavos Feb 10 '17 at 09:32
  • Oh, wait, you can pass the default encoding to StreamReader – Panagiotis Kanavos Feb 10 '17 at 09:35
  • Also see http://stackoverflow.com/questions/28216928/encoding-issue-with-string-stored-in-database/28218691#28218691 for additional explanation – helb Feb 10 '17 at 10:00
  • You could also start with the original problem: You don't know the encoding. That's data loss. Go back to the point where the files are written, if you can. – Tom Blodget Feb 10 '17 at 17:39
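The GetPreamble suggestion from the comments could be sketched as follows (a hedged sketch; the candidate list and helper name are assumptions, and longer BOMs must be checked first because the UTF-32 LE BOM begins with the UTF-16 LE BOM):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public class BomDetector
{
    // Compare the file's first bytes against the BOMs of a few candidate
    // encodings, longest BOM first (FF FE 00 00 starts with FF FE).
    public static Encoding DetectFromBom(string filePath)
    {
        var candidates = new[]
        {
            Encoding.UTF32,            // FF FE 00 00
            Encoding.UTF8,             // EF BB BF
            Encoding.Unicode,          // FF FE (UTF-16 LE)
            Encoding.BigEndianUnicode, // FE FF (UTF-16 BE)
        };

        byte[] head = new byte[4];
        int read;
        using (var fs = File.OpenRead(filePath))
            read = fs.Read(head, 0, head.Length);

        foreach (var enc in candidates.OrderByDescending(e => e.GetPreamble().Length))
        {
            byte[] bom = enc.GetPreamble();
            if (bom.Length > 0 && read >= bom.Length
                && bom.SequenceEqual(head.Take(bom.Length)))
                return enc;
        }
        return null; // no BOM: the caller can fall back to Encoding.Default
    }
}
```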

2 Answers


The � character is not a special symbol. It's the Unicode Replacement Character. This means that the code tried to convert ASCII text using the wrong codepage. Any characters that didn't have a match in the codepage were replaced with �.

The solution is to read the file using the correct encoding. The default encoding used by the File methods and StreamReader is UTF-8. You can pass a different encoding using the appropriate constructor, eg StreamReader(String, Encoding). To use the system locale's codepage, you need to use Encoding.Default:

var sr = new StreamReader(filePath, Encoding.Default);

You can use the StreamReader(String, Encoding, Boolean) constructor to autodetect Unicode encodings from the BOM and fall back to a different encoding otherwise.

Assuming the files are either some type of Unicode or match your system locale, you can use:

var sr = new StreamReader(filePath, Encoding.Default, true);

StreamReader's source shows that the DetectEncoding method checks the first bytes of the file to determine the encoding. If one is found, it is used instead of the supplied encoding. The operation doesn't cause extra IO because the method checks the class's internal buffer.
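A quick way to see this behavior (file path and sample text are illustrative): write a file with a UTF-8 BOM, then read it back passing Encoding.Default as the fallback. Because a BOM is present, StreamReader uses UTF-8 regardless of the supplied encoding.

```csharp
using System;
using System.IO;
using System.Text;

public class BomFallbackDemo
{
    // Writes a UTF-8 file with a BOM, then reads it back with
    // Encoding.Default as the fallback. The BOM is detected, so the
    // text round-trips intact even when Encoding.Default is not UTF-8.
    public static string RoundTrip()
    {
        string path = Path.GetTempFileName();
        // The 'true' argument makes UTF8Encoding emit the BOM.
        File.WriteAllText(path, "héllo wörld", new UTF8Encoding(true));

        string text;
        using (var sr = new StreamReader(path, Encoding.Default, true))
            text = sr.ReadToEnd();

        File.Delete(path);
        return text;
    }

    static void Main()
    {
        Console.WriteLine(RoundTrip() == "héllo wörld"
            ? "decoded correctly" : "data lost"); // prints "decoded correctly"
    }
}
```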

Panagiotis Kanavos
  • I didn't know you could both try to read the BOM and specify an Encoding. Anyway this works for me, thanks a lot. – Hal Feb 10 '17 at 09:50
  • @Hal the StreamReader does what you'd do by hand - reads the first four bytes and checks them against known BOMs. If one matches, it is used. If none match, it falls back to the Encoding you supplied – Panagiotis Kanavos Feb 10 '17 at 09:51
  • Which is exactly what I was looking for. – Hal Feb 10 '17 at 09:54

EDIT

I just realized you can't actually load the raw file into a .NET string and still retain full information about the original bytes, since .NET strings are UTF-16.

The project here uses the MLang API, which does a better job because it guesses the encoding without first loading the file into a .NET string. There is also a related SO question

Chibueze Opata
  • The � character *is* the indicator of a bad match. No further check needed. Besides - UTF8 is the default. The OP is asking for the ASCII files, which can be read simply by using the correct encoding – Panagiotis Kanavos Feb 10 '17 at 09:31
  • @PanagiotisKanavos True. I was suggesting that he reads in the text raw and try to check the characters manually. The article I shared throws more light on this. – Chibueze Opata Feb 10 '17 at 09:35
  • The characters are *not* invalid. It's the conversion that causes the problem – Panagiotis Kanavos Feb 10 '17 at 09:38
  • Reading the raw file means reading the bytes. You can do it with `File.ReadAllBytes`. There is no need for MLang. And [the linked SO question](http://stackoverflow.com/questions/661717/c-cycle-through-encodings/673308#673308) inside the linked SO question shows just that – Panagiotis Kanavos Feb 10 '17 at 09:45
  • You seem bent on picking issues with my answer. I said clearly *.NET string*. A .NET string is already implemented in UTF-16 so it's impossible to load the data into a string without conversion. The bytes is what makes those answers possible. – Chibueze Opata Feb 10 '17 at 09:47
  • I'm not picking issues, there are serious problems and misconceptions. A file is a bunch of bytes. You can get the bytes. Strings in Windows are Unicode, so all Windows programs since 1995 have to do *something* with those bytes. You can easily check the BOM bytes yourself, or have StreamReader do it. The system locale's encoding is actually the fallback for non-Unicode programs – Panagiotis Kanavos Feb 10 '17 at 09:48
  • Directly from MSDN: "Represents text as a sequence of UTF-16 code units." I also don't know where you got the idea that I don't know a file is a bunch of bytes. – Chibueze Opata Feb 10 '17 at 09:55