I've seen in other questions (like "How can I detect the encoding/codepage of a text file") that it is impossible to reliably identify a file's encoding.

I've also found a function that identifies the encoding ("Determine a string's encoding in C#"), but it's heuristic and depends on the file contents.

I wondered whether including a certain known string at the beginning of each file would make it possible to uniquely identify the file's encoding, since different encodings map the same characters to different bytes.

If it's not possible for all encodings, it would be good enough to identify the most common ones (http://w3techs.com/technologies/overview/character_encoding/all).
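To make the idea concrete, here is a minimal sketch in Python (not the C# function linked above). The marker `é€`, the candidate list, and the helper names `build_signatures` and `detect_by_marker` are all assumptions chosen for illustration. The sketch encodes the marker in every candidate encoding and compares the results with the first bytes of the file. It also shows the main limitation: a marker made only of ASCII characters produces identical bytes in UTF-8, Windows-1252 and most other single-byte encodings, so it cannot tell them apart.

```python
import codecs

# Hypothetical marker: non-ASCII characters whose bytes differ between the candidates.
MARKER = "é€"

# Assumed shortlist of encodings to distinguish; extend as needed.
CANDIDATES = ["utf-8", "utf-16-le", "utf-16-be", "cp1252", "iso-8859-15"]

# BOMs that an editor may write before the marker.
BOMS = [codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE]


def build_signatures(marker, candidates):
    """Map each distinct byte sequence of the marker to the encodings producing it."""
    signatures = {}
    for name in candidates:
        try:
            encoded = marker.encode(name)
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the marker at all
        signatures.setdefault(encoded, []).append(name)
    return signatures


def detect_by_marker(data, marker=MARKER, candidates=CANDIDATES):
    """Return every candidate encoding whose encoded marker starts the data."""
    for bom in BOMS:
        if data.startswith(bom):
            data = data[len(bom):]  # skip the BOM so the marker is at offset 0
            break
    matches = []
    for encoded, names in build_signatures(marker, candidates).items():
        if data.startswith(encoded):
            matches.extend(names)
    return matches


if __name__ == "__main__":
    print(detect_by_marker("é€ some text".encode("cp1252")))
    # An ASCII-only marker matches several encodings at once, so it is ambiguous.
    print(detect_by_marker(b"MARK some text", marker="MARK",
                           candidates=["utf-8", "cp1252"]))
```

An empty result means the marker was missing or the file uses an encoding outside the shortlist; more than one result means the marker does not separate those encodings.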

galmeida
  • You'd have to take the [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) into consideration for Unicode text. What problem are you trying to solve? – stuartd Mar 30 '16 at 14:22
  • @stuartd The problem is to write code that is able to handle any encoding, as said. Yes, I guess the BOM should be addressed, but its detection could be incorporated into the code, right? That's the first thing the function in http://stackoverflow.com/questions/1025332/determine-a-strings-encoding-in-c-sharp does. – galmeida Mar 30 '16 at 14:33
  • Are you going to deal with raw bytes every time? Because you could simply include the encoding information with the data, and then you don't need to "determine" anything next time. E.g. a single byte (or a proper header) before the string bytes: 0 - unknown (must be determined, or ask the user), 1 - utf8, 2 - unicode, etc. – Sinatr Mar 30 '16 at 15:17
  • @Sinatr The idea is to handle data from layman users working in arbitrary text editors. I don't expect them to know what encoding they are using. They could simply put the string at the beginning of the file, which is a simple instruction to give them. – galmeida Mar 30 '16 at 15:49
  • Maybe you can look into the source of one of those arbitrary text editors to see how it deals with encodings. I use Far Manager, and it often fails to pick the proper code page (code page = encoding?) for various old text files that I am too lazy to convert to a single encoding. In any case, you can give users either requirements (so they must use encodings your software can detect and use) or options (so they can configure the encoding in your software). – Sinatr Mar 30 '16 at 15:55
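
The two mechanisms raised in the comments, checking for a BOM and prefixing the data with a tag byte, can be sketched roughly as follows in Python. The names `sniff_bom`, `pack` and `unpack` are hypothetical. The tag values follow the comment's example, and reading "unicode" as UTF-16 LE is an assumption based on the usual .NET meaning of the term.

```python
import codecs

# UTF-32 LE must be tested before UTF-16 LE: its BOM begins with the same two bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Tag 0 means "unknown" in the comment's scheme, so it has no entry here.
TAG_TO_ENCODING = {1: "utf-8", 2: "utf-16-le"}
ENCODING_TO_TAG = {name: tag for tag, name in TAG_TO_ENCODING.items()}


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def pack(text, encoding):
    """Prefix the encoded text with a one-byte tag naming its encoding."""
    return bytes([ENCODING_TO_TAG[encoding]]) + text.encode(encoding)


def unpack(data):
    """Read the tag byte and decode the remaining bytes accordingly."""
    encoding = TAG_TO_ENCODING.get(data[0])
    if encoding is None:
        # Tag 0 or an unrecognised tag: fall back to detection or ask the user.
        raise ValueError("unknown encoding tag: %d" % data[0])
    return data[1:].decode(encoding)


if __name__ == "__main__":
    print(sniff_bom(codecs.BOM_UTF8 + b"abc"))
    print(unpack(pack("héllo", "utf-16-le")))
```

The tag byte only helps when the program itself writes the files. For files typed by users in an arbitrary editor, a BOM is the only one of the two that may already be there, and many editors do not write one.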

0 Answers