How to find whether the stream has an unicode

Question

I have a file named "Connecticut is now 2 °C.txt". The file name contains a unicode, but the file contents are just normal characters. Previously, the code checked whether the file name had a unicode and, if so, wrote the file header with the unicode details. This implementation leads to a conflict in the output file. Can anyone suggest how to find whether the file stream itself has a unicode in it?

Thanks in advance,

Lokesh.

By the way, "a unicode" is not "a unicode character" - and even the last is not what you want. I think you want something like "a character outside of my usual character set", with some definition of "usual character set" (like ASCII, Latin-1 (ISO-8859-1), Windows-1252). — Paŭlo Ebermann, Mar 28 '11 at 07:28
"Having a unicode" is a misnomer, Unicode is an encoding, either the filename is encoded as unicode or it's not, but looking at it is no way to determine, it's how it is encoded for storage that matters. As @Paulo mentioned, a better definition is "containing a character I don't like". — Lasse V. Karlsen, Mar 28 '11 at 10:04
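
To make the distinction in these comments concrete, here is a minimal Python sketch that assumes "usual character set" means 7-bit ASCII. The helper names `has_non_ascii` and `bytes_are_ascii` are made up for illustration. It checks the file name and the file contents separately, since the question describes a name with a degree sign but plain contents.

```python
def has_non_ascii(text: str) -> bool:
    """Return True if any character lies outside the 7-bit ASCII range."""
    return any(ord(ch) > 127 for ch in text)


def bytes_are_ascii(data: bytes) -> bool:
    """Return True if every byte is below 0x80, i.e. the data is plain ASCII."""
    return all(b < 0x80 for b in data)


# Sample values modelled on the question; the contents are an assumption.
file_name = "Connecticut is now 2 \u00b0C.txt"  # U+00B0 is the degree sign
contents = b"just normal characters"

# Decide about the name and the contents independently of each other.
name_has_non_ascii = has_non_ascii(file_name)
contents_have_non_ascii = not bytes_are_ascii(contents)
```

With a different "usual character set" (Latin-1, Windows-1252), the threshold test would have to be replaced by an attempt to encode or decode with that character set.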

David Heffernan · Answer 1 · 2011-03-28T10:00:43.030

2

By far the simplest strategy is to decide on an encoding for a particular file, e.g. UTF-8, and use it exclusively, both when you write it and then when you read it. Trying to detect what encoding is in use is decidedly error prone so it's best not to have to do this detection.
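
As a rough illustration of "decide on an encoding and use it exclusively", here is a minimal Python sketch. The constant `ENCODING`, the helpers `write_text` and `read_text`, and the file name `example.txt` are all hypothetical; the only point is that the writer and the reader share one explicitly named encoding, so nothing has to be detected.

```python
import os
import tempfile

# Assumption: UTF-8 is the single encoding agreed on for this file.
ENCODING = "utf-8"


def write_text(path: str, text: str) -> None:
    """Write text using the agreed encoding, never the platform default."""
    with open(path, "w", encoding=ENCODING, newline="") as f:
        f.write(text)


def read_text(path: str) -> str:
    """Read text back using the same agreed encoding."""
    with open(path, "r", encoding=ENCODING, newline="") as f:
        return f.read()


# Round trip through a temporary file (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "example.txt")
write_text(path, "Connecticut is now 2 \u00b0C")
round_tripped = read_text(path)
```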

UPDATE

In the comments below you clarify that you wish to write to a file that is created by somebody else with an unknown encoding.

In full generality this is impossible to do with 100% reliability.

If you are lucky, the file comes with a Byte Order Mark (BOM), in which case you can read the BOM and infer the encoding from it. However, a text file is not required to contain a BOM, and many don't.
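
Reading the BOM could look something like the following Python sketch. The function name `sniff_bom` and the returned encoding labels are illustrative choices, not part of any particular API; the BOM constants come from Python's standard `codecs` module. A result of `None` only means "no BOM found", not "not Unicode".

```python
import codecs

# Order matters: the UTF-32 LE BOM begins with the same two bytes as the
# UTF-16 LE BOM, so the longer UTF-32 marks must be tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

For a file, it is enough to pass the first four bytes, e.g. `sniff_bom(open(path, "rb").read(4))`. Even with a BOM present this is a heuristic: a UTF-16 LE file whose first character is U+0000 would be reported as UTF-32 LE.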

However, I would urge you to agree an interchange format with whoever is creating these files. Pick a single encoding and always use it.

edited Mar 28 '11 at 10:00

answered Mar 28 '11 at 07:24

David Heffernan


Is there any simple API to find this out, or do we need to write our own way to identify it? – Lokesh Mar 28 '11 at 08:40
@user What are you trying to do? Where do the files come from? Are you writing these files? I assumed that you were. – David Heffernan Mar 28 '11 at 08:47
@David Yes, I am writing some random values to the files, but not unicode characters. As I mentioned previously, the file name has unicode, and our code base checks for unicode in the file name and, if it finds any, sets unicode for writing the file content. Can you suggest how to identify whether the file has unicode or not? – Lokesh Mar 28 '11 at 09:48
@user Since you are writing the file you can decide which encoding to use. You don't need to detect it. Which encoding are you using? UTF-8? UTF-16LE? UTF-16BE? UTF-32LE? UTF-32BE? – David Heffernan Mar 28 '11 at 09:50
I would like to rephrase the query: is it possible to identify which encoding the file was written in? For the testing phase I am writing some random values, but in practice our client will pass the file. I hope you can understand the actual requirement. – Lokesh Mar 28 '11 at 09:54
@user I've added an update to the answer. But please tell your client to use a well-defined, specified encoding to create the file. If you don't do this you'll end up with all sorts of problems down the line. – David Heffernan Mar 28 '11 at 10:01

score 0 · Answer 2 · edited May 23 '17 at 11:47

0

I think this link would be helpful for you. Pay attention to the IsTextUnicode function.


answered Mar 28 '11 at 07:13

Anton Semenov

