Delphi - get encoding for a given file

Question

I have read this question which I thought would give me what I was after:

How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?

Is there another way to get the file encoding in D2006 without using Mozilla's i18n component? I cannot use any other 3rd-party components.

I have read all the answers to the original question, and I cannot use the interface provided because the client doesn't accept the deployment of that DLL:

first answer - https://stackoverflow.com/a/373103/368364 - nothing conclusive.
second answer - http://www.siao2.com/2007/04/22/2239345.aspx - reading the post and its comments gives only a clue.
third answer - How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing? - the user chooses the encoding.

Some of the links provided in the original question are dead, and none address my problem, which is:
How to get the file encoding without using 3rd party components?

@RobKennedy - it is not a duplicate. I mentioned in the question that I had already read that one. The answers there talk about trying this and trying that without solving the problem, and the person who asked resolved it by using the interface provided here: http://sourceforge.net/projects/chsdet/files/ — RBA, Feb 02 '12 at 17:54
Look for BOM; if BOM not found, ask a user to set an encoding. — kludg, Feb 02 '12 at 18:13
If you're unsatisfied with the answers to that question, that doesn't mean it isn't still the same question you're asking. That question asked how to guess the file encoding. You're asking the same thing. The accepted answer there is to use Chardet, but there are other answers, including one telling you to use Notepad's algorithm, followed by a couple of other algorithm descriptions. If you're not going to use a library, and you don't like the built-in API, then the *only* answers you're going to get are algorithm descriptions. How is your question different from the original? — Rob Kennedy, Feb 02 '12 at 18:55
1) first answer - http://stackoverflow.com/a/373103/368364 - nothing conclusive. — RBA, Feb 02 '12 at 20:16
@RobKennedy - notepad algorithm from the link - 'For the record, here is the official, UNDOCUMENTED...'. — RBA, Feb 02 '12 at 20:50
Attacking the quality of the *answers* is not what's going to convince me that this is a different *question*. The answers don't matter. Your question asks the same thing as the other question. As for drawing attention to an old question, I think your initial action was fine: Re-ask the question. If there are new answers, people can add them to the old question. You can also start a bounty on the old question. There are lots of posts on Meta about how to draw attention to an old post. — Rob Kennedy, Feb 02 '12 at 23:06
Agree with Rob, it was a good idea to re-ask the question to draw attention, but it is a duplicate. You would do better to ask for a detailed description of some foolproof algorithm than for code, because if you cannot use a 3rd-party solution then you'll just have to write it on your own. And good file encoding detection code wouldn't fit the post length here at all. It's not as easy as it seems to be. — TLama, Feb 03 '12 at 00:55
He's asking for "another way to get the file encoding, without using Mozilla's i18n component in D2006 [because he] can not use other 3d party components." Seems valid enough to me - he's done his research, unfortunately can't use the answer to the other question, and is asking if there's an alternative. An alternative (different answer) probably warrants a new question, since you can't have two accepted answers on one question. — David, Feb 03 '12 at 09:38
@David M, well, so let's reopen this Q and see if we will find someone who has its own solution and provide it here with the meaning of free for every use license. Moreover if it will be good enough and with unit tests included then I'll offer a bounty for it. — TLama, Feb 03 '12 at 11:33
@RBA: if you want to discuss why a question was closed, open a question on meta.stackoverflow.com, don't deface your question here. — Mat, Feb 03 '12 at 13:07
@Mat - if you delete my LE, no one will understand which is the purpose of this question. So, I believe that the points explained in the LE should remain. In this way, everybody can understand why the other question does not satisfy what I am asking. — RBA, Feb 03 '12 at 14:52
@RBA: the blurb you added read like a rant. Always avoid that. I read it all, and tried to format it so that it shows why you think it's not a dupe, and removed all the irrelevant "noise". Please do it like that next time this happens (if it ever does) - always keep your question strictly on the topic you're after, don't put any "editorializing" in it, certainly not addressing commenters/close-voters in there; all those things become useless if the question is reopened, and tend to make people less likely to actually read and process what you typed (like I did). (Voted to reopen.) — Mat, Feb 03 '12 at 15:04
@Mat - nice work on the edit! Mat and RBA, rants aren't good, but I understand RBA's frustration. SO can be a bit, um... not unfriendly, but non-understanding. (I think it's the 'not a bug, by design' mentality, because one can be right, but completely useless, as perhaps happened when closing this question. It's not a user-oriented POV.) RBA, Mat's edit has *considerably* improved the question and so I'd recommend if you can, to copy his question-writing style. — David, Feb 03 '12 at 16:02
RBA: on topic: how accurate do you need to be? Do you just need to differentiate between, say, UTF8 and UTF16, or ANSI but assume it's the current codepage, or ANSI but you don't know what codepage it was written in, or...? The answer to these questions will define the scope of the code. I can help with a narrow scope (ANSI local, UTF8, UTF16) but not detecting ANSI (unknown codepage), for example. — David, Feb 03 '12 at 16:06
@DavidM - I only need to see if it is UTF8, UTF16 or ANSI. I don't know the code page in which the files were written. So, for this question, if I can say that a file was written in ANSI local, or UTF8 or UTF16, it will be a good start for me. I need to be as accurate as I can be, without spending a month on research for all the possible code pages. — RBA, Feb 03 '12 at 16:24
The easiest way there is to follow @David Heffernan's advice and check for a BOM. This is a few bytes at the beginning of the file that specifies if it's UTF8 etc. If nothing is there, assume it's ANSI. — David, Feb 06 '12 at 13:54

score 4 · Accepted Answer · answered Feb 02 '12 at 17:50

I would look for a BOM first and if one is not found call IsTextUnicode. But beware that no method is foolproof.
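
A minimal sketch of that order of checks, assuming a pre-Unicode Delphi such as D2006 and only the `Windows`, `SysUtils` and `Classes` units. The names `TGuessedEncoding` and `GuessFileEncoding` are hypothetical, the 4 KB sample size is an arbitrary choice, and treating a positive `IsTextUnicode` result as UTF-16 LE is an assumption. A file with no BOM that fails the heuristic is reported as ANSI, which also covers UTF-8 without a BOM.

```delphi
program GuessEncodingDemo;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils, Classes;

type
  // Hypothetical result type, introduced for this sketch only.
  TGuessedEncoding = (geAnsi, geUtf8, geUtf16LE, geUtf16BE);

// Hypothetical helper: guesses the encoding from the first bytes of a file.
function GuessFileEncoding(const FileName: string): TGuessedEncoding;
var
  Stream: TFileStream;
  Buffer: array[0..4095] of Byte; // sample size is an arbitrary choice
  BytesRead: Integer;
begin
  Result := geAnsi; // fallback when no other check matches
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    BytesRead := Stream.Read(Buffer, SizeOf(Buffer));
  finally
    Stream.Free;
  end;

  // Step 1: look for a byte order mark.
  if (BytesRead >= 3) and (Buffer[0] = $EF) and (Buffer[1] = $BB)
    and (Buffer[2] = $BF) then
    Result := geUtf8
  else if (BytesRead >= 2) and (Buffer[0] = $FF) and (Buffer[1] = $FE) then
    Result := geUtf16LE
  else if (BytesRead >= 2) and (Buffer[0] = $FE) and (Buffer[1] = $FF) then
    Result := geUtf16BE
  // Step 2: no BOM found, so fall back to the Windows heuristic.
  // Assumption: a positive answer is treated as UTF-16 LE. The heuristic
  // can be wrong, especially on short inputs.
  else if (BytesRead >= 2) and IsTextUnicode(@Buffer, BytesRead, nil) then
    Result := geUtf16LE;
end;

begin
  if ParamCount = 1 then
  begin
    case GuessFileEncoding(ParamStr(1)) of
      geUtf8:    Writeln('UTF-8 (BOM found)');
      geUtf16LE: Writeln('UTF-16 LE (BOM or heuristic)');
      geUtf16BE: Writeln('UTF-16 BE (BOM found)');
      geAnsi:    Writeln('No BOM, heuristic negative: assuming ANSI');
    end;
  end
  else
    Writeln('Usage: GuessEncodingDemo <file>');
end.
```

Only a sample from the start of the file is inspected, so the guess reflects that sample rather than the whole file.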

David Heffernan


`IsTextUnicode` should be handled with care; see [this](http://en.wikipedia.org/wiki/Bush_hid_the_facts) – kobik Feb 02 '12 at 18:00
@kobik hence my final sentence. This is an inexact science. – David Heffernan Feb 02 '12 at 18:02

score 1 · Answer 2 · answered May 12 '12 at 17:27

Determining the encoding of a file seems to be problematic. It appears that some of the UTF8 files do not have a BOM. This appears to work:

    // Try UTF-8 first; if nothing was loaded, retry and let LoadFromFile pick the encoding.
    InputData.LoadFromFile(f, TEncoding.UTF8);
    if InputData.Count = 0 then
      InputData.LoadFromFile(f);

Is there a better approach? I know this solution isn't very elegant.

bobonwhidbey


Use `TEncoding.GetBufferEncoding()` before calling `LoadFromFile()`, or simply omit the Encoding parameter and let `LoadFromFile()` call `GetBufferEncoding()` internally for you. – Remy Lebeau May 13 '12 at 02:07
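
The suggestion in the comment above could look roughly like the sketch below. It assumes Delphi 2009 or later, since `TEncoding` is not available in D2006, and `DetectBomEncoding` is a hypothetical helper name. `GetBufferEncoding` only inspects the BOM, so a UTF-8 file without one comes back as `TEncoding.Default`.

```delphi
program BufferEncodingDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: returns the encoding implied by the file's BOM,
// or TEncoding.Default when no recognised BOM is present.
function DetectBomEncoding(const FileName: string): TEncoding;
var
  Stream: TFileStream;
  Bytes: TBytes;
  Count: Integer;
begin
  Result := nil; // nil asks GetBufferEncoding to detect the encoding
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Bytes, 4); // a few leading bytes are enough for a BOM check
    Count := Stream.Read(Bytes[0], Length(Bytes));
    SetLength(Bytes, Count); // shrink to what was actually read
  finally
    Stream.Free;
  end;
  // The return value (BOM length) is not needed here.
  TEncoding.GetBufferEncoding(Bytes, Result);
end;

var
  Enc: TEncoding;
begin
  if ParamCount = 1 then
  begin
    Enc := DetectBomEncoding(ParamStr(1));
    if Enc = TEncoding.UTF8 then
      Writeln('UTF-8 BOM')
    else if Enc = TEncoding.Unicode then
      Writeln('UTF-16 LE BOM')
    else if Enc = TEncoding.BigEndianUnicode then
      Writeln('UTF-16 BE BOM')
    else
      Writeln('No UTF-8/UTF-16 BOM: treated as default (ANSI)');
  end
  else
    Writeln('Usage: BufferEncodingDemo <file>');
end.
```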
