
My non-Unicode Delphi 7 application allows users to open .txt files.

Sometimes users try to open UTF-8 or UTF-16 ("Unicode") .txt files, which causes problems.

I need a function that detects whether the user is opening a .txt file with UTF-8 or UTF-16 encoding and, when possible, converts it automatically to the system's default (ANSI) code page so that the app can use it.

In cases where conversion is not possible, the function should return an error.

The ReturnAsAnsiText(filename) function should open the .txt file and perform detection and conversion in steps like this:

  • If the byte stream has no byte values over $7F, it is plain ASCII; return it as is
  • If the byte stream has byte values over $7F, convert from UTF-8
  • If the stream has a BOM, try a Unicode (UTF-16) conversion
  • If conversion to the system's current code page is not possible, return NULL to indicate an error.

It is an acceptable limitation for this function that users can only open files matching their region/code page (the Control Panel "language for non-Unicode programs" regional setting).
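For reference, the steps above could be sketched in Delphi 7 roughly like this. This is an untested sketch, not a drop-in solution: failure is signalled through a `Success` flag (Delphi strings cannot be NULL), only the UTF-16 LE BOM is handled, and the two flag constants are declared locally in case the Delphi 7 Windows unit does not define them:

```pascal
uses Windows, SysUtils, Classes;

const
  // declared here in case the Delphi 7 Windows unit lacks them
  MB_ERR_INVALID_CHARS = 8;     // requires Windows XP+ with CP_UTF8
  WC_NO_BEST_FIT_CHARS = $400;

// Hypothetical sketch of the requested ReturnAsAnsiText function.
function ReturnAsAnsiText(const FileName: string; out Success: Boolean): string;
var
  Bytes: string;                // raw file contents (string = AnsiString in D7)
  Stream: TFileStream;
  I, WideLen, AnsiLen: Integer;
  HasHighByte: Boolean;
  Wide: WideString;
  UsedDefault: BOOL;
begin
  Result := '';
  Success := False;

  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Bytes, Stream.Size);
    if Stream.Size > 0 then
      Stream.ReadBuffer(Bytes[1], Stream.Size);
  finally
    Stream.Free;
  end;

  if (Length(Bytes) >= 2) and (Bytes[1] = #$FF) and (Bytes[2] = #$FE) then
  begin
    // UTF-16 LE BOM ("Unicode" in old Notepad): copy payload to a WideString.
    // A BE BOM (FE FF) would need a byte swap and is not handled here.
    WideLen := (Length(Bytes) - 2) div 2;
    SetLength(Wide, WideLen);
    if WideLen > 0 then
      Move(Bytes[3], Wide[1], WideLen * 2);
  end
  else
  begin
    // UTF-8 BOM: strip it, then treat the rest as UTF-8
    if Copy(Bytes, 1, 3) = #$EF#$BB#$BF then
      Delete(Bytes, 1, 3);

    HasHighByte := False;
    for I := 1 to Length(Bytes) do
      if Bytes[I] > #$7F then
      begin
        HasHighByte := True;
        Break;
      end;

    if not HasHighByte then
    begin
      Result := Bytes;          // pure 7-bit ASCII is valid in any ANSI code page
      Success := True;
      Exit;
    end;

    // High bytes present: decode as UTF-8, rejecting invalid sequences
    WideLen := MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
      PChar(Bytes), Length(Bytes), nil, 0);
    if WideLen = 0 then
      Exit;                     // not valid UTF-8: report failure
    SetLength(Wide, WideLen);
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
      PChar(Bytes), Length(Bytes), PWideChar(Wide), WideLen);
  end;

  // UTF-16 -> current ANSI code page; fail if any character has no exact mapping
  UsedDefault := False;
  AnsiLen := WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS,
    PWideChar(Wide), Length(Wide), nil, 0, nil, @UsedDefault);
  if (AnsiLen = 0) and (Length(Wide) > 0) then
    Exit;
  SetLength(Result, AnsiLen);
  WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS,
    PWideChar(Wide), Length(Wide), PChar(Result), AnsiLen, nil, @UsedDefault);
  Success := not UsedDefault;
  if not Success then
    Result := '';
end;
```

Note that, as the answers below point out, the "no high bytes means ANSI" step really only proves ASCII, and a BOM-less UTF-16 file would slip through this logic.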

Tom
  • Take a look at this: https://stackoverflow.com/questions/16240354/tstringlist-behavior-with-non-ansi-files/16240658?r=SearchResults&s=2%7C44.5336#16240658 – Tom Brunberg Dec 09 '22 at 08:29
  • Or this: https://stackoverflow.com/a/7508127/2292722 – Tom Brunberg Dec 09 '22 at 08:48
  • Delphi 7 can perfectly compile programs which support Unicode everywhere - just use the old [TntWare Unicode Controls](https://github.com/rofl0r/TntUnicode) and prefer `Widestring` everywhere. There's no point in truncating user input down to ANSI. – AmigoJack Dec 09 '22 at 10:12
  • If the file contains Unicode, it is not possible to convert it to ANSI if any non-ANSI characters are in it. – mjn Dec 09 '22 at 14:24
  • I remember I used something called i18n ported to Delphi, but I can't find the original component I used back then. I only found a reference in mORMot: https://github.com/synopse/mORMot/blob/master/SQLite3/mORMoti18n.pas. To use that you have to investigate how mORMot does it. As far as I remember, it scanned the string and decided, based on the bits, which local language had the highest probability. If your files have BOM headers, Tom Brunberg's link should be the first part of a solving strategy. – Quelltextknecht Dec 10 '22 at 13:20
  • Thanks for all the comments; it seems the default encoding for Notepad is UTF-8 nowadays, so detecting/converting that should be enough. About converting the whole app to Unicode: don't fix it if it's not broken ;) – Tom Dec 12 '22 at 06:55
  • Your program **is surely** broken as soon as it encounters filenames/paths that contain more than ASCII characters. Also "_nowadays_" implies your program's users always use an up-to-date OS version **and** they always use Notepad **and** they always create new files - I think this is way too optimistic. – AmigoJack Dec 14 '22 at 12:04
  • During the last 20 years, I have not received complaints about non-ASCII filename issues from our millions of users. However, sometimes users do try to import .txt files that are not plain ANSI text, which is why I want to auto-detect UTF-8/BOM and handle it with a warning message or auto-conversion. The current workaround for these users is to Save As from Notepad with "ANSI" encoding. – Tom Dec 16 '22 at 09:20
  • Even if it's super easy to report something to you, it doesn't mean everybody will do it and everything will be reported. I wouldn't report it either and would just be disappointed by the program, working around the issue via 8.3 filenames. Also "_no byte values over x7F_" is ASCII, not ANSI. Good luck with distinguishing [Windows-1252](https://en.wikipedia.org/wiki/Windows-1252) from [Windows-1251](https://en.wikipedia.org/wiki/Windows-1251). – AmigoJack Dec 19 '22 at 15:57
  • Thanks, it's a good enough solution to detect which code page the user currently has and assume it applies to the file they just saved with Notepad, for example https://stackoverflow.com/questions/909913/how-can-i-programmatically-determine-the-current-default-codepage-of-windows – Tom Dec 21 '22 at 06:47

2 Answers


The conversion function ReturnAsAnsiText, as you designed it, will have a number of issues:

  • The Delphi 7 application may not be able to open files whose filenames contain characters outside the ANSI code page (Windows stores filenames as UTF-16).
  • UTF-8 (and other Unicode) usage has increased significantly since 2019. Current web pages are between 98% and 100% UTF-8, depending on the language.
  • Your design will incorrectly translate some text that a standards-compliant converter would handle.

Creating ReturnAsAnsiText is beyond the scope of an answer, but you should look at finding a library you can use instead of writing a new function. I haven't used Delphi 7 myself, but I found an MIT-licensed library, [UniConv](https://github.com/d-mozulyov/UniConv), that may get you there. It has a number of caveats:

  • It doesn't support all forms of BOM.
  • It doesn't support all encodings.
  • There is no universal "best-fit" behavior for single-byte character sets.

There are other issues that are tangentially related to this question. You wouldn't use an external command in your program, but I used one here to demonstrate the point:

% iconv -f utf-8 -t ascii//TRANSLIT < hello.utf8
^h'elloe
iconv: (stdin):1:6: cannot convert
% iconv -f utf-8 -t ascii < hello.utf8
iconv: (stdin):1:0: cannot convert

Enabling TRANSLIT in standards-based libraries supports converting characters like é to the ASCII e. But it still fails on characters like π, since there is no ASCII character similar in form.

James Risner
  • Most file systems under Windows use UTF-16, not UTF-8. [ICONV already comes in a DLL](https://stackoverflow.com/q/14140315/4299358), so one can use that instead of starting processes. – AmigoJack Dec 19 '22 at 15:48
  • @AmigoJack Thanks! I edited my answer to address your points. They can't use iconv for their purpose (the OP). – James Risner Dec 19 '22 at 15:52
  • Thanks, calling an external app to convert is an interesting idea; however, I'd prefer a native Delphi solution. – Tom Dec 21 '22 at 06:43
  • @Tom It seems I'm being unclear; I'll remove the reference to the external app. The core of the answer is a Delphi library: https://github.com/d-mozulyov/UniConv – James Risner Dec 21 '22 at 10:06
  • Thanks, I checked the library; unfortunately it did not compile with D7, for example these lines: `type LatinString = type AnsiString(1250); IsoLatinString = type AnsiString(28592);` – Tom Dec 22 '22 at 13:48
  • Or the line `List.SaveToFile(CORRECT_FILE_NAME, TEncoding.UTF8);`, which gives `[Error] FileConversion.dpr(130): Too many actual parameters` – Tom Dec 22 '22 at 13:52
  • Well, there may not be a ready-made solution/library. Maybe the external program could work well enough? Say, check if anything is in x80-xFF, and pass the file through the application as a filter? – James Risner Dec 22 '22 at 14:12
  • The library looks good anyway, but does not seem to work. I tried this function from the lib, `ConvertTextFile('texts\input ', 'output', bomNone);`, to convert from UTF-8 to ANSI, but it failed. I simply used a letter é and saved the input with Notepad as ANSI. – Tom Dec 22 '22 at 14:16

Your requested function would need massive UTF-8 and UTF-16 translation tables for every supported code page across the BMP, and would still be unable to reliably detect the source encoding.

Notepad has trouble with this issue.

The solution as requested would probably entail more effort than went into the original program.

Possible solutions


Add a text editor into your program. If you write it, you will be able to read it.


The following solution pushes the translation to established tables provided by Windows.

Use native Win32 API calls to translate strings, using functions like WideCharToMultiByte, but even this has its drawbacks (from the referenced page; the note is more relevant to the topic, but the caution is important for security):

Caution  Using the WideCharToMultiByte function incorrectly can compromise the security of your application. Calling this function can easily cause a buffer overrun because the size of the input buffer indicated by lpWideCharStr equals the number of characters in the Unicode string, while the size of the output buffer indicated by lpMultiByteStr equals the number of bytes. To avoid a buffer overrun, your application must specify a buffer size appropriate for the data type the buffer receives.

Data converted from UTF-16 to non-Unicode encodings is subject to data loss, because a code page might not be able to represent every character used in the specific Unicode data. For more information, see Security Considerations: International Features.

Note  The ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page, unless legacy standards or data formats prevent the use of Unicode. If using Unicode is not possible, applications should tag the data stream with the appropriate encoding name when protocols allow it. HTML and XML files allow tagging, but text files do not.

This solution still has the guess the encoding problem, but if a BOM is present, this is one of the best translators possible.
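In Delphi 7 that translation pattern could look roughly like this. This is a hedged sketch, not production code: the helper name `WideToAnsiExact` is invented, and `WC_NO_BEST_FIT_CHARS` is declared locally in case the shipped Windows unit lacks it:

```pascal
uses Windows;

const
  WC_NO_BEST_FIT_CHARS = $400;  // may be missing from the D7 Windows unit

// Sketch: convert a WideString to the active ANSI code page, returning
// False if any character had to be replaced by the default character.
function WideToAnsiExact(const W: WideString; out A: AnsiString): Boolean;
var
  Len: Integer;
  UsedDefault: BOOL;
begin
  Result := False;
  A := '';
  if W = '' then
  begin
    Result := True;
    Exit;
  end;
  UsedDefault := False;
  // first call: measure the required buffer size (avoids the overrun
  // the documentation's caution warns about)
  Len := WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS,
    PWideChar(W), Length(W), nil, 0, nil, @UsedDefault);
  if Len = 0 then Exit;
  SetLength(A, Len);
  WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS,
    PWideChar(W), Length(W), PAnsiChar(A), Len, nil, @UsedDefault);
  Result := not UsedDefault;    // True only if every character mapped exactly
  if UsedDefault then A := '';
end;
```

Using `WC_NO_BEST_FIT_CHARS` plus the `UsedDefault` flag turns silent data loss into a detectable failure, which is exactly the error signal the question asked for.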


Simply require the text file to be saved in the local code page before import.


Other thoughts:

ANSI, ASCII, and UTF-8 agree only for byte values below $80; above $7F they are all different encodings, and even the control-character range is handled differently between them.

In UTF-16 LE (what Notepad calls "Unicode"), the second (high) byte of every ASCII-range character is 0; in UTF-16 BE the zero byte comes first. This is not covered in your "rules".
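That observation gives a cheap heuristic for catching BOM-less UTF-16: mostly-ASCII UTF-16 text is full of zero bytes, which essentially never occur in ANSI or UTF-8 text files. A sketch (the function name and the one-quarter threshold are arbitrary choices, not an established algorithm):

```pascal
// Heuristic sketch: guess whether a raw byte buffer is UTF-16 by
// counting zero bytes, which ANSI/UTF-8 text files do not contain.
function LooksLikeUtf16(const Bytes: AnsiString): Boolean;
var
  I, Zeros: Integer;
begin
  Zeros := 0;
  for I := 1 to Length(Bytes) do
    if Bytes[I] = #0 then
      Inc(Zeros);
  // arbitrary threshold: more than a quarter of the bytes are zero
  Result := (Length(Bytes) > 0) and (Zeros * 4 > Length(Bytes));
end;
```

Like every heuristic in this answer, it can misfire on binary data or on non-Latin UTF-16 text with few zero high bytes, so treat it as a hint rather than proof.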

You simply have to search for the Turkish i to understand the complexities of Unicode translations and comparisons.


Leverage any expectations about the file contents to establish a coherent baseline for an educated guess.

For example, if it is a .csv file, find a comma in the various formats...

Bottom Line

There is no perfect general solution, only specific solutions tailored to your specific needs, which were extremely broad in the question.