
I want to check if a file is a plain-text file. I tried the code below:

function IsTextFile(const sFile: TFileName): boolean;
//Created By Marcelo Castro - from Brazil
var
 oIn: TFileStream;
 iRead: Integer;
 iMaxRead: Integer;
 iData: Byte;
 dummy:string;
begin
 result:=true;
 dummy :='';
 oIn := TFileStream.Create(sFile, fmOpenRead or fmShareDenyNone);
 try
   iMaxRead := 1000;  //only test the first 1000 bytes
   if iMaxRead > oIn.Size then
     iMaxRead := oIn.Size;
   for iRead := 1 to iMaxRead do
   begin
     oIn.Read(iData, 1);
     if iData > 127 then Result := False;
   end;
 finally
   FreeAndNil(oIn);
 end;
end;

This function works pretty well for text files that contain only ASCII characters, but text files can also contain non-English characters, and for those this function returns False.

Is there any way to check if a file is a text file or a binary file?

Xel Naga
    (Off-topic, but still rather important:) You really should replace your `result:=false` with `Exit(False)`. If you find that the file is not a text file at char 2, there is not really any need to keep investigating the remaining 998 chars... – Andreas Rejbrand Jun 12 '20 at 09:46
    "Is there any way to check if a file is a text file or a binary file?" In general, no. It is possible for the same file to be a valid text file and a valid binary file when interpreted in different ways. – David Heffernan Jun 12 '20 at 10:24
    I agree that there is no definitive answer whether a file is text or not. However, instead of scanning for bytes higher than 127, you might scan for 0 bytes (if iData = 0 then Result := False;), which might give you a better probability of identifying non-text files. This only applies to ANSI/ASCII/UTF files. – Andre Ruebel Jun 12 '20 at 10:48
    @AndreRuebel: Except that UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE text files often have plenty of nulls in them. – Andreas Rejbrand Jun 12 '20 at 10:55
    With some effort, it is possible to create an algorithm that makes the right guess in most cases. For instance, you can check if the file would be *invalid* in a particular encoding (then you know it is not a text file in that encoding). You can see if every second byte is null; then it is likely UTF-16. You can try to search for English words. And so on. – Andreas Rejbrand Jun 12 '20 at 10:57
    I'm sure what @AndreasRejbrand and DavidH say is correct. Personally I would try a simple statistical analysis based on the frequency of occurrence of carriage return (#13) and linefeed (#10) characters. If they always appear together, I think it would be a good sign that the file contains text. – MartynA Jun 12 '20 at 11:23
    @MartynA note, some text files only have a CR (#13) or LF (#10) as the newline character (like macOS or Linux text files) – R. Hoek Jun 19 '20 at 21:46
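
Putting the suggestions from these comments together, a rough heuristic might look like the sketch below. This is only an illustration, not a definitive test: the function name and the 1000-byte window are my own choices, and as noted above it will misclassify UTF-16/UTF-32 files, which legitimately contain NUL bytes.

function LooksLikeAnsiText(const sFile: TFileName): Boolean;
// Heuristic sketch: scan the first 1000 bytes for NUL bytes, which are
// rare in ANSI/ASCII/UTF-8 text but common in binary files.
// Caveat: UTF-16/UTF-32 text files contain NULs and will return False.
var
  oIn: TFileStream;
  Buf: array[0..999] of Byte;
  iCount, i: Integer;
begin
  Result := True;
  oIn := TFileStream.Create(sFile, fmOpenRead or fmShareDenyNone);
  try
    iCount := oIn.Read(Buf, SizeOf(Buf));  // Read returns the bytes actually read
    for i := 0 to iCount - 1 do
      if Buf[i] = 0 then
        Exit(False);  // stop at the first NUL; no need to scan further
  finally
    FreeAndNil(oIn);
  end;
end;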

1 Answer


You can't detect the codepage, you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.

It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII. There Ain't No Such Thing As Plain Text. If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

That's the first answer from here: How can I detect the encoding/codepage of a text file

You should also bear in mind that any binary file can be valid text in some uncommon encoding. Conversely, binary files encoded in Base64 will pass any test you can think of, as Base64 is by definition a text representation of a binary stream.
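
One thing that *can* be checked reliably is whether a file starts with a Unicode byte-order mark; files without a BOM still require guessing. A minimal sketch (the function name is illustrative, not a standard routine):

function HasUnicodeBOM(const sFile: TFileName): Boolean;
// Checks for the UTF-8, UTF-16LE and UTF-16BE byte-order marks at the
// start of the file. Absence of a BOM proves nothing either way.
var
  oIn: TFileStream;
  Buf: array[0..2] of Byte;
  iCount: Integer;
begin
  Result := False;
  oIn := TFileStream.Create(sFile, fmOpenRead or fmShareDenyNone);
  try
    iCount := oIn.Read(Buf, 3);
    if (iCount >= 3) and (Buf[0] = $EF) and (Buf[1] = $BB) and (Buf[2] = $BF) then
      Exit(True);  // UTF-8 BOM
    if (iCount >= 2) and
       (((Buf[0] = $FF) and (Buf[1] = $FE)) or   // UTF-16LE BOM
        ((Buf[0] = $FE) and (Buf[1] = $FF))) then // UTF-16BE BOM
      Exit(True);
  finally
    FreeAndNil(oIn);
  end;
end;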

Martial P