I want to verify that a file at a given path is a text file, i.e. not binary and readable by a human. I guess I could read the first characters and check each one with:

  • isAlphaNumeric
  • isSpecial
  • isSeparator
  • isOctetCharacter ???

but joining all those testing methods with and: [ ... and: [ ... and: [ ] ] ] does not seem very Smalltalkish. Any suggestions for a more elegant way?

(There is a Python version in "How to identify binary and text files using Python?", which could be useful, but its syntax and implementation look like C.)

user869097

2 Answers

This can only be done with heuristics; you can never be really certain.

For ASCII, the following may do:

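"text is assumed to hold the file's contents (or at least its first part) as a String"
| isPlausibleAscii numChecked isPossiblyText |

"printable ASCII (32..126; 127 is DEL) or whitespace such as tab, CR, LF"
isPlausibleAscii :=
    [:char |
        (char codePoint between: 32 and: 126)
            or: [ char isSeparator ] ].

"only inspect the first 1024 characters"
numChecked := text size min: 1024.
isPossiblyText := text from: 1 to: numChecked conform: isPlausibleAscii.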

For Unicode (UTF-8?) things become more difficult; you could try to decode the contents, and if there is a conversion error, assume binary.

PS: if you don't have from:to:conform:, replace it with (copyFrom:to:) conform:

PPS: if you don't have conform:, try allSatisfy:
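
The "try to convert" idea for UTF-8 could be sketched roughly as follows. This is a Pharo-flavoured sketch, not a portable one: it assumes ByteArray>>utf8Decoded is available and signals an error on malformed input, and looksLikeUtf8 is a hypothetical name chosen for illustration.

```smalltalk
"Assumption: Pharo, where ByteArray>>utf8Decoded signals an error on invalid UTF-8."
| looksLikeUtf8 |

"Answer true if the bytes decode cleanly as UTF-8, false on any decoding error."
looksLikeUtf8 := [ :bytes |
    [ bytes utf8Decoded. true ]
        on: Error
        do: [ :err | false ] ].

looksLikeUtf8 value: #[72 101 108 108 111].   "plain ASCII, which is also valid UTF-8"
looksLikeUtf8 value: #[255 254 0 1].          "255 can never appear in UTF-8"
```

Two caveats: a sample cut off after a fixed number of bytes may end in the middle of a multi-byte sequence and be rejected even though the file is fine, and bytes below 128 always decode successfully, so this check is best combined with the control-character test above.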

blabla999

Text generally contains more spaces than you'd expect to see in a binary file, and some encodings (UTF-16/32) contain lots of zero bytes for common languages. A Smalltalk-y solution would be to hide the gory details in a method on Standard/MultiByte-FileStream; #isProbablyText would probably be a good name.

It would essentially do the following:

  • Store the current state if you intend to use it later, then reset to the start (set a Latin1 converter if you use a MultiByteStream).

  • Iterate over the next N characters (where N is an appropriate number).

  • Encounter a non-printable ASCII char? It's probably binary, so return false. (There is no special selector for this; use a map, implement a new method on Character, or something similar.)

  • Increment two counters as appropriate: one for space characters and another for zero characters.

  • If the loop finishes, return whether either counter has reached a statistically significant count.

TL;DR: Use a method to hide the gory details; otherwise it's pretty much the same.
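
The steps listed above might look roughly like this when written as a block over the first bytes of a file. It is only a sketch under stated assumptions: the block form stands in for a real #isProbablyText method on a stream class, the 1024-byte sample and the 5% / 20% thresholds are arbitrary guesses rather than tuned values, and zero bytes are counted instead of rejected so that UTF-16/32 text is not ruled out immediately.

```smalltalk
"Hypothetical helper: expects the file's leading bytes as a ByteArray."
| isProbablyText |

isProbablyText := [ :bytes |
    | sample spaces zeros binary |
    sample := bytes copyFrom: 1 to: (bytes size min: 1024).
    spaces := 0.
    zeros := 0.
    binary := false.
    sample do: [ :byte |
        byte = 0
            ifTrue: [ zeros := zeros + 1 ]
            ifFalse: [
                byte = 32 ifTrue: [ spaces := spaces + 1 ].
                "control bytes other than tab, LF, FF, CR, and DEL itself, suggest binary"
                ((byte < 32 and: [ (#(9 10 12 13) includes: byte) not ])
                    or: [ byte = 127 ])
                        ifTrue: [ binary := true ] ] ].
    "assumed thresholds: at least 5% spaces or at least 20% zero bytes"
    binary not and: [
        (spaces * 20 >= sample size)
            or: [ zeros * 5 >= sample size ] ] ].

isProbablyText value: #[72 101 108 108 111 32 119 111 114 108 100].   "the bytes of 'Hello world'"
isProbablyText value: #[137 80 78 71 13 10 26 10].                    "PNG signature; 26 is a control byte"
```

Note that short text without any spaces falls below the space threshold and is reported as non-text, which is the price of relying on the counters.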

Rydier