
How can I make sure that a file is readable by humans?

By that I essentially want to check whether the file is a txt, a yml, a doc, a json file and so on.

The issue is that, in the case where I want to perform this check, file extensions are misleading. By that I mean that a plain text file (which should be .txt) has an extension of .d, and there are various others :-(

What is the best way to verify that a file can be read by humans?

So far I have tried my luck with extensions as follows:

    /** Guesses readability from the file extension alone. */
    private boolean humansCanRead(String extension) {
        switch (extension.toLowerCase()) {
        case "txt":
        case "doc":
        case "json":
        case "yml":
        case "html":
        case "htm":
        case "java":
        case "docx":
            return true;
        default:
            return false;
        }
    }

But as I said, extensions are not as expected.

EDIT: To clarify, I am looking for a solution that is platform independent and does not use external libraries. To narrow down what I mean by "human readable": I mean plain text files that contain characters of any language. I also don't mind whether the text in the file makes sense (for example, if it is encoded); I don't really care about that at this point.

Thanks so far for all the responses! :D

fill͡pant͡
  • I can't read docx and doc. (And in fact my computer can't either.) – aioobe Jun 22 '15 at 09:44
  • You can try Apache Tika to get the filetype (based on the contents). – Ouney Jun 22 '15 at 09:44
  • ...and I can think of a couple of .txt files that I wouldn't be able to read either. (I bet there are a few in French for instance.) What I'm trying to say is that it's impossible to implement this method accurately as you have defined it. – aioobe Jun 22 '15 at 09:46
  • Check this question, might answer your question: http://stackoverflow.com/questions/11901382/how-to-check-if-a-file-is-readable – Roel Strolenberg Jun 22 '15 at 09:46
  • finally [There Ain't No Such Thing as Plain Text](http://blog.codinghorror.com/there-aint-no-such-thing-as-plain-text/). – aioobe Jun 22 '15 at 09:47
  • Note also that one could always rename a binary file to `".txt"`, etc. – Mena Jun 22 '15 at 09:50
  • possible duplicate of [Is there a java library equivalent to file command in unix](http://stackoverflow.com/questions/2729038/is-there-a-java-library-equivalent-to-file-command-in-unix) – slim Jun 22 '15 at 09:52
  • If you narrow the definition of "human" down to something like "Chinese-speaking human" or "English-speaking human" or "computer-programming human" then you can probably get some kind of answer. But humans in general are capable of reading things you wouldn't even think were readable. – biziclop Jun 22 '15 at 09:52
  • Since, as you said, extensions are unreliable, you can check the MIME type instead. http://java.dzone.com/articles/determining-file-types-java – Pradeep S Jun 22 '15 at 09:56
  • Dumb idea, but have a dictionary of English words, and match? – xrisk Jun 22 '15 at 09:58
  • The main question is: why do you need this? Maybe the original problem can be solved without it. – biziclop Jun 22 '15 at 10:02
  • @biziclop I see what you did there :3 Checking _every_ language under the sun would be a bit too much, right? – xrisk Jun 22 '15 at 10:03
  • Not to mention character encoding, which can turn (e.g.) plain English text into unreadable gibberish if encoder and decoder don't match – CupawnTae Jun 22 '15 at 10:10
  • @RishavKundu Yes, and even the characters and encodings are different, which is what makes this problem incredibly hard. On the other hand, what's the point of finding out that a file contains text that someone somewhere may be able to read? If your users can't read the text, it's not much of a consolation for them that there is someone on the planet who can. Which is why I think this is the wrong question altogether. You should ask your users what they can read and only search for those languages/file types, which is a much easier task. – biziclop Jun 22 '15 at 10:10
  • @biziclop I agree with you. I think that the OP should tell us the actual task he is trying to accomplish. – xrisk Jun 22 '15 at 10:15
  • @biziclop, @RishavKundu I can't narrow it down, because it may be any language available as long as it contains characters instead of symbols and bytes :-( I am making a file indexer that displays all human-readable files in a folder and also opens and presents them upon request. So I need a way to separate the binary files from the ones containing text. See the example in the main post. – fill͡pant͡ Jun 22 '15 at 10:25
  • @fillpant But characters **are** symbols and bytes. You can use certain tricks (like checking whether the file is a valid UTF-8 sequence, or whether a given percentage of the bytes are in the ASCII printable range of 32-126), but it'll never cover all encodings and all languages with 100% accuracy. – biziclop Jun 22 '15 at 10:29
  • @biziclop That's enlightening as well, I will check it out. So far this is an answer to my question, but it has its limitations :-( P.S. And I thought it would be something obvious and that I was banging my head for no reason @,..,@ Thanks a lot. – fill͡pant͡ Jun 22 '15 at 10:32
  • @fillpant From a usability perspective I think the best way is to treat these guesses as suggestions only, but still allow the user to select any file they want. – biziclop Jun 22 '15 at 10:36
  • That's what I will do @biziclop, but still I have made a preview panel where the readable files will be shown as a small preview, but if their preview looks like "PK «L~F ώΚ σMΜΛLK-.Ρ K-*ΞΜΟ³R0Τ3ΰεβε PK²ξ", and although these are valid Greek chars, it will not be ideal D: – fill͡pant͡ Jun 22 '15 at 10:40

2 Answers


In general, you cannot do that. You could use a language identification algorithm to guess whether a given text is in a language that could be spoken by humans. Since your example contains formal languages like HTML, however, you are in deep trouble. If you really want to implement your check for (a finite set of) formal languages, you could use a GLR parser to parse the (ambiguous) grammar that combines all these languages. This, however, would not yet solve the problem of syntax errors (although it might be possible to define a heuristic). Finally, you need to consider what you actually mean by "human readable": for example, do you include Base64?

Edit: In case you are only interested in the character set, see this question's answer. Basically, you have to read the file and check whether the content is valid in whatever character encoding you think of as human readable (UTF-8 should cover most of your real-world cases).
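The encoding check described above could be sketched roughly as follows, using only the JDK. The class and method names (`Utf8Check`, `isValidUtf8`) are made up for illustration, and the choice of UTF-8 as the only accepted encoding is an assumption taken from the paragraph above, not a complete answer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: names are illustrative, not from the original answer.
final class Utf8Check {

    /** Returns true if the whole byte array decodes as UTF-8 without errors. */
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Two caveats: if you only pass the first N bytes of a file, a multi-byte character cut off at the end of the sample is reported as invalid, and pure ASCII control bytes (including NUL) are still valid UTF-8, so this check alone probably will not reject every binary file.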

choeger
  • Thanks. By human readable I mean a file that contains plain text; I don't really mind whether it is understandable, like "FEWRREWGAERGVS" versus "How are you doi'n". That is something I need to worry about later. For now I want to exclude everything except files containing plain text. Your answer is enlightening though! Ty! But I still need this kind of separation. Also, as I mentioned in a comment above, the files might be in any language. :D – fill͡pant͡ Jun 22 '15 at 10:30
  • OK, the edit also helps, but as I posted above: "I have made a preview panel where the readable files will be shown as a small preview, but if their preview looks like "PK «L~F ώΚ σMΜΛLK-.Ρ K-*ΞΜΟ³R0Τ3ΰεβε PK²ξ", and although these are valid Greek chars, it will not be ideal D:" :-( – fill͡pant͡ Jun 22 '15 at 10:42

For some files, a check on the proportion of bytes in the printable ASCII range will help. If more than 75% of the bytes are in that range within the first few hundred bytes then it is probably 'readable'.
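A minimal sketch of that proportion check might look like the following. The 75% threshold comes from the paragraph above; the 512-byte sample size, the decision to count tab, line feed and carriage return as printable, and the names `PrintableRatio`, `printableRatio` and `looksLikeAsciiText` are assumptions made for illustration.

```java
// Hypothetical helper: names and the 512-byte sample size are assumptions.
final class PrintableRatio {

    /** Fraction of the first `limit` bytes that are printable ASCII or common whitespace. */
    static double printableRatio(byte[] data, int limit) {
        int n = Math.min(data.length, limit);
        if (n == 0) {
            return 0.0; // assumption: an empty sample is not counted as text
        }
        int printable = 0;
        for (int i = 0; i < n; i++) {
            int b = data[i] & 0xFF; // treat the byte as unsigned
            boolean visible = b >= 0x20 && b <= 0x7E;
            boolean whitespace = b == '\t' || b == '\n' || b == '\r';
            if (visible || whitespace) {
                printable++;
            }
        }
        return (double) printable / n;
    }

    /** Applies the 75% rule of thumb to the first 512 bytes. */
    static boolean looksLikeAsciiText(byte[] data) {
        return printableRatio(data, 512) > 0.75;
    }
}
```

Note that text in a non-Latin script encoded as UTF-8 consists mostly of bytes above 0x7F, so it would fail this particular test; that is why it only helps "for some files".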

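Some files have headers that will tell you whether the file is readable or not: the various forms of BOM on UTF files indicate text, while the "MZ" signature at the start of an .exe, or the 0xA5EC identifier used by MS Word .doc files (whose OLE2 container itself begins with the bytes D0 CF 11 E0 A1 B1 1A E1), indicates a binary format.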
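As a rough sketch, a header check of this kind could be written as below. The BOM values and the "MZ" signature are the ones mentioned above; the ZIP signature (`PK\3\4`, which also starts .docx and .jar files) and the OLE2 container signature are additions of mine, and the names `SignatureCheck`, `Verdict` and `bySignature` are hypothetical.

```java
// Hypothetical helper: names are illustrative; the signature list is far from complete.
final class SignatureCheck {

    enum Verdict { TEXT, BINARY, UNKNOWN }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    static Verdict bySignature(byte[] head) {
        // Unicode byte-order marks.
        if (startsWith(head, 0xEF, 0xBB, 0xBF)               // UTF-8
                || startsWith(head, 0xFE, 0xFF)              // UTF-16 BE
                || startsWith(head, 0xFF, 0xFE)              // UTF-16 LE (also covers UTF-32 LE)
                || startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) { // UTF-32 BE
            return Verdict.TEXT;
        }
        // Well-known binary containers.
        if (startsWith(head, 0x4D, 0x5A)                     // "MZ": DOS/Windows executable
                || startsWith(head, 0x50, 0x4B, 0x03, 0x04)  // "PK\3\4": ZIP, .docx, .jar
                || startsWith(head, 0xD0, 0xCF, 0x11, 0xE0,
                                    0xA1, 0xB1, 0x1A, 0xE1)) { // OLE2: legacy .doc, .xls
            return Verdict.BINARY;
        }
        return Verdict.UNKNOWN; // no known header; fall back to other heuristics
    }
}
```

A two-byte signature like "MZ" is weak evidence, since a genuine text file may happen to start with those letters, so it is probably best treated as a hint rather than a final decision.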

A lot of modern text files are in one of the UTF formats, which can usually be identified by reading the first chunk of the file, even if they don't have a BoM.

Basically, you are going to have to run through a lot of different file types to see if you get a match. Load the first kilobyte of the file into memory and run a lot of different checks on it. Once you have some data, you can order the checks to look for the most common formats first.
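The "load the first kilobyte and run checks in order" idea could be sketched like this. Each check either returns a verdict or declines, and the first verdict wins, so reordering the list changes which formats are tested first. The two checks shown (a NUL-byte test and the 75% printable-ASCII test) are only placeholders, and all names (`HeadSniffer`, `readHead`, `looksLikeText`) are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper: names and the choice of checks are assumptions.
final class HeadSniffer {

    /** Reads at most `max` bytes from the start of the file. */
    static byte[] readHead(Path file, int max) throws IOException {
        byte[] buf = new byte[max];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while (total < max && (n = in.read(buf, total, max - total)) != -1) {
                total += n;
            }
        }
        return Arrays.copyOf(buf, total);
    }

    /** Assumption: a NUL byte means binary. This misclassifies UTF-16 text. */
    static Optional<Boolean> nulByte(byte[] head) {
        for (byte b : head) {
            if (b == 0) {
                return Optional.of(false);
            }
        }
        return Optional.empty(); // no opinion, let the next check decide
    }

    /** More than 75% printable ASCII (plus tab, LF, CR) counts as text. */
    static Optional<Boolean> mostlyPrintable(byte[] head) {
        if (head.length == 0) {
            return Optional.of(false); // assumption: empty files are not shown as text
        }
        int printable = 0;
        for (byte raw : head) {
            int b = raw & 0xFF;
            if ((b >= 0x20 && b <= 0x7E) || b == '\t' || b == '\n' || b == '\r') {
                printable++;
            }
        }
        return Optional.of(printable > 0.75 * head.length);
    }

    /** Runs the checks in order; the first one with an opinion decides. */
    static boolean looksLikeText(byte[] head) {
        List<Function<byte[], Optional<Boolean>>> checks = new ArrayList<>();
        checks.add(HeadSniffer::nulByte);
        checks.add(HeadSniffer::mostlyPrintable);
        for (Function<byte[], Optional<Boolean>> check : checks) {
            Optional<Boolean> verdict = check.apply(head);
            if (verdict.isPresent()) {
                return verdict.get();
            }
        }
        return false;
    }
}
```

A caller would do something like `HeadSniffer.looksLikeText(HeadSniffer.readHead(path, 1024))` for each indexed file; further checks (BOM detection, UTF-8 validation, more signatures) would slot into the list in whatever order your data suggests.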

rossum