How to measure the "probability" that a string is some sort of code or nonsense

Question

Let's assume that we have the following strings:

q8GDNG8h029751
DNS
stackoverflow.com
28743.8.4.919
q7Q5w5dP012855
Martin_Luther
0000000100000000-0000000160000000
1344444967.962
ExTreme_penguin

Obviously, our brains can classify some of these as strings containing information, strings that have some "meaning" for humans. On the other hand, strings like "q7Q5w5dP012855" are clearly codes that could mean something only to a computer.

My question is: can we calculate some probability that a string actually tells us something?

I have thought of doing frequency analysis, counting capital letters, etc., but it would be convenient to have something more 'scientific'.

See http://stackoverflow.com/questions/92006/how-do-i-determine-if-a-random-string-sounds-like-english for some ideas. — Joe, Aug 05 '13 at 13:54

score 1 · Answer 1 · answered Aug 05 '13 at 13:54

If you know the language that the strings are in, you could use digram or trigram letter frequencies for the words in that language. These are quite small lookup tables, [26 x 26] or [26 x 26 x 26], where each entry can be a floating point number giving the probability of that letter sequence occurring in the language. For a meaningless string, many of the entries you look up would be zero. You could add the probabilities up, or simply count the number of zero-probability sequences.

Of course this needs setting up for each language.
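
A minimal sketch of the digram idea in Python, under stated assumptions: the four-word `SAMPLE_WORDS` list is a hypothetical stand-in for a real word list of the target language, and the names `train_bigrams` and `unseen_bigram_ratio` are made up for illustration. Any pair containing a non-letter (digit, dot, underscore) is treated as a zero-probability entry, since the table only covers the 26 letters.

```python
from collections import Counter

# Hypothetical toy corpus; a usable table needs a large word list
# for the language in question.
SAMPLE_WORDS = ["martin", "luther", "extreme", "penguin"]


def train_bigrams(words):
    """Build a digram -> relative frequency table from a list of words."""
    counts = Counter()
    for word in words:
        word = word.lower()
        for a, b in zip(word, word[1:]):
            # Only letter pairs go into the 26 x 26 table.
            if "a" <= a <= "z" and "a" <= b <= "z":
                counts[a + b] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def unseen_bigram_ratio(text, table):
    """Fraction of adjacent character pairs with zero probability in the table.

    Values near 1.0 suggest a code or nonsense; values near 0.0 suggest
    the string resembles the training language.
    """
    text = text.lower()
    pairs = [a + b for a, b in zip(text, text[1:])]
    if not pairs:
        # Assumption: strings too short to have any pair count as unknown.
        return 1.0
    unseen = sum(1 for pair in pairs if table.get(pair, 0.0) == 0.0)
    return unseen / len(pairs)


table = train_bigrams(SAMPLE_WORDS)
for candidate in ["Martin_Luther", "q7Q5w5dP012855", "28743.8.4.919"]:
    print(candidate, round(unseen_bigram_ratio(candidate, table), 2))
```

With a table this small almost every real word would also look unseen, so the scores only become informative once the table is trained on a realistic corpus. Summing log-probabilities with smoothing instead of counting zeros is a common variant, but that goes beyond what the answer describes.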
