0

I need to be able to test a text list for file names that seem random;

e.g. aggvvcx.com or kbzaandc.exe

Is there any sensible/reasonable way to do this? The only idea I have is to check for ratio of occurrence of vowels to consonant but this doesn't seem reliable neither does using a dictionary.

EDIT: Definition of randomness

The only information I have about the nature of the randomness is that it is a filename. Maybe it is possible to get a dictionary of common file names and use some kind of pattern parser to determine common file naming patterns and run this against the list after training? This would obviously be a futile approach if we're considering multiple languages, but I'm only interested in checking for English filenames.

Chibueze Opata
  • 9,856
  • 7
  • 42
  • 65
  • 4
    We can't define for you what "random" means. Once you figure this out for yourself, you have the answer to your question. – Jeroen Vannevel Sep 03 '15 at 22:59
  • 4
    You're going to have to provide a much better definition of "seems random," because a randomly-generated string has just as much chance of producing something like "myfile" as it does "qzzyei". That's randomness for you. – TigerhawkT3 Sep 03 '15 at 23:00
  • This might be useful: https://en.wikipedia.org/wiki/Randomness_tests. You might want to try asking on [cs.se]. – Cyphase Sep 03 '15 at 23:03
  • 1
    given the .exe and .com parts, it would seem that a dictionary would be the best choice with a string based distance function. Other then that, you could construct a list of known executables and flag anything that does not belong there. – Nicko Po Sep 03 '15 at 23:05
  • @TigerhawkT3 That's precisely my problem. I think randomness by definition says you can't know if I am random. – Chibueze Opata Sep 03 '15 at 23:05
  • Exactly. Did Shakespeare write this play, or did it come from an infinite number of monkeys? There's no way to tell. – TigerhawkT3 Sep 03 '15 at 23:07
  • @TigerhawkT3, obviously from theoretic point of view, but given that he is hunting malware, i think he is looking for any ~reasonable~ heuristic :) – Nicko Po Sep 03 '15 at 23:14
  • @NickoPo - where is malware mentioned? – TigerhawkT3 Sep 03 '15 at 23:18
  • The definition you have added to your post is not a definition so much as a DWIM ("Do What I Mean"), i.e., wishful thinking. – TigerhawkT3 Sep 03 '15 at 23:23
  • @TigerhawkT3 if I provide a precise *definition* like you suggest, then it's obviously no longer random. – Chibueze Opata Sep 03 '15 at 23:25
  • 2
    Which of these is random (without googling): `nvvsvc.exe`, `msseces.exe` or `cvvil.exe`. Two of them are common services packaged with windows, one is 'random' – Rob Sep 03 '15 at 23:31
  • Hmmm, nice suggestion @Rob, Google doesn't allow direct use of their Search API but I could try Bing. To answer the question directly though, they all seem random except for msseces.exe – Chibueze Opata Sep 03 '15 at 23:33

5 Answers5

1

What do you mean by random exactly? There are many ways to answer that question.

Technically, it could be "how much entropy they contain" using information theory methods.

Since you mention dictionary, you may actually mean "do they look like real words?" This can be checked for long texts using letter distribution, but will fail for short names like the ones you show. Instead you could try n-grams for characters. This is similar to letter frequency, but for 2/3-letter sequences. That means if you try bigrams, you'll find that the first word contains "gv", "vv", "vc", "cx", which are likely impossible to find in any English word.

There are also other ways to answer the question, so you'll have to figure out what does "random" mean to you in this situation exactly.

viraptor
  • 33,322
  • 10
  • 107
  • 191
1

You could try

  1. https://github.com/rrenaud/Gibberish-Detector

  2. for longer strings gzip compression with zlib where greater compression indicates smaller randomness

  3. frequency analysis of characters in the string compared to averages for appropriate natural languages

  4. Google search assuming random strings are likely to have significantly fewer hits

  5. soundex to determine if the string has at least one syllable and is therefore more likely to be pronunciable and so less likely to be random

  6. n-grams with naive Bayesian analysis (http://theory.stanford.edu/~dfreeman/papers/namespam.pdf)

  7. train a neural network to do it similarly as for spam filtering

  8. a combination of all of the above for best results based on the approach of the winner of the Netflix challenge, namely that a combination of relatively mediocre tests may produce a much better one.

  • Excellent, thank you very much. Apparently this is related to a [Stackoverflow Question](http://stackoverflow.com/q/6297991/612717) which answers the question very well. – Chibueze Opata Sep 04 '15 at 00:11
  • @ChibuezeOpata: Yes I saw that one and more than a few others not all on SO. It's an interesting question and becoming more difficult to solve due to internationalization (is the string random or from a different language) and namespace crowding (forcing people to randomize strings for login names in order to get a slot). –  Sep 04 '15 at 00:24
  • Indeed, man creating problems for himself since whenever. Usernames/nicknames will probably become deprecated in 50 years. – Chibueze Opata Sep 04 '15 at 00:26
0

There are lots of randomness tests, so issue one for you is going to be deciding what you mean by randomness. I'm afraid that it's a non-trivial issue making that decision. But the Wikipedia page makes a good place to start.

https://en.wikipedia.org/wiki/Randomness_tests

The good news is that if you just need it to be "pretty jumbled" then there are a number of reasonable (i.e., computationally cheap and generally good-enough) approaches you can take.

Peter O.
  • 32,158
  • 14
  • 82
  • 96
Burleigh Bear
  • 3,274
  • 22
  • 32
0

One semi rough and quick heuristic check would be to sort a string by individual letters, and compare its sorted sequence against a likelihood of generating that sequence randomly for that length. i.e. for word length 2, a (sorted) string "AA" given 26 letters in alphabet, has a chance of 1/(26*26), but a (sorted) string "AB" - which is generated by "AB" and "BA" - has a chance of 2/(26*26).

P.S. from programming perspective, another way, would be to run a spell checker against it and figure out how many "mistakes" are there. Then put a threshold value against it.

Nicko Po
  • 727
  • 5
  • 21
0

I had to solve a closely related problem for a source code mining project and developed Nostril (for "Nonsense String Evaluator"). This Python 3 package is aimed at determining whether strings extracted during source-code mining are likely to be class/function/variable/etc. identifiers or random gibberish. It works well on real text too, not just program identifiers. Nostril uses n-grams (similar to the Gibberish Detector by Rob Neuhaus) in combination with a custom TF-IDF scoring function. It comes pretrained, and is ready to use out of the box.

Example: the following code,

from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))

will produce the following output:

bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense

The project is on GitHub and I welcome contributions.

mhucka
  • 2,143
  • 26
  • 41