
The winner of a recent Wikipedia vandalism detection competition suggests that detection could be improved by "detecting random keyboard hits considering QWERTY keyboard layout".

Example: woijf qoeoifwjf oiiwjf oiwj pfowjfoiwjfo oiwjfoewoh

Is there any software that does this already (preferably free and open source)?

If not, is there an active FOSS project whose goal is to achieve this?

If not, how would you suggest implementing such software?

Nicolas Raoul
  • Vandalism detection algorithms already include dictionary/grammar-based detection, so here I am looking for an algorithm that does NOT use dictionaries or grammar, but rather finger patterns. – Nicolas Raoul Sep 27 '10 at 08:45
  • How exactly do 'finger patterns' differ from dictionary entries plus grammar rules? It is the same approach; the distinction is that one is positive detection and the other negative detection. Furthermore, it is not clear what you are asking for: random keyboard hits considering QWERTY are no different than random keyboard hits considering Dvorak, unless they are not really random (maybe better to call them 'commonly used vandalism constructs'). – Unreason Sep 27 '10 at 10:45
  • @Unreason: About your first question: I meant dictionaries and grammars of existing human languages. The "negative detection" you propose is interesting; feel free to propose it as an answer. About the "Furthermore": let me reformulate my question: you are given a sequence of characters that has been typed on a QWERTY keyboard; how do you calculate the probability that it was typed carelessly (i.e. by someone whose goal was not to express something but to quickly enter many characters, for instance oiuroiqewrcoqf)? – Nicolas Raoul Sep 27 '10 at 11:21

5 Answers


If the two letters of a bigram in the analyzed text are close in QWERTY terms but the bigram has near-zero statistical frequency in the English language (like the pairs "fg" or "cd"), then there is a chance that random keyboard hits are involved. If more such pairs are found, the chance increases greatly.

If you want to take into account the use of both hands for bashing, then test letters that are separated by another letter for QWERTY closeness, but two bigrams (or even a trigram) for frequency. For example, in the text "flsjf" you would check F and S for QWERTY distance, but the bigrams FL and LS (or the trigram FLS) for frequency.
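A minimal sketch of the basic check in Python, assuming a toy QWERTY grid and a placeholder set of common English bigrams (a real detector would load bigram frequencies from a corpus; for the two-hand variant you would compare characters two positions apart instead):

    # Flag bigrams whose keys are adjacent on QWERTY but which are rare in English.
    QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
    KEY_POS = {ch: (row, col)
               for row, line in enumerate(QWERTY_ROWS)
               for col, ch in enumerate(line)}

    # Tiny stand-in for a corpus-derived frequency table (assumption).
    COMMON_BIGRAMS = {"th", "he", "in", "er", "an", "re", "on", "at",
                      "en", "nd", "ti", "es", "or", "te", "of", "ed"}

    def key_distance(a, b):
        """Chebyshev distance between two keys on the QWERTY grid."""
        (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
        return max(abs(r1 - r2), abs(c1 - c2))

    def suspicious_bigrams(text):
        """Count bigrams that are close on the keyboard yet rare in English."""
        chars = [c for c in text.lower() if c in KEY_POS]
        return sum(1 for a, b in zip(chars, chars[1:])
                   if key_distance(a, b) <= 1 and a + b not in COMMON_BIGRAMS)

    print(suspicious_bigrams("woijf qoeoifwjf oiiwjf"))  # several hits
    print(suspicious_bigrams("the quick brown fox"))     # one hit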

Dialecticus
  • +1 this sounds good, but first the list of these common bigrams for gibberish needs to be extracted; otherwise the end result would be based on guesstimates (guessing which bigrams or trigrams are characteristic of gibberish). – Unreason Sep 27 '10 at 11:57
  • Maybe it should be stated for the OP that bigram matching is the common algorithm found in spell checkers. – Unreason Sep 27 '10 at 12:00
  • Accepted. For reference, I would like to add that repetition of an unusual bigram is a quasi-sure sign. – Nicolas Raoul Oct 04 '10 at 07:42
  • So to go back to Nicolas' question: is there any open-source lib that implements this type of logic? – TheArchitect Oct 14 '13 at 18:17
  • @TheArchitect to that question I'm no smarter than Google – Dialecticus Oct 14 '13 at 23:08

Consider the empirical distribution of two-letter sequences, i.e. "the probability of seeing letter a given that it follows letter b". All these probabilities fill a table of size 27x27 (considering space as a letter).

Now, compare this with historical data from a bunch of English/French/whatever texts. Use the Kullback-Leibler divergence for the comparison.
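A minimal sketch of this, assuming Laplace smoothing so the divergence stays finite; the reference table would be built once from a large corpus:

    import math
    from collections import Counter

    ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 27 symbols, space included

    def bigram_distribution(text, smoothing=1.0):
        """Smoothed empirical distribution over the 27x27 bigram table."""
        chars = [c for c in text.lower() if c in ALPHABET]
        counts = Counter(zip(chars, chars[1:]))
        total = sum(counts.values()) + smoothing * len(ALPHABET) ** 2
        return {(a, b): (counts[(a, b)] + smoothing) / total
                for a in ALPHABET for b in ALPHABET}

    def kl_divergence(p, q):
        """Kullback-Leibler divergence D(p || q) between two bigram tables."""
        return sum(p[k] * math.log(p[k] / q[k]) for k in p)

    # Usage (reference_text stands in for a large English corpus):
    # reference = bigram_distribution(reference_text)
    # score = kl_divergence(bigram_distribution(suspect_text), reference)
    # High scores suggest the text does not follow the language's statistics.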

Alexandre C.

Most keyboard mashing tends to be on the home row, in my experience. It would be reasonably simple to check whether a high proportion of the characters used are asdfjkl;.
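A rough sketch of that check; the 60% threshold is a guess and would need tuning:

    HOME_ROW = set("asdfghjkl;")

    def looks_like_home_row_mash(text, threshold=0.6):
        """True if an unusually high share of characters sit on the home row."""
        chars = [c for c in text.lower() if not c.isspace()]
        if not chars:
            return False
        return sum(c in HOME_ROW for c in chars) / len(chars) > threshold

    print(looks_like_home_row_mash("asdfjkl; asdf jkl;"))  # True
    print(looks_like_home_row_mash("a normal sentence"))   # False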

fredley

Taking an approach based on keyboard layout will provide a good indicator. With a QWERTY layout you will find that around 52% of the letters in any given text come from the top line of keyboard characters, about 32% from the middle line, and 14% from the bottom line. While this varies slightly from one language to another, there remains a very clear pattern which can be detected.

Use the same methodology to discover patterns in other keyboard layouts, then make sure you detect the layout used for any text entered before checking it for gibberish.

Even though the pattern is clear, it is best to use this method as one indicator only, since it works best with longer texts. Other indicators, such as non-alphanumeric characters mixed with alphanumeric ones, text length, etc., provide further signals which, with appropriate weighting, can give a pretty good overall indication of gibberish entry.
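A sketch of the row-distribution indicator; the expected proportions are the ones quoted above, while the tolerance is an arbitrary assumption:

    ROWS = {"top": set("qwertyuiop"),
            "middle": set("asdfghjkl"),
            "bottom": set("zxcvbnm")}
    EXPECTED = {"top": 0.52, "middle": 0.32, "bottom": 0.14}

    def deviates_from_typing_pattern(text, tolerance=0.15):
        """True if the text's row distribution strays far from normal prose."""
        letters = [c for c in text.lower() if c.isalpha()]
        if not letters:
            return False
        return any(abs(sum(c in keys for c in letters) / len(letters)
                       - EXPECTED[name]) > tolerance
                   for name, keys in ROWS.items())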

Glenn Bull

Fredley's answer can be extended to a grammar that would construct words from nearby letters.

For example, asasasasasdf could be generated by a grammar that connects as, sa, sd and df.

Such a grammar, expanded to all letters on the keyboard (connecting letters that are next to each other), could, after parsing, give you a measure of how much of a text can be generated by this 'gibberish' grammar.
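A sketch of that measure; the adjacency table below is abbreviated for illustration and would need to cover the whole keyboard:

    # Each key 'produces' its physical neighbours; a bigram is producible
    # by the 'gibberish' grammar if the second letter neighbours the first.
    NEIGHBOURS = {
        "a": set("qwsz"), "s": set("awedxz"), "d": set("serfcx"),
        "f": set("drtgvc"), "j": set("huiknm"), "k": set("jiolm"),
        # ... remaining keys filled in the same way
    }

    def gibberish_coverage(text):
        """Fraction of the text's bigrams the adjacency 'grammar' can generate."""
        letters = [c for c in text.lower() if c.isalpha()]
        pairs = list(zip(letters, letters[1:]))
        if not pairs:
            return 0.0
        return sum(b in NEIGHBOURS.get(a, set()) for a, b in pairs) / len(pairs)

    print(gibberish_coverage("asasasasasdf"))   # 1.0
    print(gibberish_coverage("ordinary text"))  # much lower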

Caveat: of course, any text discussing such a grammar and listing examples of 'gibberish' text would score significantly higher than regular spell-checked text.

Do note that the example approach would not catch vandalism in the form of 'h4x0r rulezzzzz!!!!!'.

Another approach here (which can be integrated with the above method) would be to statistically analyze a corpus of vandalized text and try to get common words in vandalized texts.

EDIT:
Since you are assuming QWERTY, I guess we could assume English, too?

What about KISS: run the text through an English spell checker and, if it fails miserably, conclude that it is probably gibberish. (The question is, why would you want to distinguish quickly typed gibberish from random nonsense, or for that matter from very badly spelled text?)

Alternatively, if other keyboard layouts (Dvorak, anyone?) and languages are to be considered, then maybe run the text through all available language spell checkers and proceed from there (this would give you language auto-detection, too).

This would not be a very efficient method, but it could be used as a baseline test.
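A baseline sketch, assuming the third-party pyspellchecker package is available (pip install pyspellchecker); the 50% cut-off for "fails miserably" is an arbitrary assumption:

    from spellchecker import SpellChecker  # pip install pyspellchecker

    def mostly_misspelled(text, threshold=0.5):
        """True if more than `threshold` of the words are unknown to the dictionary."""
        words = text.lower().split()
        if not words:
            return False
        unknown = SpellChecker(language="en").unknown(words)
        return len(unknown) / len(words) > threshold

    print(mostly_misspelled("woijf qoeoifwjf oiiwjf oiwj"))  # True
    print(mostly_misspelled("this is a normal sentence"))    # False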

Note:
In the long run I imagine that vandals would adapt and start vandalizing with, for example, excerpts from other Wikipedia pages, which would ultimately be hard to detect automatically as vandalism (OK, existing texts could be checksummed and a flag raised on duplicates, but if the text came from some other source it would still be hard).

Unreason
  • About your "Do note" paragraph: Indeed, the 'h4x0r rulezzzzz!!!!!' case is not targeted here, and it is actually taken care of by other means, which the winner's paper talks about. In brief: Character repetition of "zzzzz" and excessive punctuation would already mark it as probable vandalism. – Nicolas Raoul Sep 27 '10 at 12:01