By human-readable, I mean that the string is a real word. This is essentially form validation. Ideally, I'd like to test the 'texture' of the form responses to determine whether an actual user filled out the form or someone is probing it for vulnerabilities. One possibility is a dictionary look-up on the POSTed data, with a threshold on the proportion of 'real words' returned.

I don't see anything in the PHP docs, and Google isn't offering up anything, at least nothing this specific. I suspect that someone out there has written a PHP class or even a jQuery plugin that can do this. Something like this:

$string = "laiqbqi";

is_this_string_human_readable($string);

Any ideas?

Dan Whitinger
  • Related http://stackoverflow.com/questions/6297991/is-there-any-way-to-detect-strings-like-putjbtghguhjjjanika and https://github.com/buggedcom/Gibberish-Detector-PHP. Some other cool techniques outlined http://stackoverflow.com/a/4674100/46675 – Mike B Jun 01 '12 at 16:35
  • Define human-readable. Do you mean pronounceable? Or real words? The latter is most efficiently done via a dictionary lookup. A pronunciation check is a bit more involved. – Unsigned Jun 01 '12 at 16:35
  • There's also this: http://stackoverflow.com/questions/2229054/php-dictionary-class-or-alternative – karim79 Jun 01 '12 at 16:38
  • Thanks Mike. Good find on the Gibberish Detector. – Dan Whitinger Jun 01 '12 at 16:52
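
For reference, the dictionary look-up with a threshold that the question and Unsigned's comment describe could be sketched roughly as below. This is only an illustration: `real_word_ratio`, `looks_human_written`, the 0.5 threshold and the tiny inline word list are all made-up stand-ins, and a real word list would have to come from somewhere else (a dictionary file, for instance).

```php
<?php
// Hypothetical helper: fraction of the words in $text that appear in $dictionary.
function real_word_ratio(string $text, array $dictionary): float
{
    // Lowercase the input and pull out runs of letters as candidate words.
    preg_match_all('/[a-z]+/', strtolower($text), $matches);
    $tokens = $matches[0];
    if (count($tokens) === 0) {
        return 0.0; // nothing word-like was submitted
    }

    // Flip the list so every lookup is a cheap key check.
    $known = array_flip($dictionary);
    $hits = 0;
    foreach ($tokens as $token) {
        if (isset($known[$token])) {
            $hits++;
        }
    }
    return $hits / count($tokens);
}

// Hypothetical wrapper: the 0.5 threshold is an arbitrary assumption and would need tuning.
function looks_human_written(string $text, array $dictionary, float $threshold = 0.5): bool
{
    return real_word_ratio($text, $dictionary) >= $threshold;
}

// Tiny stand-in word list, for illustration only.
$dictionary = ['please', 'send', 'me', 'more', 'information', 'about', 'your', 'product'];

var_dump(looks_human_written('Please send me more information', $dictionary));
var_dump(looks_human_written('laiqbqi zzxq', $dictionary));
```

A plain look-up like this will flag names, typos and anything not in the list, so the threshold probably has to be fairly forgiving.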

1 Answer

This can be done using Markov chains.

Essentially, you read through a large chunk of text in a given language (English, French, Russian, etc.) and estimate the probability of each character following another.

e.g. a "q" has a much lower probability of occurring after a "z" than a vowel such as "a" does.

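At a lower level, this is implemented as a state machine: each character is a state, and the estimated probabilities are the weights on the transitions between states.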
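
To make the idea concrete, here is a minimal sketch of a character-level Markov model in PHP. It is not the code from the linked detector. The function names, the one-sentence training corpus, the add-one smoothing and the -3.2 cut-off are all assumptions chosen for illustration; a usable model would be trained on far more text and the threshold tuned against real form submissions.

```php
<?php
// Hypothetical helper: count how often each character follows another in sample text.
function train_bigram_model(string $corpus): array
{
    // Collapse every run of non-letters into one space: states are a-z plus a word boundary.
    $clean = preg_replace('/[^a-z]+/', ' ', strtolower($corpus));
    $counts = [];
    for ($i = 0, $n = strlen($clean) - 1; $i < $n; $i++) {
        $from = $clean[$i];
        $to = $clean[$i + 1];
        $counts[$from][$to] = ($counts[$from][$to] ?? 0) + 1;
    }
    return $counts;
}

// Hypothetical helper: mean log-probability of the transitions in $text under the model.
function average_log_probability(string $text, array $counts, int $alphabetSize = 27): float
{
    $clean = preg_replace('/[^a-z]+/', ' ', strtolower($text));
    $n = strlen($clean) - 1;
    if ($n < 1) {
        return -INF; // too short to contain a single transition
    }
    $total = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $from = $clean[$i];
        $to = $clean[$i + 1];
        $rowTotal = array_sum($counts[$from] ?? []);
        $seen = $counts[$from][$to] ?? 0;
        // Add-one smoothing (an assumption): unseen transitions get a small non-zero probability.
        $total += log(($seen + 1) / ($rowTotal + $alphabetSize));
    }
    return $total / $n;
}

// Name borrowed from the question; the extra parameters are hypothetical.
function is_this_string_human_readable(string $string, array $counts, float $threshold): bool
{
    return average_log_probability($string, $counts) >= $threshold;
}

// Toy corpus, far too small for real use.
$corpus = 'the quick brown fox jumps over the old dog while the other dogs rest near the river';
$model = train_bigram_model($corpus);

var_dump(is_this_string_human_readable('the other dogs', $model, -3.2));
var_dump(is_this_string_human_readable('laiqbqi', $model, -3.2));
```

Averaging the log-probabilities keeps the score comparable across strings of different lengths, which matters when form fields vary from a single word to a paragraph.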

As per Mike's comment, a PHP version of this can be found here: https://github.com/buggedcom/Gibberish-Detector-PHP

For flavor, there is also an amusing Daily WTF article on Markov chains.

Codeman