1

I would like to validate whether a subscriber typed meaningful English text into a form rather than random characters such as 'jaghdjagqfsg'.

By "nonsense" I simply mean random characters, input made up only of dots, numbers, or spaces, runs of four or more consonants, and so on. I am looking for recipes or regex patterns so that I don't reinvent the wheel.

For example, in an onsubmit handler:

function validateForm() {
  var x = document.forms["myForm"]["fname"].value;
  if (x == "jaghdjagqfsg") {
    alert("Please answer the question");
    return false;
  }
}
  • But what if my name really is jaghdjagqfsg? – j08691 Aug 05 '20 at 19:37
  • @j08691 Username checks out – Captain Aug 05 '20 at 19:38
  • You would need a dictionary if you wanted to filter for meaningful words. Do you have one, or are you looking at this more generally? – Gurwinder Aug 05 '20 at 19:39
  • I think we have to define "nonsense" text. It's hard for a system to know unless we have some sort of pattern. You should have either a dictionary of acceptable words, a dictionary of disallowed words, or some sort of pattern. – rootkonda Aug 05 '20 at 19:40
  • You haven't given us a well-defined problem. It isn't clear what "meaningful English" is, and I'm quite sure there is no way its definition is compact enough to be discussed here. – Gershom Maes Aug 05 '20 at 19:40
  • Thanks for the comments. I have edited the question. – Frederick Hose Aug 05 '20 at 20:01
  • I think you may be asking "How to make a cheap spell-checker" which is non-trivial but has an interesting history https://en.wikipedia.org/wiki/Spell_checker – jezpez Aug 06 '20 at 04:04

2 Answers

2

Here is my idea: get a sense of the typical frequencies of letters in English, then compare those frequencies against the frequencies in the input text. (This can be done by computing the magnitude of a "vector difference".) If the input's frequencies deviate past some threshold, we assume that the input is not comprehensible English.

I'm using a copy-pasted poem to get a sense of normal letter frequencies; you may be able to look up a more definitive list of frequencies yourself. The threshold I'm using gets stricter as more characters are entered, so that most gibberish is filtered out and most reasonable sentences are allowed. The box in the bottom right shows the current score of the entered text, followed by the threshold it must stay below in order to be considered valid.

let textElem = document.querySelector('textarea');
let pElem = document.querySelector('p');

let sampleText = `
Hail to thee, blithe Spirit! Bird thou never wert, That from Heaven, or near it, Pourest thy full heart In profuse strains of unpremeditated art.
Higher still and higher From the earth thou springest Like a cloud of fire; The blue deep thou wingest, And singing still dost soar, and soaring ever singest.
In the golden lightning Of the sunken sun, O'er which clouds are bright'ning, Thou dost float and run; Like an unbodied joy whose race is just begun.
The pale purple even Melts around thy flight; Like a star of Heaven, In the broad day-light Thou art unseen, but yet I hear thy shrill delight,
Keen as are the arrows Of that silver sphere, Whose intense lamp narrows In the white dawn clear Until we hardly see, we feel that it is there.
All the earth and air With thy voice is loud, As, when night is bare, From one lonely cloud The moon rains out her beams, and Heaven is overflow'd.
What thou art we know not; What is most like thee? From rainbow clouds there flow not Drops so bright to see As from thy presence showers a rain of melody.
Like a Poet hidden In the light of thought, Singing hymns unbidden, Till the world is wrought To sympathy with hopes and fears it heeded not:
Like a high-born maiden In a palace-tower, Soothing her love-laden Soul in secret hour With music sweet as love, which overflows her bower:
Like a glow-worm golden In a dell of dew, Scattering unbeholden Its aëreal hue Among the flowers and grass, which screen it from the view:
Like a rose embower'd In its own green leaves, By warm winds deflower'd, Till the scent it gives Makes faint with too much sweet those heavy-winged thieves:
Sound of vernal showers On the twinkling grass, Rain-awaken'd flowers, All that ever was Joyous, and clear, and fresh, thy music doth surpass.
Teach us, Sprite or Bird, What sweet thoughts are thine: I have never heard Praise of love or wine That panted forth a flood of rapture so divine.
Chorus Hymeneal, Or triumphal chant, Match'd with thine would be all But an empty vaunt, A thing wherein we feel there is some hidden want.
What objects are the fountains Of thy happy strain? What fields, or waves, or mountains? What shapes of sky or plain? What love of thine own kind? what ignorance of pain?
With thy clear keen joyance Languor cannot be: Shadow of annoyance Never came near thee: Thou lovest: but ne'er knew love's sad satiety.
Waking or asleep, Thou of death must deem Things more true and deep Than we mortals dream, Or how could thy notes flow in such a crystal stream?
We look before and after, And pine for what is not: Our sincerest laughter With some pain is fraught; Our sweetest songs are those that tell of saddest thought.
Yet if we could scorn Hate, and pride, and fear; If we were things born Not to shed a tear, I know not how thy joy we ever should come near.
Better than all measures Of delightful sound, Better than all treasures That in books are found, Thy skill to poet were, thou scorner of the ground!
Teach me half the gladness That thy brain must know, Such harmonious madness From my lips would flow The world should listen then, as I am listening now.
`;

let getCharFrequency = text => {
  
  // Each character increments the value in `f` under the key which is the character
  let f = {};
  for (let char of text.toLowerCase()) f[char] = (f.hasOwnProperty(char) ? f[char] : 0) + 1;
  
  // Normalize this vector by dividing every value by the length
  // Note that `vectorDiffMag` calculates the length if the second
  // vector is `{}` (the "0-vector")
  let len = vectorDiffMag(f, {});
  for (let k in f) f[k] = f[k] / len;
  
  return f;
  
};
let vectorDiffMag = (freq1, freq2) => {
  
  // Returns the magnitude of the vector difference
  // It is essentially a square root of squared differences
  
  let allKeys = new Set([ ...Object.keys(freq1), ...Object.keys(freq2) ]);
  let squareSum = 0;
  for (let key of allKeys) {
    let v1 = freq1.hasOwnProperty(key) ? freq1[key] : 0;
    let v2 = freq2.hasOwnProperty(key) ? freq2[key] : 0;
    let diff = v2 - v1;
    squareSum += diff * diff; // Add the square
  }
  return Math.sqrt(squareSum); // Return the overall square root
  
};

// We only need to compute our "main" frequencies once
let mainFreqs = getCharFrequency(sampleText);

textElem.addEventListener('input', evt => {
  
  // The more characters typed, the stricter the threshold becomes
  // Note these constants allow tuning how strict the threshold
  // becomes as more input is received. I think I've tuned them
  // somewhat well but you may be able to optimize them further:
  let a = 5;     // Larger values keep the threshold loose for longer inputs
  let b = 0.85;  // Scales the logarithmic term
  let c = 0.55;  // Asymptote (strictest possible threshold)
  let thresh = Math.log(1 + a / textElem.value.length) * b + c;
  
  // Get the magnitude of the vector difference between the "main"
  // frequencies, and the user's input's frequencies
  let diff = vectorDiffMag(mainFreqs, getCharFrequency(textElem.value));
  
  // Render results:
  pElem.innerHTML = `${diff.toFixed(3)} (${thresh.toFixed(2)})`;
  if (diff < thresh) {
    textElem.classList.remove('invalid');
    textElem.classList.add('valid');
  } else {
    textElem.classList.remove('valid');
    textElem.classList.add('invalid');
  }
  
});
textarea {
  position: absolute;
  box-sizing: border-box;
  width: 90%; height: 90%;
  left: 5%; top: 5%;
  resize: none;
  font-size: 150%;
}
textarea.valid { background-color: rgba(0, 150, 0, 0.15); }
textarea.invalid { background-color: rgba(150, 0, 0, 0.15); }
p {
  position: absolute;
  right: 0; bottom: 0;
  padding: 5px; margin: 0;
  background-color: #ffffff;
  box-shadow: inset 0 0 0 1px #000;
  font-size: 120%;
} 
<textarea placeholder="type something!"></textarea>
<p>1.000 / Infinite</p>

EDIT: As pointed out by jezpez, this is hardly a silver bullet! In order to get better validation you would need to use this method in combination with other techniques.
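
To see how the length-dependent threshold in the snippet above behaves, it can be evaluated on its own. The sketch below reuses the constants `a`, `b`, and `c` from that snippet; the function name `threshold` is introduced here only for illustration. Because both frequency vectors have length at most 1 and no negative entries, the score can never exceed √2 (about 1.414), so any input length whose threshold lies above that value passes no matter what was typed.

```javascript
// Constants copied from the snippet above.
const a = 5, b = 0.85, c = 0.55;

// Hypothetical name: the threshold for an input of `length` characters.
const threshold = length => Math.log(1 + a / length) * b + c;

// Print the threshold for a few input lengths.
for (const n of [1, 2, 3, 10, 100, 1000]) {
  console.log(n, threshold(n).toFixed(3));
}

// Lengths whose threshold exceeds sqrt(2) can never be flagged as invalid.
console.log(threshold(2) > Math.SQRT2); // true
console.log(threshold(3) > Math.SQRT2); // false
```

With these constants, inputs of one or two characters are always accepted, and the threshold only approaches `c` for long inputs, which is one reason to pair this check with other techniques for short answers.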

Gershom Maes
  • A great answer that works well enough for long strings and avoids seeding a 'known words' list with an entire English dictionary. But it is brittle: creative answers, real names of things, or words with many repeated letters, such as "Mississippi", will fail this check. Still, a cool idea and a cool sample. – jezpez Aug 06 '20 at 04:02
0

As I mentioned in the comments, we first have to define what "nonsense" text is, because it is hard for a system to know unless we have some sort of pattern for what is and is not acceptable.

So you can either come up with a dictionary of allowed words or of disallowed words, or find a general pattern, expressed as a regex, that is used to either filter out or accept a valid user name.
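
As a rough illustration of both routes, the heuristics listed in the question (only dots, digits, and spaces; four or more consonants in a row) can be written as regular expressions, and the dictionary route as a lookup against a word list. This is only a sketch: `looksLikeNonsense`, `knownWordRatio`, and the tiny word list are hypothetical names and data made up for illustration, and treating "y" as a vowel is an assumption.

```javascript
// Hypothetical helper: true when `text` trips one of the question's heuristics.
function looksLikeNonsense(text) {
  const trimmed = text.trim();
  if (trimmed === '') return true;                  // nothing entered
  if (/^[\d.\s]+$/.test(trimmed)) return true;      // only digits, dots, whitespace
  // Four or more consonants in a row ("y" is treated as a vowel here).
  if (/[bcdfghjklmnpqrstvwxz]{4,}/i.test(trimmed)) return true;
  return false;
}

// Hypothetical helper: fraction of the words in `text` found in `dictionary`.
function knownWordRatio(text, dictionary) {
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  if (words.length === 0) return 0;
  const known = words.filter(word => dictionary.has(word)).length;
  return known / words.length;
}

// Illustrative word list only; a real one would be far larger.
const tinyDictionary = new Set(['i', 'like', 'the', 'blue', 'sky']);

console.log(looksLikeNonsense('jaghdjagqfsg'));        // true
console.log(looksLikeNonsense('12. 34 ...'));          // true
console.log(looksLikeNonsense('I like the blue sky')); // false
console.log(looksLikeNonsense('strengths'));           // true (a false positive)
console.log(knownWordRatio('I like the blue qwzx', tinyDictionary)); // 0.8
```

The `strengths` case shows why such patterns are best treated as hints rather than hard rules: ordinary English words can contain long consonant runs, and real names may not appear in any dictionary.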

rootkonda