73

People search on my website, and some of the queries they enter look like these:

tapoktrpasawe
qweasd qwa as
aıe qwo ıak kqw
qwe qwe qwe a

My question: is there any way to detect strings similar to the ones above?

I suppose it is impossible to detect 100% of them, but any solution would be welcome :)

Edit: I mean "gibberish searches". For example, some people search for strings like "asdqweasdqw", "paykaprkg", or "iwepr wepr ow" in my search engine, and I want to detect those gibberish searches.

It doesn't matter whether the search returns 0 results or not, so I can't use that logic.

Some new brands or products would be ignored if I only considered "regular words".

Thank you for your help

Uri
ahe
  • 4
    What exactly are you trying to detect? We need more information if we're going to help. – Dan Jun 09 '11 at 19:16
  • Even Google did not give any result for that put#@@. Then what result are you giving? :-) – zod Jun 09 '11 at 19:16
  • Maybe you could put a spell checker in your search form. – babsher Jun 09 '11 at 19:17
  • 8
    There is no way to detect with a machine if a search string makes sense or not. If they enter nonsense, they will find nothing - isn't this enough? – Tomalak Jun 09 '11 at 19:18
  • I was going to suggest calculating a weighted sum where consecutive characters that are adjacent on a keyboard get a stronger weight, and scale the result by string length, but that'll only catch one specific type of gibberish typing. – Doug Kavendek Jun 09 '11 at 19:31
  • @Doug Kavendek you can use a similar idea in the opposite direction (look at lots of English text, find which characters tend to be adjacent), and then use that to estimate the chance that some text is actually English. It is simple and it works reasonably well. – Rob Neuhaus Jun 10 '11 at 15:34
  • 1
    Why do you need to do this? Do gibberish searches make up a substantial fraction of search traffic, or impose a noticeable load on the database? – Nick Johnson Jun 11 '11 at 08:01
  • 1
    I would drop the idea because detecting garbage will most possibly need more computing power than executing a garbage query (which is technically correct and may even be what the user wants because *garbage* is pretty subjective, I guess). – Rob Jan 31 '18 at 05:29

10 Answers

176

You could build a model of character-to-character transitions from a bunch of English text. So for example, you find out how common it is for an 'h' to follow a 't' (pretty common). In English, you expect that after a 'q' you'll get a 'u'. If you get a 'q' followed by something other than a 'u', that happens with very low probability, and hence it should be pretty alarming. Normalize the counts in your table so that you have probabilities. Then for a query, walk through the matrix and compute the product of the transition probabilities you take. Then normalize by the length of the query. When the number is low, you likely have a gibberish query (or something in a different language).

If you have a bunch of query logs, you might first make a model of general English text, and then heavily weight your own queries in that model training phase.

For background, read about Markov Chains.
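
For concreteness, here is a minimal sketch of that idea (not the code from the repository linked below): train character-to-character transition probabilities on some English text, then score a query by its average log transition probability. The training text and the threshold are placeholders you would tune on your own data.

import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def train(training_text):
    # Count transitions between consecutive characters, with add-one smoothing
    # so that unseen pairs never get probability zero.
    counts = {a: {b: 1.0 for b in ALPHABET} for a in ALPHABET}
    text = [c for c in training_text.lower() if c in ALPHABET]
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    # Normalize each row into log probabilities.
    model = {}
    for a, row in counts.items():
        total = sum(row.values())
        model[a] = {b: math.log(count / total) for b, count in row.items()}
    return model

def avg_log_transition(query, model):
    # Length-normalized log probability of the query's character transitions.
    text = [c for c in query.lower() if c in ALPHABET]
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return 0.0
    return sum(model[a][b] for a, b in pairs) / len(pairs)

def is_gibberish(query, model, threshold=-3.5):
    # The threshold here is a made-up starting point; tune it on known-good
    # and known-gibberish queries.
    return avg_log_transition(query, model) < threshold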

Edit: I implemented this in Python here:

https://github.com/rrenaud/Gibberish-Detector

and buggedcom rewrote it in PHP:

https://github.com/buggedcom/Gibberish-Detector-PHP

my name is rob and i like to hack True
is this thing working? True
i hope so True
t2 chhsdfitoixcv False
ytjkacvzw False
yutthasxcvqer False
seems okay True
yay! True
Rob Neuhaus
  • 2
    Implying that Markov Chains are the "background" to the technique you are using gives the impression that you are doing something much more sophisticated than you really are. The reader requires no understanding of Markov Chains to understand your solution to this. – sanity Jul 04 '11 at 12:09
  • 7
    I've translated rrenaud's python script it into PHP https://github.com/buggedcom/Gibberish-Detector-PHP – buggedcom Oct 25 '11 at 12:05
  • 3
    I rewrote this in perl here https://github.com/complexitydev/PerlGibberishDetector – Ben Feb 27 '17 at 06:25
  • Can anyone help me understand what gib_model.pki is? I am looking for a similar solution in R. – ayush varshney Jul 29 '19 at 12:08
  • it's just a stored version of the model, so it doesn't need to read all the training data – Rob Neuhaus Jul 29 '19 at 21:46
  • @RobNeuhaus Where did you create the model? Is gib_model.pki produced by some backend machine-learning process? – ayush varshney Jul 30 '19 at 09:33
  • 1
    I wrote a version that does this in nodejs https://www.npmjs.com/package/gibberish-detective – WhiskeyTangoFoxtrot May 30 '20 at 23:36
  • 1
    I ported it to Python 3: https://github.com/amitness/Gibberish-Detector – Amit Chaudhary Dec 12 '20 at 11:42
  • 1
    I think this is the correct answer, using the Markov approach and not trying to compute Shannon entropy, since with the latter 'adxcelm' would still be gibberish, but it would actually have a very high normalized entropy. – Pepe Alvarez Feb 23 '22 at 03:50
  • 1
    Hi Rob, could you please explain to me what you mean by "Then normalize by the length of the query". I understand that the probability gets smaller when the string gets longer, but how can this be corrected? Thanks for your help! – D. Studer Apr 22 '22 at 16:25
9

You could do what Stack Overflow does and calculate the entropy of the string.

Of course, this is just one of many heuristics SO uses to determine low-quality answers, and should not be relied upon as 100% accurate.
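
As a rough illustration of the entropy heuristic (this is plain character-level Shannon entropy, not whatever exact formula SO uses), a Python sketch like the following will flag strings dominated by repeated characters; note that keyboard-mash strings can still score high, so entropy is a weak signal on its own:

import math
from collections import Counter

def shannon_entropy(s):
    # Character-level Shannon entropy in bits per character.
    if not s:
        return 0.0
    counts = Counter(s)
    total = len(s)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(shannon_entropy("aaaaaaaa"))       # 0.0 -> very low, looks like junk
print(shannon_entropy("qweasd qwa as"))  # higher, so entropy alone won't catch it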

BlueRaja - Danny Pflughoeft
8

I had to solve a closely related problem for a source code mining project, and although the package is written in Python and not PHP, it seemed worth mentioning here in case it can still be useful somehow. The package is Nostril (for "Nonsense String Evaluator") and it is aimed at determining whether strings extracted during source-code mining are likely to be class/function/variable/etc. identifiers or random gibberish. It works well on real text too, not just program identifiers. Nostril uses n-grams (similar to the Gibberish Detector in the answer by Rob Neuhaus) in combination with a custom TF-IDF scoring function. It comes pretrained, and is ready to use out of the box.

Example: the following code,

from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))

will produce the following output:

bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense

The project is on GitHub and I welcome contributions.

mhucka
8

Assuming you mean gibberish searches... It would be more trouble than it's worth. You are providing them with a search function; let them use it however they please. I'm sure there are some algorithms out there that detect strange character groupings, but it would probably be more resource/labour-intensive than simply returning no results.

  • I think I could determine if a search is jibberish pretty well by keeping a 65KB (128 * 128 array of floats) table and basically just iterating through the string. I am sure it's going to be a lot cheaper than a database query that returns no results. – Rob Neuhaus Jun 09 '11 at 20:20
  • @rrenaud So you plan on doing this for every search, even the 'valid' ones? It's not worth it. –  Jun 10 '11 at 00:26
  • I get a 10 character input. So I add up 10 numbers, do a division, and compare to a threshold. The run time computation is super low. The biggest cost will be in collecting the data and coding it up. – Rob Neuhaus Jun 10 '11 at 00:32
  • @rrenaud Ok I get you... But that's some serious data collection and analysis. Way beyond my (and most people's) ability. –  Jun 10 '11 at 00:50
  • 6
    It's just a couple hours of hacking, https://github.com/rrenaud/Gibberish-Detector – Rob Neuhaus Jun 10 '11 at 04:45
  • @chrislegend If you could do a gibberish test only on the gibberish strings, you wouldn't need the test in the first place. – Nick Johnson Jun 11 '11 at 08:00
5

I'd think you could detect these strings the same way you could detect "regular words." It's just pattern matching, no?

As to why users are searching for these strings, that's the bigger question. You may be able to head off the gibberish searches some other way. For example, if it's comment-spam phrases that people (or a script) are looking for, then install a CAPTCHA.

Edit: Another end-run around interpreting the input is to throttle it slightly. Allow a search every 10 seconds or so. (I recall seeing this on forum software, as well as various places on SO.) This will take some of the fun out of searching for sdfpjheroptuhdfj over and over again, and at the same time won't interfere with the users who are searching for, and finding, their stuff.
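
A minimal sketch of that throttle, assuming an in-memory dict keyed by session id (in practice you would use your session store or cache; the 10-second interval is just the example figure from above):

import time

last_search = {}  # session id -> timestamp of the last allowed search

def allow_search(session_id, min_interval=10):
    # Reject the search if this session already searched less than
    # min_interval seconds ago.
    now = time.time()
    if now - last_search.get(session_id, 0) < min_interval:
        return False
    last_search[session_id] = now
    return True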

John
  • Most of the visitors are kids, so they just do it. A CAPTCHA before every search is not a useful solution. Some new brands or products would be ignored if I only considered "regular words". Thank you for your help – ahe Jun 09 '11 at 19:25
  • If that's the case, then you might do all right by throttling the searches (slightly) -- as in allow one search every 10 seconds or so. That will take some of the fun out of searching for *sdfgbpoisdfbijhaoi* over and over, but won't impact the people actually looking for, and finding, what they need. – John Jun 09 '11 at 19:32
4

Short answer - Gibberish search detection

A probabilistic language model works.

Logic

A word is made up of a sequence of characters. Look at how frequently each pair of adjacent characters occurs together in English, sum those frequencies over all contiguous character pairs in the word, and check whether the sum crosses a threshold; if it does, the word is likely a proper English word. In brief, this logic is known from Markov chains.

Link

For the mathematics behind gibberish detection and a better understanding, refer to this video: https://www.youtube.com/watch?v=l15C8UJu17s. Thanks!

Aakash Goel
3

As some people commented, there are no hits in Google for tapoktrpasawe or putjbtghguhjjjanika (well, there are now, of course), so if you have a way to do a quick Google search through an API, you could throw out any search terms that got no Google results and weren't the names of one of your products. Why you would want to do this is a whole other question - are you trying to save effort for your search library? Make your hand-review of "popular search terms" more meaningful? Or are you just frustrated at the inexplicable behaviour of some of the people out on the big wide internet? If it's the latter, my advice is to just let it go, even if there is a way to prevent it. Some other weirdness will come along.

Kate Gregory
1

You can detect random strings by comparing how many of a word's character bigrams are common English bigrams versus uncommon ones.

Here is some simple Python code:

# en_bigrams_dict is assumed to map English bigrams to their frequencies;
# it ships with the random-string-detector package mentioned below.
def is_random_string(word, threshold=0.1):
    # Allow only words longer than 3 characters which contain only English alphabetic characters
    if len(word) < 4 or not word.isalpha():
        return False

    # Repeating characters
    if len(set(word)) == 1:
        return True

    # Turn word into lowercase
    word = word.lower()

    # Get list of bigrams from the word
    bigrams = [word[i:i + 2] for i in range(len(word) - 1)]

    # Get number of common and uncommon bigrams
    num_common_bigrams = sum(1 for bigram in bigrams if en_bigrams_dict.get(bigram, 0) > threshold)
    num_uncommon_bigrams = len(bigrams) - num_common_bigrams

    # Higher number wins
    if num_common_bigrams > num_uncommon_bigrams:
        return False
    else:
        return True

Alternatively, you can install the package using

pip install random-string-detector

Then import the function

from random_string_detector.random_string_detector import is_random_string

and finally test it:

words = ["asdqweasdqw", "paykaprkg", "iwepr"]
for word in words:
    print(is_random_string(word, 5)) # The result is always 'True'

You can find en_bigrams_dict here, and more details in the Medium blog post here.

nightzus
0

If the search is performed on products, you could cache their names or codes and check the query against that list before querying the database. Otherwise, if your site is for English users, you could build a dictionary of strings that aren't used in the English language, like qwkfagsd. As another answer already points out, though, this will be more resource-intensive than doing nothing at all.
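
A rough sketch of the cached-product-list check in Python (the catalogue here is a made-up placeholder; in practice you would load and periodically refresh your real product names and codes):

# Hypothetical cached catalogue of product names/codes.
product_terms = {"iphone", "thinkpad", "xps"}

def matches_known_product(query):
    # If any token of the query matches a cached product name or code,
    # run the database search; otherwise the gibberish handling can kick in.
    return any(token in product_terms for token in query.lower().split())

print(matches_known_product("thinkpad charger"))  # True
print(matches_known_product("tapoktrpasawe"))     # False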

noinstance
  • 1
    So you are suggesting he check all searches against an English dictionary? Why would anyone want to do that? –  Jun 09 '11 at 19:18
  • Not really, I mean small strings like "asd" or "qwe" that people often use to fill inputs. – noinstance Jun 09 '11 at 19:21
0

If, like me, you just want a quick and dirty PHP solution, here is one:

function isGibberish($data) {
    // Frequency of every byte value in the upper-cased input
    $freq = count_chars(strtoupper($data), 0);
    // Occurrences of five rare English letters
    $rareCount = $freq[ord('Q')] + $freq[ord('X')] + $freq[ord('Z')] + $freq[ord('J')] + $freq[ord('K')];
    // Occurrences of five common English letters
    $commonCount = $freq[ord('A')] + $freq[ord('E')] + $freq[ord('R')] + $freq[ord('S')] + $freq[ord('T')];
    // Share of rare letters; the 0.01 guards against division by zero
    $gibberishScore = floatval($rareCount) / (0.01 + $rareCount + $commonCount);
    return $gibberishScore > 0.3;
}

It simply counts occurrences of five rare letters and five common letters, and checks whether the ratio of rare to common is excessive. Tailor to your needs.

Dave B