7

I have implemented a fuzzy matching algorithm and I would like to evaluate its recall using some sample queries with test data.

Let's say I have a document containing the text:

{"text": "The quick brown fox jumps over the lazy dog"}

I want to see if I can retrieve it by testing queries such as "sox" or "hazy drog" instead of "fox" and "lazy dog".

In other words, I want to add noise to strings to generate misspelled words (typos).

What would be a way of automatically generating words with typos for evaluating fuzzy search?

Philip
  • 3,135
  • 2
  • 29
  • 43
  • 1
    There are a number of ways in which you can generate the queries, but the hard part is to decide the scope that you want to cover. Are you interested only in full mistyped words or can I have just pieces (e.g. "jun" for "jum"). Can I have non-consecutive words ("quick sox")? Only one typo or could be more? Missing withespace ("lazydog")? Are typos random or related to keyboard position (e.g. assuming qwerty, "n" for "m" but not "q" for "m")? Is case-sensitivity important? Can there be any unicode typos (from accents to emojis...)? What I mean I guess is it can be as hard or simple as you want. – jdehesa Jun 28 '18 at 10:09
  • 1
    @jdehesa I agree it is a broad question, but I think any answer concerning a specific scope, e.g. keyboard positions, could be useful. You could then generate mistyped words by randomizing from a pool of typo generators. – Philip Jun 28 '18 at 11:24

2 Answers2

5

I would just create a program to randomly alter letters in your words. I guess you can elaborate for specific requirements of your case, but the general idea would go like this.

Say you have a phrase

phrase = "The quick brown fox jumps over the lazy dog"

Then define a probability for a word to change (say 10%)

p = 0.1

Then loop over the words of your phrase and sample from a uniform distribution for each one of them. If the random variable is lower than your threshold, then randomly change one letter from the word

import string
import random

new_phrase = []
words = phrase.split(' ')
for word in words:
    outcome = random.random()
    if outcome <= p:
        ix = random.choice(range(len(word)))
        new_word = ''.join([word[w] if w != ix else random.choice(string.ascii_letters) for w in range(len(word))])
        new_phrase.append(new_word)
    else:
        new_phrase.append(word)

new_phrase = ' '.join([w for w in new_phrase]) 

In my case I got the following interesting phrase result

print(new_phrase)
'The quick brown fWx jumps ovey the lazy dog'
kosnik
  • 2,342
  • 10
  • 23
3

Haven't used this myself, but a quick google search found https://www.dcs.bbk.ac.uk/~ROGER/corpora.html which I guess you can use to get frequent misspellings for words in your text. You can also generate misspellings yourself using keyboard distance, as explained here, I guess: Edit distance such as Levenshtein taking into account proximity on keyboard Perhaps there are some other databases/corpora of frequent misspellings other than the one referred to above, because I would guess that just randomly inserting/deleting/changing characters with a total levenhstein distance of, say, max 3 will not be a useful evaluation of your system, since people don't randomly make mistakes, but exhibit simple, logical patterns in the types of (spelling) mistakes made.

Igor
  • 1,251
  • 10
  • 21