match words with few differences allowed

Question

I was wondering if there is any tool for matching almost identical words from a bash terminal.

The following file, called list.txt, contains one word per line:

ban
1ban
12ban
12ban3

It is easy to find words containing "ban":

grep "ban" list.txt

Question:

How can I match words that differ from the search word by at most X letters? With the search word "ban", I expect "1ban" to match for X=1.

Concerning the notion of distance, I want to allow at most X deletions, or X substitutions, or X insertions.

Any tool will do, but preferably something that can be called from the command line in a bash terminal.

NOTE: The Levenshtein Distance will count an insertion of 2 letters as 1 difference. This is not what I want.

Maybe you want `grep "[a-z]anana" list.txt`? Or even `grep "[^[:space:]]anana" list.txt`. Probably, your answer is [already here](https://stackoverflow.com/questions/30355972/fuzzy-string-matching-with-grep). — Wiktor Stribiżew, Jul 11 '18 at 08:08
If Python comes into play, you may use the PyPi regex package and use fuzzy matching combined with regex functionality. — Wiktor Stribiżew, Jul 11 '18 at 08:13
You are providing a solution to an unknown set of requirements. This is not a good idea because, without exact requirements, every solution can be considered both good and wrong. Please update the question with a real set of requirements. Are you searching for whole words or not? What edits do you want to consider? Just substitutions, without insertions/deletions? Please make the question answerable. — Wiktor Stribiżew, Jul 12 '18 at 12:27
True, I understand why it is hard to answer this question. The notion of distance can be interpreted in different ways. I am looking for whole words, and I want at most X differences (so 1 insertion OR 1 substitution OR 1 deletion). Can you update your answer? — Gildas, Jul 12 '18 at 14:04
I have updated my solution and tested it a bit; it seems to work as per the current requirements. Sorry for the delay. — Wiktor Stribiżew, Jul 20 '18 at 07:58

Wiktor Stribiżew · Accepted Answer · 2018-07-20T07:52:25.717

1

You may use the PyPI regex package for Python, which supports fuzzy matching.

Since you want to match whole words with at most X differences (1 insertion OR 1 substitution OR 1 deletion), you may create a Python script like this:

#!/usr/bin/env python3
import sys

import regex  # third-party package: pip install regex


def main(argv):
    if len(argv) < 3:
        print("USAGE: fuzzy_search.py SEARCHWORD XDIFF FILE", file=sys.stderr)
        sys.exit(1)
    search = argv[0]
    xdiff = int(argv[1])  # upper limit for each kind of edit
    file = argv[2]
    # \b...\b keeps matches to whole words; {s<=N,i<=N,d<=N} sets the fuzzy limits
    pattern = r"\b(?:{0}){{s<={1},i<={1},d<={1}}}\b".format(regex.escape(search), xdiff)
    with open(file, "r") as f:
        contents = f.read()
    print(regex.findall(pattern, contents))


if __name__ == "__main__":
    main(sys.argv[1:])

Here, {s<=1,i<=1,d<=1} allows at most one substitution (s<=1), at most one insertion (i<=1) and at most one deletion (d<=1) relative to the search word. Each limit applies to its own kind of edit; to also cap the total number of edits, add an e<=1 constraint.

The \b are word boundaries. Thanks to them, only whole words are matched (cat in vacation will not get matched).

Save as fuzzy_search.py.

Then, you may call it as

python3 fuzzy_search.py "ban" 1 file

where "ban" is the word to search for, 1 is the upper limit of differences, and file is the path to the word list (list.txt in the question).

The result I get is

['ban', '1ban']

You may change the output format to print one match per line:

print("\n".join(regex.findall(r"\b(?:{0}){{s<={1},i<={1},d<={1}}}\b".format(regex.escape(search), xdiff), contents)))

Then, the result is

ban
1ban

edited Jul 20 '18 at 07:52

answered Jul 11 '18 at 09:22

Wiktor Stribiżew


Thank you! This seems to do the job... except that I cannot use it to search for x differences (let's say you allow 2 differences instead of 1). It is apparently not getting rid of the end-of-line character. Using f.readline().strip(), I can loop over each line, but then something is wrong with the regexp. Could you explain your expression a bit more? – Gildas Jul 11 '18 at 10:43
@Gildas If you need to allow no more than 2 substitutions, replace `s<=1` with `s<=2`. You may also use a variable here to be passed via command line. What do you mean by end of line? You may `.strip()` it. See [the PyPi regex docs](https://pypi.org/project/regex/) for more examples. `{0 – Wiktor Stribiżew Jul 11 '18 at 10:44
@Gildas Please update your question to be more specific about what you want. My current answer gives you a jumpstart. – Wiktor Stribiżew Jul 12 '18 at 04:59
Hi Wiktor. I am still doing some tests on your answer. I created a variable X that I pass to the regular expression r"(?:{}){{s<=X,i<=X,d<=X,e<=X}}".format(regex.escape(search.strip() ), but so far it is not doing what I expect. I will update the question, but did you try it? The regular expression is not clear to me, and it returns a partial match of a word. – Gildas Jul 12 '18 at 11:45
@Gildas Exactly. If you plan to match whole words, use a word boundary, `\b`, on both ends: `r"\b(?:{0}){{s<={1},i<={1},d<={1},e<={1}}}\b".format(regex.escape(search), threshold)` (where `threshold` is the limit of the differences). If you do not want to allow deletions, remove `,d<={1}`. – Wiktor Stribiżew Jul 12 '18 at 12:00
Using banana can be misleading, so I changed it to "ban". I hope the question/requirements are now easier to understand. – Gildas Jul 12 '18 at 14:12

score 0 · Answer 2 · edited Jul 13 '18 at 04:10

You can check the difference by comparing the two strings character by character in Python, as shown below:

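def is_diff(str1, str2):
    """Return True if exactly one compared position differs (a single substitution)."""
    # zip() stops at the shorter string, so insertions and deletions are not detected.
    diff = False
    for char1, char2 in zip(str1, str2):
        if char1 != char2:
            if diff:
                return False  # second mismatch: more than one difference
            diff = True
    return diff


with open('list.txt') as f:
    data = f.readlines()

for line in data:
    # strip the trailing newline before comparing
    print(is_diff('ban', line.strip()))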
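The comparison above only sees substitutions at identical positions. If insertions and deletions should count too, one option is a plain edit-distance filter. The following is a rough sketch assuming the standard Levenshtein distance, where every inserted, deleted or substituted character counts as one edit; the names `levenshtein` and `fuzzy_filter` are illustrative, and the words are hard-coded from the question rather than read from list.txt.

```python
def levenshtein(a, b):
    """Standard edit distance: each insertion, deletion or substitution costs 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def fuzzy_filter(search, words, max_diff):
    """Keep the words whose edit distance to `search` is at most max_diff."""
    return [w for w in words if levenshtein(search, w) <= max_diff]


words = ["ban", "1ban", "12ban", "12ban3"]
print(fuzzy_filter("ban", words, 1))  # ['ban', '1ban']
```

Note that this mixes the three kinds of edits freely, so it is looser than a rule that allows only one kind of edit at a time.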
