Regex in python: word matching if n characters in a row are equal to the pattern

Question

Suppose there is a keyword (password) 'rain'. The program should run only if the word supplied by the user contains at least 75% of the keyword's characters in a row (!).

Here is my regex code:

import re

key = 'rain'
l_word = int(len(key) * 3 / 4)  # 75% of the key length
my_regex = r'^[a-z0-9_]*' + '[' + key + ']' + '{' + str(l_word) + ',}' + '[a-z0-9_]*$'
bool(re.match(my_regex, 'air'))  # True, but it should be False

where l_word is 75% of the keyword's length. The problematic part of my_regex is '[' + key + ']': it is a character class, so it matches any combination of the keyword's letters (in my case "rain") rather than the letters in a row. For example, "air" shouldn't work, but "12Qain" should.

How can I fix that?

Are those characters supposed to be contiguous? For example, is `12Qasin` a correct answer? — MTMD, Aug 30 '18 at 19:00
You could look into creating n-grams of your key, where `n` is the minimum length needed to cover at least 75% of your key. For example, if the key is 5 characters long, you'd need 4 letters to get at least 75%, so the 4-grams for the key 'paint' would be 'pain' and 'aint'. [This question](https://stackoverflow.com/questions/13423919/computing-n-grams-using-python) should help you create those. Then you'd just join your n-grams with `|` to look for any of those combinations. — tblznbits, Aug 30 '18 at 19:14

score 2 · Answer 1 · answered Aug 30 '18 at 19:16

Are you sure you need a regex? Something like this computes the percentage of characters that match, position by position:

>>> a = list('abce')
>>> b = list('abcd')
>>> 100 - (sum(i != j for i, j in zip(a, b)) / float(len(a))) * 100
75.0

But if b = list('bdce'), the result is only 50%, because the comparison is positional.
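Because a positional comparison penalises a run that is merely shifted (as in `12Qain`), one possible alternative is to measure the longest run shared by both strings wherever it occurs. The sketch below uses `difflib.SequenceMatcher.find_longest_match` from the standard library; the helper name `longest_run_ratio` is made up for this illustration and is not part of the original answer.

```python
from difflib import SequenceMatcher

def longest_run_ratio(key, word):
    """Length of the longest run of `key` found in `word`, as a fraction of len(key)."""
    # autojunk=False disables the popularity heuristic, which is irrelevant for short strings
    matcher = SequenceMatcher(None, key, word, autojunk=False)
    match = matcher.find_longest_match(0, len(key), 0, len(word))
    return match.size / len(key)

print(longest_run_ratio('rain', '12Qain'))  # 0.75, the shared run is 'ain'
print(longest_run_ratio('rain', 'air'))     # 0.5, the shared run is only 'ai'
```

A word would then be accepted when the returned ratio is at least 0.75.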

score 1 · Answer 2 · answered Aug 30 '18 at 19:22

You may use this alternation based approach:

>>> import re
>>> key = 'rain'
>>> l_word = int(len(key) * 3 / 4)

>>> my_regex = re.compile(r'^' + key[0:l_word] + '|' + key[-l_word:] + '$')

>>> print(my_regex.pattern)
^rai|ain$

>>> print(bool(my_regex.search('air')))
False
>>> print(bool(my_regex.search('12Qain')))
True
>>> print(bool(my_regex.search('raisin')))
True

The regex ^rai|ain$ matches either the first 75% of the keyword at the start of the input or the last 75% of the keyword at the end of the input.
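One consequence worth checking is that the anchors tie each alternative to one end of the input, so a run that sits on the "wrong" side is rejected. A quick sketch of that behaviour; the inputs `ain12` and `xrai` are chosen here purely for illustration:

```python
import re

key = 'rain'
l_word = len(key) * 3 // 4  # 3 characters
my_regex = re.compile('^' + key[:l_word] + '|' + key[-l_word:] + '$')

for word in ['12Qain', 'raisin', 'ain12', 'xrai']:
    print(word, bool(my_regex.search(word)))
# 12Qain True   ('ain' at the end)
# raisin True   ('rai' at the start)
# ain12 False   ('ain' is not at the end)
# xrai False    ('rai' is not at the start)
```

Whether rejecting the last two inputs is desirable depends on how strictly the password rule is meant to be read.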

tblznbits · Answer 3 · 2018-08-30T19:46:54.513

This approach uses n-grams to allow for varying ratios and varying key lengths, while ensuring that the matched letters are contiguous.

import re
import math

key = 'paint'
n = math.ceil(len(key) * 0.75) # round up when len(key) * 3 is not a multiple of 4

def ngrams(key, n):
    output = []
    for i in range(len(key) - n + 1):
        output.append(key[i:(i+n)])
    return output

patterns = '|'.join(ngrams(key, n))
# group the alternation so the character classes and anchors apply to every n-gram
regex = r'^[a-z0-9_]*(?:' + patterns + r')[a-z0-9_]*$'

print("Allowed matches: {}".format(patterns))
print("Pants matches: {}".format(bool(re.search(regex, 'pants'))))
print("Pains matches: {}".format(bool(re.search(regex, 'pains'))))
print("Taint matches: {}".format(bool(re.search(regex, 'taint'))))

Allowed matches: pain|aint
Pants matches: False
Pains matches: True
Taint matches: True
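Since `|` has the lowest precedence in a regex, the alternation is safest inside a non-capturing group, and `re.escape` guards against keys that contain regex metacharacters. A minimal sketch that packages both ideas; the function name `build_run_regex` and the `ratio` parameter are assumptions made for this example:

```python
import math
import re

def build_run_regex(key, ratio=0.75):
    """Compile a regex that accepts words containing a long enough run of `key`."""
    n = math.ceil(len(key) * ratio)
    # every window of n consecutive characters of the key
    windows = [key[i:i + n] for i in range(len(key) - n + 1)]
    alternation = '|'.join(re.escape(w) for w in windows)
    # (?:...) keeps the prefix/suffix classes attached to every alternative
    return re.compile(r'^[a-z0-9_]*(?:' + alternation + r')[a-z0-9_]*$')

regex = build_run_regex('paint')
for word in ['pants', 'pains', 'taint', 'pain!']:
    print(word, bool(regex.search(word)))
# pants False
# pains True
# taint True
# pain! False  ('!' is outside [a-z0-9_])
```

Without the group, an input such as `pain!` would be accepted, because the trailing class and `$` would bind only to the last alternative.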

Keep in mind that Python already supports substring checks with the `in` operator on two strings, so you can skip the regex and do this:

patterns = ngrams(key, n)
for test in ['pants', 'pains', 'taint']:
    matches = 0
    for pattern in patterns:
        if pattern in test:
            matches += 1
    if matches:
        print(test, 'matches')
    else:
        print(test, 'did not match')

pants did not match
pains matches
taint matches

I was about to post the for loop approach, but with the structure `for... if match: break; ... else: print('no match')` instead, so that the loop terminates on the first match found — xdze2, Aug 30 '18 at 19:57
@xdze2 That would certainly make it shorter, which is a better approach if there are lots of terms to search through. — tblznbits, Aug 30 '18 at 19:58
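The early-exit idea from these comments can also be sketched with `any()`, which stops at the first n-gram found in the word. The helper name `has_run` and its `ratio` parameter are hypothetical and not taken from the answer:

```python
import math

def has_run(key, word, ratio=0.75):
    """True if `word` contains at least ratio * len(key) consecutive characters of `key`."""
    n = math.ceil(len(key) * ratio)
    # the generator is consumed lazily, so the search stops at the first hit
    return any(key[i:i + n] in word for i in range(len(key) - n + 1))

for test in ['pants', 'pains', 'taint']:
    print(test, 'matches' if has_run('paint', test) else 'did not match')
# pants did not match
# pains matches
# taint matches
```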
