Regular expression string search in Python with similar strings

Question

I'm trying to verify the presence of each of a series of given substrings within a larger string in Python 3.7 using regular expressions. The problem is similar to this:

import re
LargeString = 'Lorem ipsum is simply dummy text of the printing and typesetting industry RE-2 and other Lorem ipsum RE-44'
substringSet = ['RE-2','RE-44','RE-4']
for _c in substringSet:
    _idxDevice = re.findall(_c,LargeString)

Clearly, written this way, the regular expression gives a positive answer for every element of substringSet, whereas I would like to distinguish the fact that RE-44 is present in LargeString whereas RE-4 is not. Any ideas?

Add a word boundary `\b` at the end of each pattern. (read a basic tutorial about regex patterns). — Casimir et Hippolyte, Jan 31 '21 at 18:43
If you just need to check the presence of every item of your substringSet, then check my variant in the answers. If not, then please add more details — dukkee, Jan 31 '21 at 18:46
@dukkee your solution works great. Actually I was unable to use the proposed word boundary solution, but I guess this depends on my implementation. Thanks — Nicola Vianello, Feb 01 '21 at 10:06
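For reference, here is a minimal sketch of the word-boundary idea from the first comment. The helper name `contains_whole` and the snake_case variable names are made up for illustration and are not from the question. One thing worth checking if `\b` seems not to work: in a normal (non-raw) Python string literal, `'\b'` is a backspace character, so the pattern has to be written as a raw string.

```python
import re

large_string = ('Lorem ipsum is simply dummy text of the printing and '
                'typesetting industry RE-2 and other Lorem ipsum RE-44')
substring_set = ['RE-2', 'RE-44', 'RE-4']


def contains_whole(token, text):
    """Return True if token occurs in text and is not followed by a word character."""
    # re.escape protects any regex metacharacters in the token;
    # the raw-string r'\b' is a regex word boundary, not a backspace.
    pattern = re.escape(token) + r'\b'
    return re.search(pattern, text) is not None


result = {token: contains_whole(token, large_string) for token in substring_set}
print(result)  # {'RE-2': True, 'RE-44': True, 'RE-4': False}
```

Here 'RE-4' is rejected because the only place it occurs is inside 'RE-44', where it is followed by another digit and so there is no boundary.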

dukkee · Accepted Answer · 2021-01-31T18:43:38.077

score 0

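What is wrong with your current code? Each pattern also matches as a prefix of a longer token. Just require that the pattern be followed by a space or by the end of the string: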

>>> import re
>>> large_string = 'Lorem ipsum is simply dummy text of the printing and typesetting industry RE-2 and other Lorem ipsum RE-44'
>>> substring_set = ['RE-2', 'RE-44', 'RE-4']
>>> {pattern: bool(re.search(f"({pattern} )|({pattern}$)", large_string)) for pattern in substring_set}
{'RE-2': True, 'RE-44': True, 'RE-4': False}

edited Jan 31 '21 at 18:43

answered Jan 31 '21 at 18:29


score 0 · Answer 2 · answered Jan 31 '21 at 18:29

A simple idea: add everything you find to a collection, then perform operations either within the loop or on the collection later. Using a set stores each value only once.

import re
LargeString = 'Lorem ipsum is simply dummy text of the printing and typesetting industry RE-2 and other Lorem ipsum RE-44'
substringSet = ['RE-2','RE-44','RE-4']
present=set()
for _c in substringSet:
    _idxDevice = re.findall(_c,LargeString)
    if _idxDevice:          # record the substring only if it actually matched
        present.add(_c)

score 0 · Answer 3 · answered Jan 31 '21 at 19:28

You can combine the substrings into one regular expression that looks for them all but prioritizes the longer substrings.

import re
LargeString = 'Lorem ipsum is simply dummy text of the printing and typesetting industry RE-2 and other Lorem ipsum RE-44'
substringSet = ['RE-2','RE-44','RE-4']           # RE-44 must be placed before RE-4
allSubs = "|".join(map(re.escape,substringSet))  # r'RE\-2|RE\-44|RE\-4'

found = re.findall(allSubs,LargeString)  # ['RE-2', 'RE-44']
for _c in substringSet:
    _idxDevice = [_c]*found.count(_c) # repeated _c, as your original code would give
    print(_idxDevice)

['RE-2']
['RE-44']
[]
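If keeping the list in the right order by hand is a concern, one option is to sort it by length before building the alternation. This is only a sketch of that variation, not part of the answer above; the names `ordered`, `combined` and `counts` are illustrative.

```python
import re

large_string = ('Lorem ipsum is simply dummy text of the printing and '
                'typesetting industry RE-2 and other Lorem ipsum RE-44')
# Deliberately listed with the prefix 'RE-4' before 'RE-44'.
substring_set = ['RE-2', 'RE-4', 'RE-44']

# Longest first, so 'RE-44' is tried before its prefix 'RE-4'.
ordered = sorted(substring_set, key=len, reverse=True)
combined = re.compile('|'.join(map(re.escape, ordered)))

found = combined.findall(large_string)
counts = {s: found.count(s) for s in substring_set}
print(counts)  # {'RE-2': 1, 'RE-4': 0, 'RE-44': 1}
```

The regex alternation tries its branches left to right at each position, which is why putting the longer substrings first keeps 'RE-4' from claiming the start of 'RE-44'.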
