
I'm trying to verify the presence of each of a series of given substrings within a larger string in Python 3.7 using regular expressions. The problem is similar to this:

import re
LargeString = 'Lorem ipsum is simply dummy text of the printing and typesetting industry RE-2 and other Lorem ipsum RE-44'
substringSet = ['RE-2','RE-44','RE-4']
for _c in substringSet:
    _idxDevice = re.findall(_c,LargeString)

Clearly, written this way the regular expression gives a positive answer for all the elements of substringSet, because RE-4 also matches the first four characters of RE-44. I would like to be able to distinguish the fact that RE-44 is present in LargeString whereas RE-4 is not. Any idea?

Nicola Vianello
  • Add a word boundary `\b` at the end of each pattern. (read a basic tutorial about regex patterns). – Casimir et Hippolyte Jan 31 '21 at 18:43
  • If you just need to check the presence of every item of your substringSet, then see my variant in the answers. If not, please add more details. – dukkee Jan 31 '21 at 18:46
  • @dukkee your solution works great. Actually I was unable to use the proposed word boundary solution, but I guess this depends on my implementation. Thanks – Nicola Vianello Feb 01 '21 at 10:06
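
For reference, the word boundary suggestion from the comments could look like the sketch below. The helper name `is_present` is made up for illustration. One common reason `\b` appears not to work is writing it in a normal string, where `'\b'` is a backspace character; whether that was the issue here is only a guess. The sketch uses a raw f-string and `re.escape`, and it assumes every substring starts and ends with a letter, digit or underscore, since `\b` only marks the edge of a word character.

```python
import re

large_string = ('Lorem ipsum is simply dummy text of the printing and '
                'typesetting industry RE-2 and other Lorem ipsum RE-44')
substring_set = ['RE-2', 'RE-44', 'RE-4']

def is_present(substring, text):
    # Raw f-string: \b reaches the regex engine as a word boundary,
    # not as the backspace character it would be in a normal string.
    pattern = rf"\b{re.escape(substring)}\b"
    return re.search(pattern, text) is not None

presence = {s: is_present(s, large_string) for s in substring_set}
print(presence)  # {'RE-2': True, 'RE-44': True, 'RE-4': False}
```

Here RE-4 is rejected because the character after it in RE-44 is another digit, so there is no boundary at that position.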

3 Answers


What is wrong with your current code? Just require each pattern to be followed by either a space or the end of the string:

>>> import re
>>> large_string = 'Lorem ipsum is simply dummy text of the printing and typesetting industry RE-2 and other Lorem ipsum RE-44'
>>> substring_set = ['RE-2', 'RE-44', 'RE-4']
>>> {pattern: bool(re.search(f"({pattern} )|({pattern}$)", large_string)) for pattern in substring_set}
{'RE-2': True, 'RE-44': True, 'RE-4': False}
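
A possible variant of the same idea, offered only as a sketch: instead of repeating the pattern twice, a negative lookahead `(?!\S)` accepts a match only when the next character is whitespace or the end of the string. This is slightly broader than the original, since it also accepts tabs and newlines. The `re.escape` call is an added precaution in case a substring contains regex metacharacters.

```python
import re

large_string = ('Lorem ipsum is simply dummy text of the printing and '
                'typesetting industry RE-2 and other Lorem ipsum RE-44')
substring_set = ['RE-2', 'RE-44', 'RE-4']

# (?!\S) succeeds only if the next character is NOT a non-whitespace
# character, i.e. it is whitespace or the end of the string.
result = {
    pattern: bool(re.search(re.escape(pattern) + r"(?!\S)", large_string))
    for pattern in substring_set
}
print(result)  # {'RE-2': True, 'RE-44': True, 'RE-4': False}
```

Note that, like the original, this rejects a substring that is directly followed by punctuation such as a comma or full stop.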
dukkee

A simple idea: add everything you find to a set, then perform operations either within the loop or on the set afterwards. Using a set stores each value only once.

import re
LargeString = 'Lorem ipsum is simply dummy text of the printing and typesetting industry RE-2 and other Lorem ipsum RE-44'
substringSet = ['RE-2','RE-44','RE-4']
present=set()
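for _c in substringSet:
    _idxDevice = re.findall(_c,LargeString)
    # record the substring only when it was actually found;
    # note that the bare pattern 'RE-4' still matches inside 'RE-44'
    if _idxDevice:
        present.add(_c)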
Pratik Agrawal

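You can combine the substrings into one regular expression that looks for them all but prioritizes the longer substrings. Alternatives are tried from left to right, so a longer substring has to be listed before any shorter substring that is its prefix.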

import re
LargeString = 'Lorem ipsum is simply dummy text of the printing and typesetting industry RE-2 and other Lorem ipsum RE-44'
substringSet = ['RE-2','RE-44','RE-4']           # RE-44 must be placed before RE-4
allSubs = "|".join(map(re.escape,substringSet))  # r'RE\-2|RE\-44|RE\-4'

found = [*re.findall(allSubs,LargeString)]  # ['RE-2', 'RE-44']
for _c in substringSet:
    _idxDevice = [_c]*found.count(_c) # repeated _c, as your original code would produce
    print(_idxDevice)

['RE-2']
['RE-44']
[]
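
If you would rather not depend on the order in which the list is written, one option is to sort the substrings by length before joining them. This is only a sketch: the names `ordered`, `combined` and `result` are made up here, and `collections.Counter` is used in place of repeated `found.count(...)` calls.

```python
import re
from collections import Counter

large_string = ('Lorem ipsum is simply dummy text of the printing and '
                'typesetting industry RE-2 and other Lorem ipsum RE-44')
substrings = ['RE-2', 'RE-4', 'RE-44']  # deliberately not ordered by length

# Longest first, so 'RE-44' is tried before its prefix 'RE-4'.
ordered = sorted(substrings, key=len, reverse=True)
combined = re.compile("|".join(map(re.escape, ordered)))

# Count how many times each listed substring was the one that matched.
counts = Counter(combined.findall(large_string))
result = {s: counts[s] for s in substrings}
print(result)  # {'RE-2': 1, 'RE-4': 0, 'RE-44': 1}
```

Keep in mind that this only resolves overlaps between the listed substrings. A string such as RE-22, which is not in the list, would still be counted as a match for RE-2.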
Alain T.