4

I am working with Python and I want to match a given string with multiple substrings. I have tried to solve this problem in two different ways. My first solution was to match the substring with the string like:

str = "This is a test string from which I want to match multiple substrings"
value = ["test", "match", "multiple", "ring"]
temp = []
temp.extend([x.upper() for x in value if x.lower() in str.lower()])
print(temp)

which results in temp = ["TEST", "MATCH", "MULTIPLE", "RING"].

However, this is not the result I would like. The substrings should have an exact match, so "ring" should not match with "string".

This is why I tried to solve this problem with regular expressions, like this:

str = "This is a test string from which I want to match multiple substrings"
value = ["test", "match", "multiple", "ring"]
temp = []
temp.extend([
    x.upper() for x in value
    if regex.search(
        r"\b" + regex.escape(x) + r"\b", str, regex.IGNORECASE
    ) is not None
])
print(temp)

which results in ["TEST", "MATCH", "MULTIPLE"], the correct solution.

Be that as it may, this solution takes too long to compute. I have to do this check for roughly 1 million strings and the solution using regex will take days to finish compared to the 1.5 hours it takes using the first solution.

Is there a way to either make the first solution work, or the second solution to run faster?

value can also contain numbers, or a short phrase like "test1 test2".

mkrieger1
  • 19,194
  • 5
  • 54
  • 65
jv3768
  • 322
  • 3
  • 14
  • You can probably save a lot of time by compiling your solution and running the compiled versions over your million strings – jeremycg Feb 01 '19 at 14:23
  • @jeremycg what do you mean exactly by "compiling your solution"? – jv3768 Feb 01 '19 at 14:24
  • `but this does not work when value contains substrings like "test1 test2"`. So if value has a word that is contained in `str`, there sould be a match? – yatu Feb 01 '19 at 14:24
  • by using `re.compile` as mentioned by @Kevin in his answer – jeremycg Feb 01 '19 at 14:26

4 Answers4

6

It's hard to suggest an optimal solution without seeing the actual data, but you can try these things:

  • Generate a single pattern matching all values. This way you would only need to search the string once (instead of once per value).
  • Skip escaping values unless they contain special characters (like '^' or '*').
  • Create the resulting list using a list comprehension instead of calling list.extend() repeatedly.
# 'str' is a built-in function, so use another name instead
string = 'A Test test string from which I want to match multiple substrings'
values = ['test', 'test2', 'Multiple', 'ring', 'match']
pattern = r'\b({})\b'.format('|'.join(map(re.escape, values)))

# unique matches, uppercased
matches = set(map(str.upper, re.findall(pattern, string, regex.IGNORECASE)))

# arrange the results as they appear in `values`
matched_values = [v for value in values if (v := value.upper()) in matches]
print(matched_values)  # ['TEST', 'MULTIPLE', 'MATCH']
Eugene Yarmash
  • 142,882
  • 41
  • 325
  • 378
  • Thank you for your answer. It helped me solve my problem. I only had to adjust `if x in matches` to `if x.lower() in matches` and do the same for `string` to be case insensitive. Also I used `temp.extend()` because I first appended something to `temp` before I do the matching. – jv3768 Feb 02 '19 at 13:04
  • I have used this code for quite some time now, but I have unfortunately found a problem. If I have `string = "test1 test2"` and `values = ['test1', 'test1 test2']`, it needs to match both instances. Right now I only get the first match. Do you maybe have an idea how I can fix this? – jv3768 Feb 16 '19 at 12:33
  • @jv3768 Unfortunately, it's not possible with a single search, as the regex engine returns the first valid match skipping other alternatives. You could try to analyze the input and filter values that are substrings of other values, then just include them in the results. – Eugene Yarmash Feb 17 '19 at 09:18
3

Two possible optimizations come to mind:

  • precompile patterns with re.compile so it doesn't recompile every time you call match.
  • rather than matching against four independent regexes, create one regex that matches all of your values.

 

import re

str = "This is a test string from which I want to match test1 test2 multiple substrings"
values = ["test", "match", "multiple", "ring", "test1 test2"]

pattern = re.compile("|".join(r"\b" + re.escape(x) + r"\b" for x in values))
temp = []

temp.extend([x.upper() for x in pattern.findall(str, re.IGNORECASE)])
print(temp)

Result:

['TEST', 'MATCH', 'TEST1 TEST2', 'MULTIPLE']

Potential drawbacks to this approach:

  • The output will possibly be in a different order. Your original approach puts results in the order they appear in values. This approach puts results in the order they appear in str.
  • the same value will appear multiple times in temp if it appeared multiple times in str. As opposed to your original approach, where the value appears at most once in temp.
  • search terminates as soon as it finds a match. findall always searches the entire string. If you expect most of your strings to match every word in value, and expect most matches to appear early on in the string, then findall may be slower than search. On the other hand, if you expect search to often turn up None, then findall will likely be somewhat faster.
Kevin
  • 74,910
  • 12
  • 133
  • 166
  • 'Compiling' the pattern won't provide any benefits here as you only use it once. Even if it was used multiple times, the [difference](https://stackoverflow.com/q/452104/244297) would most likely be negligible. – Eugene Yarmash Feb 01 '19 at 15:09
  • 1
    My interpretation of the actual problem is that there are millions of strings which must be searched through, rather than just the one present in the sample code. Imagine that there is a `for str in millions_of_strings:` line just after my `temp = []`. – Kevin Feb 01 '19 at 15:15
  • Thank you for your answer, but your first potential drawback is a real drawback for my code. I want to prioritze the first match over all other matches. I think the answer from Eugene will solve my problem – jv3768 Feb 01 '19 at 15:37
  • @jv3768 As far as I can tell, Eugene's answer will also have that drawback, since it also uses findall. – Kevin Feb 01 '19 at 15:43
  • 1
    @Kevin: One can sort the results based on their idx in `value` (I've added that step). – Eugene Yarmash Feb 01 '19 at 15:55
0

You could split the string by space and then match the elements from value with ==.

You said that some strings in values can have space before or after them. You can resolve that with this line:

values = [i.strip() for i in values]

That will remove all of the whitespace characters before and after the string (in your case for each element).

Furthermore, you mentioned that if you split the string by space, some words have punctuations left over from splitting, e.g. 'Hi, how are you?' would result in ['Hi,', 'how', 'are', 'you?']. You can resolve this problem by using the string startswith() method to filter all the words starting with elements from values like this:

str = ['Hi,', 'how', 'are', 'you?']`
values = ['how', 'you', 'time', 'space']

new_str = []
for word in str:
  for j in values:
    if word.startswith(j):
      new_str.append(word)

# result -> ['how', 'you?']

Then you can remove punctuations from the resulting list with some regex, but now you will have a lot smaller list to iterate through. After you remove all of the punctuation characters, then you can match the whole strings as I suggested.

mkrieger1
  • 19,194
  • 5
  • 54
  • 65
Novak
  • 2,143
  • 1
  • 12
  • 22
  • Doesn't work if an element of `value` contains a space, or if the sentence contains punctuation next to a word that would otherwise be matched. – Kevin Feb 01 '19 at 14:25
  • well then you can do `value = [i.strip() for i in value]` and filter items from `str` using string method `startswith()` and then strip punctuations from filtered list and see the result – Novak Feb 01 '19 at 14:52
  • Sorry, I'm not sure I understand. Can you edit some complete code into your answer and demonstrate how it works? – Kevin Feb 01 '19 at 14:55
  • Thanks for your edit, but I think I'm still missing something. When I run that code block, I get `NameError: name 'i' is not defined`. – Kevin Feb 01 '19 at 15:43
  • "So you said that some strings in values can have space before or after them." I don't think that's what he's saying. I think he's saying that the values can have a space in them, not necessarily at the beginning or end. "test1 test2", for instance, contains a space. And it would be wrong to remove it, since "test1test2" should not be matched, and it would be wrong to split it into multiple elements, since "test1" should not be matched unless it comes immediately before a space followed by "test2". – Kevin Feb 01 '19 at 16:16
  • Thanks for the clarification of the problem, I didn't realize it was the case. Regrading your comment about the error, I fixed it but it was clearly the naming error, you could've figured it out on your own. – Novak Feb 04 '19 at 07:30
-1

It takes about 3 seconds for 1 million executions of the 'statement' variable on my laptop using the following code:

from timeit import timeit
import re

# I inserted punctuation to cover more situations
string = """
This, is a ; test string from which I want to match multiple substrings
"""
substrings = ['test', 'match', 'multiple', 'ring']

statement = """
string_normalized = (re.sub(r'[^\w\s]', '', string)
                       .strip()
                       .lower()
                       .split(' '))
result = list(filter(lambda x: x in string_normalized, substrings))
"""

if __name__ == '__main__':
    time_it_takes = timeit(stmt=statement, globals=globals(), number=1_000_000)
    print(time_it_takes)
tudor38
  • 1
  • 1
  • 1
    As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Mar 20 '22 at 02:19