2

I am trying to write a function to return all occurrences of a substring that contains wildcards (each wildcard accounting for only one character) within a longer string. For instance, let's say I have the subject string: aabcddcabaabedcbabaa and my query string is b?d??ab. The expected output would be: ['bcddcab', 'bedcbab']

Looking through other stack overflow posts, I've tried the following:

import fnmatch
subject = "aabcddcabaabedcbabaa"
query = "b?d??ab"
res = fnmatch.filter(subject, query)

but this returns an empty list. What am I doing wrong? Am I actually using the filter function of fnmatch correctly? Thank you in advance

Marius
  • 990
  • 1
  • 14
  • 34

1 Answers1

2
  • The query should be the second argument of filter, not the first
  • filter filters a list of strings by keeping the strings that match your query. filter does not return a list of substrings of a string. If you want to filter the substrings with filter, you first need to build the list of substrings:
import fnmatch
subject = "aabcddcabaabedcbabaa"
query = "b?d??ab"
substrings = fnmatch.filter((subject[i:i+len(query)] for i in range(len(subject) - len(query))), query)
print(substrings)

Output: ['bcddcab', 'bedcbab']

Stef
  • 13,242
  • 2
  • 17
  • 28
  • Ahum. Understood. Thank you for your answer. It works. But, out of curiosity, do you think there is a better/more efficient approach as the actual strings in my data will have a length of several thousand characters? (true that i did not time yet on real data examples) – Marius Sep 11 '20 at 13:15
  • Tough to say without timing it, but, rather than sliding over all the indexes in the string, if your string doesn't start with a wild car, you could start by finding just the possible start locations https://stackoverflow.com/questions/4664850/how-to-find-all-occurrences-of-a-substring – Techniquab Sep 11 '20 at 13:25
  • Yeah, hard to say. It can very well start with a wildcard, but I see what you mean: reduce the search space from the beginning based on the query substring until the first occurrence of the wildcard. I guess that in my case, to be on the safe side, I'll just generate all possible substrings. If i run into speed/memory problems then I'll rethink it. Thanks tho @Techniquab ! – Marius Sep 11 '20 at 13:30
  • 1
    I've chose it as the right answer as it shows the correct use of the `filter` function from `fnmatch`. However, maybe there is a more efficient approach (like using regex) when your data is composed of really long strings – Marius Sep 11 '20 at 14:46
  • I'm not extremely familiar with `fnmatch`, but in general it would be more efficient to work with indices (with a matching function that can be passed a string, a starting index, and a substring length; or with a matching function that explicitly searches for substring, and thuse handles the iteration itself) rather than generate copies of all the substrings with list slices. – Stef Sep 14 '20 at 08:27