python - find all occurrences of substring with wildcards in string

Question

I am trying to write a function to return all occurrences of a substring that contains wildcards (each wildcard accounting for only one character) within a longer string. For instance, let's say I have the subject string: aabcddcabaabedcbabaa and my query string is b?d??ab. The expected output would be: ['bcddcab', 'bedcbab']

Looking through other stack overflow posts, I've tried the following:

import fnmatch
subject = "aabcddcabaabedcbabaa"
query = "b?d??ab"
res = fnmatch.filter(subject, query)

but this returns an empty list. What am I doing wrong? Am I actually using the filter function of fnmatch correctly? Thank you in advance

See https://docs.python.org/3/library/fnmatch.html#fnmatch.filter. Also, globs need to match the whole string, so you need something like `query = "*b?d??ab*"` — Wiktor Stribiżew, Sep 11 '20 at 13:03
@Techniquab ah yes. sorry. was a mistake while writing the question. in my code they are in the right order. updated the question — Marius, Sep 11 '20 at 13:10
@WiktorStribiżew thanks for your comment. it doesn't work though — Marius, Sep 11 '20 at 13:12

score 2 · Accepted Answer · answered Sep 11 '20 at 13:08

2

The query should be the second argument of filter, not the first
filter filters a list of strings by keeping the strings that match your query. filter does not return a list of substrings of a string. If you want to filter the substrings with filter, you first need to build the list of substrings:

import fnmatch
subject = "aabcddcabaabedcbabaa"
query = "b?d??ab"
substrings = fnmatch.filter((subject[i:i+len(query)] for i in range(len(subject) - len(query))), query)
print(substrings)

Output: ['bcddcab', 'bedcbab']

answered Sep 11 '20 at 13:08

Stef

13,242
2
17
28

Ahum. Understood. Thank you for your answer. It works. But, out of curiosity, do you think there is a better/more efficient approach as the actual strings in my data will have a length of several thousand characters? (true that i did not time yet on real data examples) – Marius Sep 11 '20 at 13:15
Tough to say without timing it, but, rather than sliding over all the indexes in the string, if your string doesn't start with a wild car, you could start by finding just the possible start locations https://stackoverflow.com/questions/4664850/how-to-find-all-occurrences-of-a-substring – Techniquab Sep 11 '20 at 13:25
Yeah, hard to say. It can very well start with a wildcard, but I see what you mean: reduce the search space from the beginning based on the query substring until the first occurrence of the wildcard. I guess that in my case, to be on the safe side, I'll just generate all possible substrings. If i run into speed/memory problems then I'll rethink it. Thanks tho @Techniquab ! – Marius Sep 11 '20 at 13:30
1

I've chose it as the right answer as it shows the correct use of the `filter` function from `fnmatch`. However, maybe there is a more efficient approach (like using regex) when your data is composed of really long strings – Marius Sep 11 '20 at 14:46
I'm not extremely familiar with `fnmatch`, but in general it would be more efficient to work with indices (with a matching function that can be passed a string, a starting index, and a substring length; or with a matching function that explicitly searches for substring, and thuse handles the iteration itself) rather than generate copies of all the substrings with list slices. – Stef Sep 14 '20 at 08:27

python - find all occurrences of substring with wildcards in string

1 Answers1