
Hopefully this question hasn't already been answered (I looked but could not find one).

I have a list of partial strings:

date_parts = ['/Year', '/Month', '/Day', ...]  # etc.

and I have a string. E.g.

string1 = "Tag01/Source 01/Start/Year"

or

string1 = "Tag01/Source 01/Volume"

What is the most efficient way, apart from using a for loop, to check if any of the date_parts strings are contained within the string?

For info, string1 is in reality another list of many strings, and I would like to remove every string in it that contains any of the strings in date_parts.

blhsing
njminchin
  • Possible duplicate of [Check if multiple strings exist in another string](https://stackoverflow.com/questions/3389574/check-if-multiple-strings-exist-in-another-string) – blhsing Feb 28 '19 at 05:06
    Thanks blhsing, I actually saw that one but the title didn't draw me in. Looks like it's got the 'any' answer, however I like the regex answer as well :) – njminchin Feb 28 '19 at 08:35

2 Answers


You can use the built-in any() function with a generator expression. It stops at the first match and is more concise than an explicit for loop, though any speed difference is likely to be small.

For one string, you can test like this:

any(p in string1 for p in date_parts)

If strings is a list of many strings you want to check, you could do this:

unmatched = [s for s in strings if not any(p in s for p in date_parts)]

or

unmatched = [s for s in strings if all(p not in s for p in date_parts)]
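Here is a self-contained sketch of that filtering. The sample `strings` list is made up for illustration (only the first two entries come from the question).

```python
date_parts = ['/Year', '/Month', '/Day']

# Hypothetical sample data; the third entry is not from the question.
strings = [
    "Tag01/Source 01/Start/Year",
    "Tag01/Source 01/Volume",
    "Tag02/Source 03/End/Month",
]

# Keep only the strings that contain none of the date parts.
# any() stops at the first part that matches, so a string is only
# checked against the whole list when nothing matches.
unmatched = [s for s in strings if not any(p in s for p in date_parts)]
print(unmatched)
```

The `all(p not in s ...)` form gives the same result; which one reads better is mostly a matter of taste.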
Matthias Fripp
  • thanks for the reply! I would mark this as an answer along with glich's but only one answer.. the any method was marked as the answer in the post blhsing linked, so I gave the answer to regex in mine. cheers! – njminchin Feb 28 '19 at 08:38
  • No worries! If you really care about speed, you could try the [Aho-Corasick Algorithm](http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm) mentioned in the duplicate question thread. I’d be interested to see how it compares to regex. Both should be faster than `any`, but sometimes simplicity is better than speed. – Matthias Fripp Mar 01 '19 at 04:05

Compile a regex from the partial strings. Use re.escape() in case they contain characters that have a special meaning in the regex language (such as '.', '*' or '+').

import re
date_parts = ['/Year', '/Month', '/Day']
pattern = re.compile('|'.join(re.escape(s) for s in date_parts))

Then use re.search() to see if it matches. It returns a match object if any of the parts is found, and None otherwise.

string1 = "Tag01/Source 01/Start/Year"
re.search(pattern, string1)

The regex engine is probably faster than a native Python loop.
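To filter a whole list this way, something like the following should work. It is a minimal sketch; the sample `strings` list and the name `kept` are assumptions for illustration.

```python
import re

date_parts = ['/Year', '/Month', '/Day']

# One alternation of the escaped parts: /Year|/Month|/Day
pattern = re.compile('|'.join(re.escape(s) for s in date_parts))

# Hypothetical sample data; the third entry is not from the question.
strings = [
    "Tag01/Source 01/Start/Year",
    "Tag01/Source 01/Volume",
    "Tag02/Source 03/End/Day",
]

# pattern.search() returns None when no part occurs in the string,
# so those are the strings to keep.
kept = [s for s in strings if pattern.search(s) is None]
print(kept)
```

Compiling once and calling `pattern.search()` avoids re-resolving the pattern on every call, although `re.search(pattern, s)` gives the same result.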


For your particular use case, consider concatenating all the strings, like

all_string = '\n'.join(strings+[''])

Then you can do them all at once in a single call to the regex engine.

pattern = '|'.join(f'.*{re.escape(s)}.*\n' for s in date_parts)
strings = re.sub(pattern, '', all_string).split('\n')[:-1]

Of course, this assumes that none of your strings has a '\n'. You could pick some other character that's not in your strings to join and split on if necessary. '\f', for example, should be pretty rare. Here's how you might do it with '@'.

all_string = '@'.join(strings+[''])
pattern = '|'.join(f'[^@]*{re.escape(s)}[^@]*@' for s in date_parts)
strings = re.sub(pattern, '', all_string).split('@')[:-1]
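The join-and-split idea could be wrapped up roughly like this. The function name `remove_matching` and its `sep` parameter are hypothetical, and the pattern is a variation on the one above: it puts the alternation in a single group instead of repeating the wildcards for each part. It also checks that the separator does not occur in the data, since the approach silently breaks if it does.

```python
import re

def remove_matching(strings, parts, sep='\n'):
    """Return the strings that contain none of parts, using one re.sub call.

    Assumption: sep is a single character that appears in none of the inputs.
    """
    if not parts:
        return list(strings)
    if any(sep in s for s in strings) or any(sep in p for p in parts):
        raise ValueError("separator occurs in the data; pick another one")

    not_sep = f'[^{re.escape(sep)}]*'          # any run of non-separator characters
    alternatives = '|'.join(re.escape(p) for p in parts)
    # A whole "record": text, one of the parts, more text, then the separator.
    pattern = re.compile(f'{not_sep}(?:{alternatives}){not_sep}{re.escape(sep)}')

    # The trailing '' makes every string end with sep, so each record
    # can be deleted together with its separator.
    joined = sep.join(list(strings) + [''])
    return pattern.sub('', joined).split(sep)[:-1]


date_parts = ['/Year', '/Month', '/Day']
strings = ["Tag01/Source 01/Start/Year", "Tag01/Source 01/Volume"]
print(remove_matching(strings, date_parts))
print(remove_matching(strings, date_parts, sep='@'))
```

Whether this actually beats the per-string search depends on the data, so it is worth timing both on a realistic list before committing to it.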

If that's still not fast enough, you could try a faster regex engine, like rure.

gilch