This doesn't necessarily need to be done in pandas, but it would be nice if it could be.

Say I have a list or Series of strings:

['XXY8779','0060-19','McChicken','456728']

And I have another list or Series which contains sub-strings of the original like so:

['60-19','Chicken','8779','1124231','92871','johnson']

For each string in the first list, I want to know whether it ends with one of the sub-strings, so the result would be something like:

[True, True, True, False]

I'm looking for a match that is something like:

^[a-zA-Z0-9.,$;]+ < matching string in other list >

In other words, the string starts with one or more characters, and the rest matches exactly one of the strings in my other list.

Does anyone have any ideas on the best way to accomplish this?

Thanks!

doddy
  • Will the matches always be with the end of the strings in your first list? – ALollz Jun 28 '18 at 14:14
  • Related: the standard Pandas algorithm is not particularly efficient. If you need performance, consider a trie-based method, e.g. [this Aho-Corasick solution](https://stackoverflow.com/a/48600345/9209546). – jpp Jun 28 '18 at 14:19
  • @ALollz yes, always at the end. – doddy Jun 28 '18 at 14:21
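Since the matches are always at the end, the trie idea from the comments can be sketched without a full Aho-Corasick automaton. The following is a minimal illustration, not the linked solution: it stores the reversed sub-strings in a nested-dict trie and walks each string backwards. The names `build_suffix_trie`, `ends_with_any`, and `END` are hypothetical helpers made up for this example.

```python
END = ''  # sentinel key; cannot collide with a single-character key


def build_suffix_trie(suffixes):
    """Insert every suffix reversed, so a walk from the end of a string follows the trie."""
    root = {}
    for suffix in suffixes:
        node = root
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node[END] = True  # mark the end of a complete suffix
    return root


def ends_with_any(text, trie):
    """Return True if text ends with any suffix stored in the trie."""
    node = trie
    for ch in reversed(text):
        if END in node:       # a complete suffix has already been consumed
            return True
        node = node.get(ch)
        if node is None:      # no stored suffix continues with this character
            return False
    return END in node        # text was consumed entirely; is it a full suffix?


s1 = ['XXY8779', '0060-19', 'McChicken', '456728']
s2 = ['60-19', 'Chicken', '8779', '1124231', '92871', 'johnson']

trie = build_suffix_trie(s2)
result = [ends_with_any(s, trie) for s in s1]
print(result)  # [True, True, True, False]
```

Each lookup costs at most the length of the string being tested, regardless of how many sub-strings are stored, which is where the gain over trying every sub-string in turn would come from.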

2 Answers

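Use str.contains.

'|'.join(s2) builds a regex alternation ('60-19|Chicken|8779|...'). Because str.contains treats its pattern as a regex by default, it returns True when any of the sub-strings occurs anywhere in the string.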

import pandas as pd

s1 = pd.Series(['XXY8779', '0060-19', 'McChicken', '456728'])

s2 = ['60-19', 'Chicken', '8779', '1124231', '92871', 'johnson']

s1.str.contains('|'.join(s2))

0     True
1     True
2     True
3    False
dtype: bool
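If the sub-strings may contain regex metacharacters, or the match has to sit at the end of the string with at least one character in front of it (as the question describes), the pattern can be tightened. The following is a sketch under those assumptions, not part of the original answer; it uses `.+` for the leading characters rather than the exact character class from the question.

```python
import re

import pandas as pd

s1 = pd.Series(['XXY8779', '0060-19', 'McChicken', '456728'])
s2 = ['60-19', 'Chicken', '8779', '1124231', '92871', 'johnson']

# Escape each sub-string so characters such as '.' or '$' are matched literally.
alternatives = '|'.join(re.escape(s) for s in s2)

# '.+' requires at least one leading character; '$' pins the match to the end.
# The group is non-capturing so pandas does not warn about match groups.
pattern = f'.+(?:{alternatives})$'

mask = s1.str.contains(pattern, regex=True)
print(mask.tolist())  # [True, True, True, False]
```

With this pattern a string that is exactly equal to one of the sub-strings, such as 'Chicken', does not match, because nothing precedes the sub-string.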
piRSquared

Since the match is always at the end, you can use Python's built-in str.endswith together with any, which short-circuits on the first match. s1 and s2 are just your lists above (it also works if they are pd.Series).

[any(i.endswith(j) for j in s2) for i in s1]
#[True, True, True, False]

You can then convert it to a series with pd.Series or just use that list as a mask as-is.
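As a variation on the same idea, Python's `str.endswith` also accepts a tuple of suffixes, which removes the inner loop. Here is a small sketch of that, together with the mask usage described above; the variable names `suffixes`, `mask`, and `matched` are chosen only for illustration.

```python
import pandas as pd

s1 = pd.Series(['XXY8779', '0060-19', 'McChicken', '456728'])
s2 = ['60-19', 'Chicken', '8779', '1124231', '92871', 'johnson']

# str.endswith takes a tuple and returns True if any element is a suffix.
suffixes = tuple(s2)

# Build a boolean Series aligned with s1 so it can be used as a mask.
mask = pd.Series([s.endswith(suffixes) for s in s1], index=s1.index)

# Keep only the strings that end with one of the sub-strings.
matched = s1[mask]
print(matched.tolist())  # ['XXY8779', '0060-19', 'McChicken']
```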

ALollz