Supposing that I have a pandas.DataFrame and a list of strings:

fruit = ['apple', 'banana', 'cherry' ] # etc

Is there a neat way to select the rows in which the value of a given column contains (has as a substring) one of the strings from the list?

I'm aware of isin, which checks the whole value of the column against a list of possibilities:

df[df['column'].isin(fruit)]

and contains, which matches it against a single string:

df[df['column'].str.contains('apple')]
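To make the difference between the two concrete, here is a minimal sketch on a small made-up DataFrame (the data and the variable names `exact` and `partial` are assumptions for illustration, not part of my real data):

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'column': ['apple', 'apple pie', 'banana split', 'grape']})

# isin: keeps a row only if the whole value equals one of the list items.
exact = df[df['column'].isin(['apple', 'banana'])]

# str.contains: substring match, but against a single string.
# regex=False treats 'apple' as a literal string rather than a pattern.
partial = df[df['column'].str.contains('apple', regex=False)]

print(exact['column'].tolist())
print(partial['column'].tolist())
```

Here `isin` drops 'apple pie' and 'banana split' because they are not exact matches, while `str.contains` finds 'apple pie' but can only look for one string at a time.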

Neither of those seems to be quite what I want. In other languages I could do something with a loop, but I'm quite new to the pandas style of working with data, so I'm unsure how to proceed. The list of strings is also quite large (a couple of thousand items), so compiling it into a regex to match against doesn't seem like a good idea.
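For reference, a minimal sketch of the regex approach mentioned above, on hypothetical data. It is only a baseline under the assumption that the strings should be matched literally; how well it performs with a couple of thousand strings is not measured here.

```python
import re
import pandas as pd

fruit = ['apple', 'banana', 'cherry']

# Hypothetical data, for illustration only.
df = pd.DataFrame({'column': ['apple pie', 'grape', 'banana split', 'sour cherry']})

# Escape each string so regex metacharacters are matched literally,
# then join them into one alternation: apple|banana|cherry
pattern = '|'.join(re.escape(s) for s in fruit)

# na=False treats missing values as non-matches instead of propagating NaN.
matches = df[df['column'].str.contains(pattern, regex=True, na=False)]

print(matches['column'].tolist())
```

The `re.escape` call matters if any string in the list contains characters such as `.` or `+`, which would otherwise be interpreted as regex syntax.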

Scott Martin
  • Ah, are you looking to do substring matching? – pault Sep 12 '18 at 15:55
  • You are right, even regex might be slow here. In particular, the [trie-based Aho-Corasick solution](https://stackoverflow.com/a/48600345/9209546) in the marked duplicate will give huge performance benefits. – jpp Sep 12 '18 at 15:57
  • @pault - Yeah, I am. I've made a note of that in the question to clarify. – Scott Martin Sep 12 '18 at 16:01
  • @jpp - Great! I didn't find that question while I was searching before asking. Thanks. – Scott Martin Sep 12 '18 at 16:02
