I have created a simple DataFrame in Python with these columns:

Columns: [index, bulletintype, category, companyname, date, url] 

I also have a simple list of company names:

companies = ['x', 'y', 'z']

I would like to create a subset of the dataframe containing the rows where the column 'companyname' matches one or more of the names in the companies list.

subset = df[df['companyname'].isin(companies)]

This works well, but .isin requires an exact match and my sources don't use the same names. So I'm looking for an alternative approach that compares on parts of the name. I'm familiar with .str.contains('part of the name'), but I can't use this function with a list. Can somebody help me achieve something like this (but with working code :-)?

subset = df[df['companyname'].contains(companies)]
anky
bsparks

1 Answer

Try creating a regex pattern by joining your companies list with the regex OR character |, then use Series.str.contains to build a boolean mask:

companies = ['x', 'y', 'z']
pat = '|'.join(companies)
subset = df[df['companyname'].str.contains(pat)]
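If the company names can contain regex metacharacters (parentheses, dots, plus signs) or differ in case between sources, a slightly more defensive variant may help. The sketch below is only an illustration: the sample data and the names in companies are made up, and it assumes that case-insensitive substring matching is what you want. It escapes each name with re.escape and treats missing values as non-matches.

```python
import re

import pandas as pd

# Hypothetical sample data; only the 'companyname' column matters here.
df = pd.DataFrame({
    "companyname": ["Acme Corp", "ACME Holdings Ltd", "Globex (EU)", "Initech", None],
    "category": ["a", "b", "c", "d", "e"],
})

# Hypothetical partial names to look for.
companies = ["acme", "Globex (EU)"]

# Escape each name so characters such as '(' and ')' are matched literally,
# then join the escaped names with the regex OR operator.
pat = "|".join(re.escape(name) for name in companies)

# case=False ignores upper/lower case; na=False treats missing names as no match.
mask = df["companyname"].str.contains(pat, case=False, na=False)
subset = df[mask]

print(subset)
```

Note that an empty companies list produces an empty pattern, which matches every non-missing name, so it may be worth checking for that case before filtering.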
Chris Adams