
I have a dataframe of IDs and names (2 columns x 1.5e6 rows). Separately, I have a long list of vulgar words kept in a .txt file (it needs to be stored in a central location and is constantly updated).

Essentially, I am trying to 'match' the dataframe of names against the vulgar word list. I want to create a new variable on the dataframe ('vulgar_flag') and set it to 0 or 1 depending on whether any word from the vulgar list (.txt file) is a substring of the name in the dataframe.

Currently, my approach is to read in the vulgar .txt file and create a list of words called vulgar_scrub. I then have the following code to create the flag:

`df['vulgar_flag'] = numpy.where(df.FULLNAME.str.contains('|'.join(vulgar_scrub)), 1, 0)`
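For reference, here is a self-contained sketch of that approach (the sample names and word list are placeholders for the real data). One caveat worth noting: a plain `'|'.join` will misbehave if any word in the .txt file contains regex metacharacters, so escaping each word with `re.escape` is safer:

```python
import re

import numpy as np
import pandas as pd

# Placeholder stand-ins for the real 1.5M-row dataframe and the central .txt file
df = pd.DataFrame({'ID': [1, 2, 3],
                   'FULLNAME': ['John Smith', 'Jane Doe', 'Bad Word Guy']})
vulgar_scrub = ['bad word', 'worse']  # would be read from the .txt file

# Escape each word so regex metacharacters in the list can't break the pattern
pattern = '|'.join(re.escape(word) for word in vulgar_scrub)

# case=False makes the substring match case-insensitive
df['vulgar_flag'] = np.where(df['FULLNAME'].str.contains(pattern, case=False), 1, 0)
print(df['vulgar_flag'].tolist())  # → [0, 0, 1]
```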

This seems clunky, and I'm wondering if there are any more efficient alternatives. This post (Pandas lookup, mapping one column in a dataframe to another in a different dataframe) mentions using `df.merge`, although I'm not sure that would support the substring matching I'm looking for.

Mainly just curious to see if there are other solutions, or any dataframe functionality I'm unaware of. Thanks!
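For comparison, one regex-free alternative (a sketch, reusing the same `df`/`vulgar_scrub` names with placeholder data): loop over the word list with `regex=False` and OR the per-word boolean masks together. This sidesteps regex escaping entirely at the cost of one pass over the column per word:

```python
import numpy as np
import pandas as pd

# Placeholder data; the real inputs come from the dataframe and the .txt file
df = pd.DataFrame({'FULLNAME': ['John Smith', 'Jane Doe', 'Bad Word Guy']})
vulgar_scrub = ['bad word', 'worse']

# Lowercase once so each per-word check is case-insensitive
lowered = df['FULLNAME'].str.lower()

# regex=False means plain substring search; OR the masks word by word
mask = np.zeros(len(df), dtype=bool)
for word in vulgar_scrub:
    mask |= lowered.str.contains(word, regex=False).to_numpy()

df['vulgar_flag'] = mask.astype(int)
print(df['vulgar_flag'].tolist())  # → [0, 0, 1]
```

Whether this beats the single big alternation depends on the length of the word list; for a very large list, the trie-based approach linked in the comments is the usual next step.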

Le Chase
    maybe just do `df['vulgar_flag']=df.FULLNAME.isin(vulgar_scrub).astype(int)` – BENY Oct 30 '18 at 22:02
    Possible duplicate of [Pandas filtering for multiple substrings in series](https://stackoverflow.com/questions/48541444/pandas-filtering-for-multiple-substrings-in-series). In particular, for performance see [this answer](https://stackoverflow.com/a/48600345/9209546) for a trie-based method. – jpp Oct 30 '18 at 22:18

0 Answers