I have a dataframe of IDs and names (2 x 1.5e6). Separately, I have a long list of vulgar words, kept in a .txt file (needs to be stored in a central location and is constantly updated).
Essentially, I am trying to 'match' the dataframe of names to the vulgar word list. I hope to create a new variable on the dataframe ('vulgar_flag'), and flag as a 0 or 1 depending on whether any of the words from the vulgar list (.txt file) are a substring of the name in the dataframe.
Currently, my approach is to read in the vulgar .txt file and create a list of words called vulgar_scrub
. I then have the following code to create the flag:
df['vulgar_flag'] = numpy.where(df.FULLNAME.str.contains('|'.join(vulgar_scrub)),1,0)
This seems clunky, and I'm wondering if there are any more efficient alternatives. This post (Pandas lookup, mapping one column in a dataframe to another in a different dataframe) mentions using df.merge
although I'm not sure that would support checking for substrings as I am looking for.
Mainly just curious to see if there are other solutions, or any dataframe functionality I'm unaware of. Thanks!