Find any word of a list in the column of dataframe

Question

I have a list of words, negative, that has 4783 elements. I want to use the following code:

tweets3 = tweets2[tweets2['full_text'].str.contains('|'.join(negative))]

But it raises an error like this: error: multiple repeat at position 4193.

I do not understand this error. Apparently, if I use a single word in str.contains such as str.contains("deal") I am able to get results.

All I need is a new dataframe that keeps only those rows of tweets2 whose full_text column contains any of the words in the list.

Optionally, I would also like a column that marks whether a match is present or absent as 1 or 0.

I arrived at using the following code with the help of @wp78de:

tweets2['negative'] = tweets2.loc[tweets2['full_text'].str.contains(r'(?:{})'.format('|'.join(negative)), regex=True, na=False)].copy()

Maybe `.str.contains(r'(?:{})'.format('|'.join(words)), regex=True, na=False)]` — wp78de, Mar 07 '20 at 11:46
finally figured out what went wrong. Out of 4783 elements of negative words some were spelt like `f**k, bull****, ***hole` and these created problems for the `regex` to work. — ambrish dhaka, Mar 08 '20 at 12:01
I hope this helps: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas — wp78de, Mar 09 '20 at 08:43
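The cause described in the comments can be reproduced with the standard library alone. This is a minimal sketch: the three-word list is hypothetical sample data, not the asker's real list, and the exact position in the error message depends on the list contents.

```python
import re

# Hypothetical sample list; "f**k" mirrors the masked entries mentioned in the comments.
negative = ["deal", "f**k", "awful"]

# Joining the raw words makes "**" part of the pattern, which is invalid regex syntax.
try:
    re.compile("|".join(negative))
    compiled = True
except re.error as exc:
    compiled = False
    message = str(exc)  # starts with "multiple repeat at position"

# Escaping each word turns the metacharacters into literal characters.
safe_pattern = "|".join(re.escape(word) for word in negative)
safe_regex = re.compile(safe_pattern)
```

A single plain word such as "deal" contains no metacharacters, which is why str.contains("deal") worked on its own.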

score 1 · Accepted Answer · answered Mar 08 '20 at 13:32

For arbitrary literal strings that may contain regular expression metacharacters, you can use the re.escape() function. Escape each word individually before joining, so that the | separators still act as alternation. Something along these lines should be sufficient:

.str.contains(r'(?:{})'.format('|'.join(map(re.escape, words))), regex=True, na=False)
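Putting the pieces together, here is a minimal sketch of both the filtered dataframe and the 0/1 column the question asks for. The sample rows and the short word list are hypothetical; only the names tweets2, tweets3, negative and full_text come from the question.

```python
import re

import pandas as pd

# Hypothetical sample data standing in for the real tweets and word list.
negative = ["deal", "f**k", "awful"]
tweets2 = pd.DataFrame(
    {"full_text": ["What a great deal", "f**k this", "lovely day", None]}
)

# Escape every word on its own, then join with "|" so it stays an alternation.
pattern = "|".join(re.escape(word) for word in negative)

# na=False treats missing text as "no match" instead of propagating NaN.
mask = tweets2["full_text"].str.contains(pattern, regex=True, na=False)

# 0/1 indicator column on the original dataframe.
tweets2["negative"] = mask.astype(int)

# New dataframe with only the matching rows; .copy() avoids SettingWithCopyWarning
# if tweets3 is modified later.
tweets3 = tweets2[mask].copy()
```

Note that this matches substrings, so "deal" would also match "ideal". Whether that is acceptable depends on the word list.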
