Consider the code:

df = pd.read_csv('...csv')
array = [.....,....,....]
results = df[df.Message.isin(array).fillna(False)]

Each value in the Message column contains more than one word.

How can we select all rows where at least one of the words in "Message" is in the array?

Example:

Client  Message                             City      Phone 
 
Jackson I will back soon                    Rome      1111
Cole    Please try to be patient            Cairo     2222 
Rains   Sure anything you want , anything   Paris     3333  

Array = ['try', 'anything', 'patient']

Result:

Cole    Please try to be patient            Cairo     2222 
Rains   Sure anything you want , anything   Paris     3333  
JAN

3 Answers

1

Maybe something like this (in a single line without loops):

import pandas as pd

data = [['Client','Message','City','Phone'],
['Jackson','I will back soon','Rome',1111],
['Cole','Please try to be patient','Cairo',2222 ],
['Rains','Sure anything you want , anything','Paris',3333  ]]

Array = ['try', 'anything', 'patient']

df = pd.DataFrame(data[1:], columns=data[0])

print (df[df['Message'].str.contains('|'.join(Array))])

Inspired by "How to test if a string contains one of the substrings in a list, in pandas?" and "Select by partial string from a pandas DataFrame".
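Note that joining the words with '|' matches substrings, so 'try' would also match 'country'. If whole words are required, as the question suggests, one possible variant wraps the alternatives in word boundaries. This is only a sketch: the extra "Smith" row is a hypothetical example added to show the difference, and re.escape is used on the assumption that the words may contain regex metacharacters.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Client": ["Jackson", "Cole", "Rains", "Smith"],
    "Message": [
        "I will back soon",
        "Please try to be patient",
        "Sure anything you want , anything",
        "Our country house",  # hypothetical row: contains 'try' only as a substring
    ],
})
words = ["try", "anything", "patient"]

# \b anchors each alternative at word boundaries; re.escape protects
# words that contain regex metacharacters. The group is non-capturing.
pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"

# na=False treats missing messages as non-matches
mask = df["Message"].str.contains(pattern, regex=True, na=False)
whole_word_matches = df[mask]
print(whole_word_matches)
```

With the plain '|'.join(words) pattern the "Smith" row would be selected as well, because 'country' contains 'try'.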

DSteman
1

Something like this would solve your problem:

Array = ['try', 'anything', 'patient']

def find_words(element):
     for word in Array:
         if word in element:
             return True
     return False

results = df[df["Message"].apply(find_words)]

Edit:

In addition to the above, we can also do it in one line. It is not the most elegant solution, but it works :)

Array = ['try', 'anything', 'patient']

# any() is True as soon as one word from Array occurs in the message
results = df[df["Message"].apply(lambda x: any(word in x for word in Array))]
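Both versions above test with `in` on the raw string, which is a substring check and raises a TypeError on missing values. A possible alternative, sketched here under the assumption that words are separated by whitespace, splits each message into tokens and checks them against a set. The helper name has_wanted_word and the "Doe" row with a missing message are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Client": ["Jackson", "Cole", "Rains", "Doe"],
    "Message": [
        "I will back soon",
        "Please try to be patient",
        "Sure anything you want , anything",
        None,  # hypothetical row with a missing message
    ],
})
wanted = {"try", "anything", "patient"}


def has_wanted_word(message):
    # Missing or non-string cells never match
    if not isinstance(message, str):
        return False
    # split() tokenizes on whitespace, so only whole tokens can match
    return not wanted.isdisjoint(message.split())


token_matches = df[df["Message"].apply(has_wanted_word)]
print(token_matches)
```

One limitation of this sketch: punctuation attached to a word (for example "anything,") makes the token differ from the wanted word, so such text would need to be cleaned first.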
Animikh Aich
1
import string

import numpy as np
import pandas as pd

def generate_sample_data(words_to_be_detected):
  ## Generate random words of 2 to 4 lowercase letters
  letters = list(string.ascii_lowercase)
  num_random_words = 20
  random_words = [''.join(np.random.choice(letters, np.random.randint(2,5)))
                  for _ in range(np.random.randint(num_random_words))]

  ## Combine lists
  list_of_possible_words = random_words + words_to_be_detected

  ## Define number of rows to be generated
  df_num_rows = 5

  ## Generate sample data: each Message is a list of 0 to 4 words
  data_sample_dict = [
      {"Message": list(np.random.choice(list_of_possible_words, np.random.randint(0,5)))}
      for _ in range(df_num_rows)
  ]

  return pd.DataFrame(data_sample_dict)


## Define words to be found in the Series
words_to_be_detected = ['XXXX', 'YYYY']

## Generate synthetic data
df = generate_sample_data(words_to_be_detected)

## Convert each list to its string form, then test for any of the words
contains_str_mask = df.Message.astype(str).str.contains('|'.join(words_to_be_detected))

print(df)

print('only the ones we are looking for')
print(df[contains_str_mask])

In summary, the key line is:

df.Message.astype(str).str.contains('|'.join(words_to_be_detected))
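When Message already holds lists of words, as in the generated data above, converting to str and matching a regex is one option. Another possibility is to test list membership directly with explode and isin. The following is a minimal sketch with hand-written sample rows (not the random data above), and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hand-written sample: each Message is a list of words
df = pd.DataFrame({"Message": [["ab", "XXXX"], [], ["cd", "ef"], ["YYYY"]]})
words_to_be_detected = ["XXXX", "YYYY"]

# explode() gives one row per word and keeps the original index;
# empty lists become a single NaN entry
exploded = df["Message"].explode()

# Test each word, then collapse back to one boolean per original row
list_mask = exploded.isin(words_to_be_detected).groupby(level=0).any()

list_matches = df[list_mask]
print(list_matches)
```

This matches whole list elements only, so a word such as 'XXXXX' would not be selected.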

Raul Medeiros