
I'm very new to Python and still don't fully understand how to use it correctly, so please bear with me.

Let's say we have a dataframe like this:

import pandas as pd

samp_data = pd.DataFrame([[1, 'hello there', 3],
                          [4, 'im just saying hello', 6],
                          [7, 'but sometimes i say bye', 9],
                          [2, 'random words here', 5]],
                         columns=["a", "b", "c"])
print(samp_data)
   a                        b  c
0  1              hello there  3
1  4     im just saying hello  6
2  7  but sometimes i say bye  9
3  2        random words here  5

and we set a list of words we don't want:

unwanted_words = ['hello', 'random']

I want to write a function that will exclude all rows where column b contains any words in the "unwanted_words" list. So the output should be:

print(samp_data)
   a                        b  c
2  7  but sometimes i say bye  9

What I've tried so far includes the built-in "isin()" method:

data = samp_data.ix[samp_data['b'].isin(unwanted_words),:]

but this does not exclude the rows as I expected. I also tried the str.contains() function:

for i,row in samp_data.iterrows():
    if unwanted_words.str.contains(row['b']).any():
        print('found matching words')

and this throws errors.

I think I'm just complicating things and there must be some really easy way that I'm not aware of. Any help is greatly appreciated!

posts i read into so far (not limited to this list, as i closed many windows already):

alwaysaskingquestions

5 Answers

2

You were actually really close to the solution. It uses the method Series.str.contains. Just remember that it treats the pattern as a regular expression by default, so | can be used to match any one of several words:

samp_data[~samp_data['b'].str.contains(r'hello|random')]

Result will be:

Out [11]:
   a                        b  c
2  7  but sometimes i say bye  9
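
One thing to keep in mind is that str.contains is case-sensitive by default. As a minimal sketch, assuming you also want to drop rows that contain "Hello" with a capital letter, you can pass case=False. The small frame named demo below is made up for illustration and is not the data from the question:

```python
import pandas as pd

# Hypothetical sample data with a capitalised unwanted word
demo = pd.DataFrame({'b': ['Hello there', 'random words here', 'say bye']})

# Default behaviour: the match is case-sensitive, so 'Hello there' is kept
case_sensitive = demo[~demo['b'].str.contains(r'hello|random')]

# case=False makes the regex match regardless of letter case
case_insensitive = demo[~demo['b'].str.contains(r'hello|random', case=False)]

print(case_sensitive['b'].tolist())
print(case_insensitive['b'].tolist())
```
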
Han
1

Perhaps not the most elegant but I think it will work for you?

def in_excluded(my_str, excluded):
    """
    (str, list) -> bool
    Returns True if any word in my_str appears in excluded
    """
    # Split the string so whole words are compared, not characters
    for each in my_str.split():
        if each in excluded:
            return True
    return False


def print_only_wanted(samp_data, excluded):
    """
    (DataFrame, list) -> None
    Prints each row of samp_data unless its column b contains a word
    from excluded
    """
    for _, row in samp_data.iterrows():
        if not in_excluded(row['b'], excluded):
            print(row.tolist())
srattigan
1

You can use in to determine whether one string can be found within another string. For example, "he" in "hello" will return True. You can combine this with a list comprehension and the any function to select the rows you want:

df_sub = samp_data.loc[samp_data['b'].apply(lambda x: not(any([badword in x for badword in unwanted_words]))]
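
For readability, the same idea can be sketched with a small named helper and a boolean mask. The helper name has_unwanted is my own choice for illustration, and the data is copied from the question:

```python
import pandas as pd

samp_data = pd.DataFrame([[1, 'hello there', 3],
                          [4, 'im just saying hello', 6],
                          [7, 'but sometimes i say bye', 9],
                          [2, 'random words here', 5]],
                         columns=["a", "b", "c"])
unwanted_words = ['hello', 'random']


def has_unwanted(text, words):
    """Return True if any of the words occurs anywhere in text (substring match)."""
    return any(word in text for word in words)


# Boolean Series: True for rows whose column b contains an unwanted word
mask = samp_data['b'].apply(lambda text: has_unwanted(text, unwanted_words))

# Keep only the rows where the mask is False
df_sub = samp_data.loc[~mask]
print(df_sub)
```

Because in is a substring test, an unwanted word such as "he" would also match "there", so this is only appropriate if partial matches should count.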
madmapper
1

You can use str.contains

samp_data = samp_data[~samp_data.b.str.contains('hello|random')]

You get

   a                        b  c
2  7  but sometimes i say bye  9

If your list of unwanted words is longer, you may want to use

unwanted_words = ['hello', 'random']
samp_data = samp_data[~samp_data.b.str.contains('|'.join(unwanted_words))]
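
One caveat with joining the list directly: str.contains treats the pattern as a regular expression, so a word containing characters such as . or + would be read as regex syntax. A minimal sketch, assuming you want every word matched literally, is to escape each word with re.escape before joining. The na=False argument is an extra assumption on my part, in case column b can contain missing values:

```python
import re

import pandas as pd

samp_data = pd.DataFrame([[1, 'hello there', 3],
                          [4, 'im just saying hello', 6],
                          [7, 'but sometimes i say bye', 9],
                          [2, 'random words here', 5]],
                         columns=["a", "b", "c"])
unwanted_words = ['hello', 'random']

# Escape each word so regex metacharacters are matched literally
pattern = '|'.join(re.escape(word) for word in unwanted_words)

# na=False treats missing values in column b as "no match"
mask = samp_data['b'].str.contains(pattern, na=False)

filtered = samp_data[~mask]
print(filtered)
```
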
Vaishali
0

How about this one-liner? I am sure some of the other pandas enthusiasts will have some niftier answers than me.

samp_data[~samp_data['b'].apply(lambda x: any(word in unwanted_words for word in x.split()))]

   a                        b  c
2  7  but sometimes i say bye  9
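
Note that this one-liner splits on whitespace and compares whole words, so it can give a different result from the str.contains answers when an unwanted word is only part of a longer word. A quick sketch of the difference, using a made-up frame (demo, with the row "hellos everyone") rather than the data from the question:

```python
import pandas as pd

# Hypothetical data: 'hellos' contains 'hello' but is not the same word
demo = pd.DataFrame({'b': ['hello there', 'hellos everyone', 'say bye']})
unwanted_words = ['hello', 'random']

# Whole-word check: each whitespace-separated word must equal an unwanted word
whole_word = demo['b'].apply(
    lambda x: any(word in unwanted_words for word in x.split()))

# Substring check: an unwanted word may appear anywhere in the string
substring = demo['b'].str.contains('|'.join(unwanted_words))

print(demo[~whole_word]['b'].tolist())
print(demo[~substring]['b'].tolist())
```

Which behaviour is right depends on whether partial matches should count as unwanted.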
gold_cy