python how to match partial strings between two unequal sized columns

Question

I'm very new to Python and still don't fully understand how to use it correctly, so please bear with me here.

Let's say we have a dataframe like this:

import pandas as pd

samp_data = pd.DataFrame([[1,'hello there',3],
                          [4,'im just saying hello',6],
                          [7,'but sometimes i say bye',9],
                          [2,'random words here',5]],
                         columns=["a", "b", "c"])
print(samp_data)
   a                        b  c
0  1              hello there  3
1  4     im just saying hello  6
2  7  but sometimes i say bye  9
3  2        random words here  5

and we set a list of words we don't want:

unwanted_words = ['hello', 'random']

I want to write a function that excludes every row where column b contains any of the words in the "unwanted_words" list. So the output should be:

print(samp_data)
   a                        b  c
2  7  but sometimes i say bye  9

What I've tried so far includes using the built-in "isin()" function:

data = samp_data.ix[samp_data['b'].isin(unwanted_words),:]

but this does not exclude the rows as I expected. I also tried using the str.contains() function:

for i,row in samp_data.iterrows():
    if unwanted_words.str.contains(row['b']).any():
        print('found matching words')

and this throws errors.

I think I'm just overcomplicating things and there must be a really easy way that I am not aware of. Any help is greatly appreciated!


You should tag your question with "pandas". This is not pure Python. – glenfant Aug 25 '17 at 19:47

score 2 · Accepted Answer · answered Aug 25 '17 at 20:03

You were actually really close to the solution, which uses the method Series.str.contains. Just remember that it accepts regular expressions, so several words can be combined with "|":

samp_data[~samp_data['b'].str.contains(r'hello|random')]

Result will be:

Out [11]:
   a                        b  c
2  7  but sometimes i say bye  9
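
For reference, here is a minimal self-contained sketch of the same approach. The `case=False` and `na=False` arguments are optional additions that are not part of the answer above; they assume you want case-insensitive matching and want missing values in column b treated as non-matches.

```python
import pandas as pd

samp_data = pd.DataFrame(
    [[1, 'hello there', 3],
     [4, 'im just saying hello', 6],
     [7, 'but sometimes i say bye', 9],
     [2, 'random words here', 5]],
    columns=["a", "b", "c"],
)

# True where column b contains "hello" or "random" anywhere in the string.
# case=False and na=False are assumptions added for this sketch.
mask = samp_data['b'].str.contains(r'hello|random', case=False, na=False)

# ~ inverts the boolean mask, so only rows without a match are kept.
filtered = samp_data[~mask]
print(filtered)
```

Building the mask as a separate variable makes it easy to inspect which rows matched before dropping them.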

answered Aug 25 '17 at 20:03

Han


wow thanks! i like your solution the best! one line and similar to what i was thinking. – alwaysaskingquestions Aug 25 '17 at 20:27

score 1 · Answer 2 · answered Aug 25 '17 at 19:59

Perhaps not the most elegant but I think it will work for you?

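def in_excluded(row, excluded):
    """
    (list, list) -> bool
    Returns True if any string in row contains a word from excluded
    """
    for item in row:
        # Only strings can contain a word, so numeric items are skipped
        if isinstance(item, str) and any(word in item for word in excluded):
            return True
    return False


def print_only_wanted(samp_data, excluded):
    """
    (list, list) -> None
    Prints each of the lists in the main list unless they contain a word
    from excluded
    """
    # For a DataFrame, pass samp_data.values.tolist() as the main list
    for each in samp_data:
        if not in_excluded(each, excluded):
            print(each)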

score 1 · Answer 3 · answered Aug 25 '17 at 20:00

You can use in to determine whether one string can be found within another string. For example, "he" in "hello" will return True. You can combine this with a list comprehension and the any function to select the rows you want:

df_sub = samp_data.loc[samp_data['b'].apply(lambda x: not any(badword in x for badword in unwanted_words))]
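
As a rough illustration of the two building blocks, the sketch below uses plain Python with no pandas. The helper name `has_unwanted` is made up for this example.

```python
unwanted_words = ['hello', 'random']


def has_unwanted(text, words):
    """Return True if any entry of words occurs as a substring of text."""
    # The generator lets any() stop at the first matching word.
    return any(word in text for word in words)


print("he" in "hello")                                          # True
print(has_unwanted('im just saying hello', unwanted_words))     # True
print(has_unwanted('but sometimes i say bye', unwanted_words))  # False
```

The lambda passed to apply in the one-liner above performs this same check once per value in column b.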

score 1 · Answer 4 · answered Aug 25 '17 at 20:01

You can use str.contains:

samp_data = samp_data[~samp_data.b.str.contains('hello|random')]

You get

   a                        b  c
2  7  but sometimes i say bye  9

If your list of unwanted words is longer, you may want to use

unwanted_words = ['hello', 'random']
samp_data = samp_data[~samp_data.b.str.contains('|'.join(unwanted_words))]
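
One caveat with joining the list directly is that str.contains treats the result as a regular expression, so words containing characters such as "$", "." or "(" may not match literally. A possible sketch, assuming you want every word matched as plain text, is to pass each word through `re.escape` first. The data and the second unwanted entry below are made up for illustration.

```python
import re

import pandas as pd

# Hypothetical data: the second row contains regex metacharacters.
samp_data = pd.DataFrame({'b': ['hello there',
                                'costs $5 (approx.)',
                                'but sometimes i say bye']})
unwanted_words = ['hello', '$5 (approx.)']

# re.escape makes each word match literally inside the combined pattern.
pattern = '|'.join(re.escape(word) for word in unwanted_words)

kept = samp_data[~samp_data['b'].str.contains(pattern)]
print(kept)
```

Note that an empty list would join to an empty pattern, which matches every string, so it is probably worth checking that the list is non-empty before filtering.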

score 0 · Answer 5 · answered Aug 25 '17 at 19:53

How about this one-liner? I am sure some of the other pandas enthusiasts will have some niftier answers than me.

samp_data[~samp_data['b'].apply(lambda x: any(word in unwanted_words for word in x.split()))]

   a                        b  c
2  7  but sometimes i say bye  9
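
Note that splitting on whitespace compares whole words, whereas str.contains looks for substrings, so the two can disagree. A small sketch of the difference, using made-up data chosen to show it:

```python
import pandas as pd

# Hypothetical data: "helloworld" contains "hello" but is not the word "hello".
df = pd.DataFrame({'b': ['hello there', 'helloworld', 'say bye']})
unwanted_words = ['hello', 'random']

# Whole-word test: split on whitespace and compare each token to the list.
whole_word = df['b'].apply(
    lambda x: any(word in unwanted_words for word in x.split()))

# Substring test: any unwanted word appearing anywhere in the string.
substring = df['b'].str.contains('|'.join(unwanted_words))

print(df[~whole_word]['b'].tolist())  # ['helloworld', 'say bye']
print(df[~substring]['b'].tolist())   # ['say bye']
```

Which behaviour is right depends on whether a string like "helloworld" should count as containing an unwanted word.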
