
I want to filter out certain words from a pandas DataFrame column and create a new column containing the filtered text. I attempted the solution from here, but I think the problem is that Python assumes I want to call str.replace() instead of df.replace(). I'm not sure how to specify the latter when I'm calling it inside a function.

df:

id     old_text 
0      my favorite color is blue
1      you have a dog
2      we built the house ourselves
3      i will visit you

def removeWords(txt):
     words = ['i', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself']
     txt = txt.replace('|'.join(words), '', regex=True)
     return txt

df['new_text'] = df['old_text'].apply(removeWords)

error:

TypeError: replace() takes no keyword arguments

desired output:

id     old_text                         new_text
0      my favorite color is blue        favorite color is blue
1      you have a dog                   have a dog
2      we built the house ourselves     built the house 
3      i will visit you                 will visit you

other things tried:

def removeWords(txt):
     words = ['i', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself']
     txt = [word for word in txt.split() if word not in words]
     return txt

df['new_text'] = df['old_text'].apply(removeWords)

this returns:

id     old_text                         new_text
0      my favorite color is blue        favorite, color, is, blue
1      you have a dog                   have, a, dog
2      we built the house ourselves     built, the, house 
3      i will visit you                 will, visit, you
Jacob Myer
  • Instead of using a function and `apply`, just use the built in method [series.str.replace](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html) as in `df['new_text'] = df['old_text'].str.replace(args)` – G. Anderson Oct 23 '20 at 15:06
  • I certainly could. However I'm trying to follow a convention and I'd like to understand this issue better – Jacob Myer Oct 23 '20 at 15:08

1 Answer


From this line:

txt = txt.replace('|'.join(words), '', regex=True)

This call matches the signature of pd.Series.replace, so your function expects a Series as input. On the other hand,

df['old_text'].apply(removeWords)

applies the function to each cell of df['old_text']. That means txt is just a string, and str.replace does not accept the regex=True keyword argument, hence the TypeError.
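
The difference can be reproduced in isolation. This is only a minimal sketch: the Series `s`, the shortened word list, and the helper name `per_cell` are made up for illustration and are not part of the question's code.

```python
import pandas as pd

# Hypothetical two-row Series standing in for df['old_text'].
s = pd.Series(["my favorite color is blue", "you have a dog"])

# Series.replace accepts regex=True and works on the whole column at once.
whole_column = s.replace(r"\b(my|you)\b", "", regex=True)

# Series.apply hands each cell to the function as a plain Python str,
# and str.replace has no regex parameter, so this call raises TypeError.
def per_cell(txt):
    return txt.replace("my|you", "", regex=True)

try:
    s.apply(per_cell)
    raised_type_error = False
except TypeError:
    raised_type_error = True
```

The exact wording of the TypeError message may differ between Python versions, so the sketch only checks the exception type.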

TLDR, you want to do:

df['new_text'] = removeWords(df['old_text'])

Output:

   id                      old_text                new_text
0   0     my favorite color is blue    favorte color s blue
1   1                you have a dog              have a dog
2   2  we built the house ourselves   bult the house selves
3   3              i will visit you                wll vst 

But as you can see, your function also replaces the i inside other words. You may want to modify the pattern so that it only replaces whole words, using the word-boundary indicator \b:

def removeWords(txt):
    words = ['i', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself']
    
    # note the `\b` here
    return txt.replace(rf"\b({'|'.join(words)})\b", '', regex=True)

Output:

   id                      old_text                 new_text
0   0     my favorite color is blue   favorite color is blue
1   1                you have a dog               have a dog
2   2  we built the house ourselves         built the house 
3   3              i will visit you              will visit 
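
If you would rather keep the `apply` convention, the function has to operate on a single string. The following is a sketch of two ways that might look, not the only approach: the names `WORDS`, `PATTERN`, `remove_words`, and `remove_words_split` are hypothetical, and collapsing the leftover whitespace is an extra step that the code above does not perform.

```python
import re

WORDS = ['i', 'my', 'myself', 'we', 'our', 'ours', 'ourselves',
         'you', 'your', 'yours', 'yourself']

# Compile once; re.escape guards against words containing regex metacharacters.
PATTERN = re.compile(r"\b(?:" + "|".join(map(re.escape, WORDS)) + r")\b")

def remove_words(txt):
    """Remove whole-word matches from a single string."""
    stripped = PATTERN.sub("", txt)
    # Collapse the whitespace left behind by the removed words.
    return " ".join(stripped.split())

def remove_words_split(txt):
    """Same idea without regex: filter the tokens, then join them back into a string."""
    return " ".join(word for word in txt.split() if word not in WORDS)
```

Both functions take one string, so `df['new_text'] = df['old_text'].apply(remove_words)` should work with the original calling style. The second variant is the list-comprehension attempt from the question with a `' '.join(...)` added, which is why it returns a string rather than a list.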
Quang Hoang