Clean pandas dataframe column, remove parts from strings that are presented in some other dataframe

Question

I am trying to clean up data in one dataframe by values from other dataframe's column. The first dataframe contains a semicolon seperated list of values, the second dataframe contains single words. After cleaning the first dataframe must not contain any words from the second dataframe.

data df1                                       data df2

x1;x2;x3                                       x1
key2;key6;key7;key8                            x2
                                               key6  
                                               key8

I need to remove from data df1, values present in data df2. I am trying to convert two columns from different dfs, into two lists and remove from list1 of df1, the values present in list2 of df2.

Is there a faster way of doing this without a loop considering that data df2 column may have over 1M rows and in data df1 column I have more than one value on a row?

Patrick Artner · Accepted Answer · 2021-02-19T14:35:31.433

You can essentially do this by splitting your dataframe's colum into enough columns, and replace values:

import pandas as pd

df1 = pd.DataFrame({"a": ["x1;x2;x3", "key2;key6;key7;key8"]})
df2 = pd.DataFrame({"tbd": "x1,x2,key6,key8".split(",")})

print(df1)
print(df2)
# create a new df that contains splitted values as columns
df3 = df1["a"].str.split(";", expand=True).fillna(value="")
print(df3)

# replace non wanted values
df3.replace( df2["tbd"],"", inplace=True)
df3.replace( df3, None, "", inplace=True)
print(df3)

Output:

# df1
    a
0             x1;x2;x3
1  key2;key6;key7;key8

# df2
    tbd
0    x1
1    x2
2  key6
3  key8

# df3 (df1 column "a" after splitting into new df)
      0     1     2     3
0    x1    x2    x3  
1  key2  key6  key7  key8

# replacing all values from df3 that are in df2["tbd"]
      0     1     2     3
0                x3  
1  key2  key6  key7

You may need to collect the data again.

To clean df1 in one go you can use a list comprehension like so:

import pandas as pd

df1 = pd.DataFrame({"a": ["x1;x2;x3", "key2;key6;key7;key8"]})
df2 = pd.DataFrame({"tbd": "x1,x2,key6,key8".split(",")})


df1["a"] = [';'.join([i for i in v.split(";")             # split and recombine again
                      if i not in frozenset(df2["tbd"])]) # remove i from v if in df2
            for v in df1["a"]]                            # v == any rows of column

print(df1)


          a
0         x3
1  key2;key7

This solution could have been found as combination of answers from splitting a column by delimiter pandas python and Remove unwanted parts from strings in a column - but it is nto a pure duplicate of either.

thank you. The last step will be then to concatenate the resulted values in df3 and paste them in initial df1, right? — Isa, Feb 19 '21 at 12:28
yes, basically in df1 I have to remain with the values from df1 not present in df2 — Isa, Feb 19 '21 at 12:30
is there a way we can speed up the process since the second file contains more than 1M rows of data?Thank you so much — Isa, Mar 23 '21 at 09:23

Clean pandas dataframe column, remove parts from strings that are presented in some other dataframe

1 Answers1