How to remove listed entries from dataframe by row?

Question

I have two separate dataframes with ~100k rows each. One dataframe has a column ("list_A") in which each row holds a list of column names that meet criterion A; the other has a column ("list_B") in which each row holds a list of names that fail to meet criterion B (each calculated from separate information specific to its own dataframe). I'm trying to create, for each row, a list of names that meet both criteria by removing the names in list B from list A without using a loop. Is this possible?

For instance, pulling the column "list_A" may be something like this:

    [['X','Y','Z','A'],
     ['X','Y','Z','A'],
     ['Y','Z','A']...]

And "list_B" may be something like this:

    [['Z'],
     [],
     ['A']...]

And I'd like to end up with this:

    [['X','Y','A'],
     ['X','Y','Z','A'],
     ['Y','Z']...]

Is there a way to do this without a time-expensive for loop?

This should help answer your question: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe — Brian Sunbury, Jan 24 '19 at 19:36
Are these in a pandas DataFrame or a list of lists as shown? and do the dataframes have the same length (i.e. does row 1 in df1 match against row 1 in df2?) — Sven Harris, Jan 24 '19 at 19:38
The lists are columns in a pandas DataFrame. Yes they have the same length and rows match. — cbruno, Jan 24 '19 at 19:41
Yes, each entry is a single list of the columns names in the DataFrame that meet criteria B — cbruno, Jan 24 '19 at 20:00

score 3 · Answer 1 · answered Jan 24 '19 at 20:01

Try this if the order of the names doesn't matter (a set difference does not preserve the original order):

    df['list_A'] = df.apply(lambda x: list(set(x['list_A']) - set(x['list_B'])), axis=1)
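
The set difference above does not keep the names in their original order, while the desired output in the question does. A minimal order-preserving sketch, assuming both `list_A` and `list_B` sit in one dataframe `df` (the sample data is copied from the question, and the helper name `remove_listed` is hypothetical):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "list_A": [['X', 'Y', 'Z', 'A'], ['X', 'Y', 'Z', 'A'], ['Y', 'Z', 'A']],
    "list_B": [['Z'], [], ['A']],
})


def remove_listed(names, excluded):
    """Return names without any entry found in excluded, keeping the original order."""
    excluded = set(excluded)  # a set makes each membership test cheap
    return [name for name in names if name not in excluded]


# Pair the two columns row by row; this still iterates in Python, just as apply does
df["list_A"] = [remove_listed(a, b) for a, b in zip(df["list_A"], df["list_B"])]
```

This is not vectorised either, so it is unlikely to be dramatically faster than `apply`; it only changes the ordering behaviour.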

answered Jan 24 '19 at 20:01

Kenan


aha beat me to it! – Sven Harris Jan 24 '19 at 20:03

Sven Harris · Accepted Answer · 2019-01-24T20:07:21.260

3

You can do it in the following way (more performant approaches may exist, but lists stored within columns don't tend to lend themselves to high-speed vectorised operations):

    import pandas as pd

    df = pd.DataFrame({"a":[['X','Y','Z','A'],['X','Y','Z','A'],['Y','Z','A']], "b":[['Z'], [], ['A']]})

    # Row-wise set difference; returns a Series of lists (element order is not preserved)
    df.apply(lambda x: list(set(x["a"]).difference(set(x["b"]))), axis=1)
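
The question describes the two columns as living in separate dataframes whose rows match. A sketch of the same set difference for that layout, assuming hypothetical frames named `df_a` and `df_b` with the same length and row order; it pairs rows by position with `zip`, so it does not depend on the index labels lining up:

```python
import pandas as pd

# Hypothetical stand-ins for the two dataframes described in the question
df_a = pd.DataFrame({"list_A": [['X', 'Y', 'Z', 'A'], ['X', 'Y', 'Z', 'A'], ['Y', 'Z', 'A']]})
df_b = pd.DataFrame({"list_B": [['Z'], [], ['A']]})

# Rows are paired by position, so both frames must share length and order
result = pd.Series(
    [list(set(a).difference(b)) for a, b in zip(df_a["list_A"], df_b["list_B"])],
    index=df_a.index,
)
```

As with the `apply` version, the order of names inside each resulting list is not guaranteed.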

edited Jan 24 '19 at 20:07

answered Jan 24 '19 at 20:03

Sven Harris


To be exact, the first line should be `df = pd.DataFrame({"a":[['X','Y','Z','A'],['X','Y','Z','A'],['Y','Z','A']], "b":[['Z'], [], ['A']]})` to match desired output. Nevertheless, works all the same. – WGS Jan 24 '19 at 20:05
@JeromeMontino quite right, I've updated my answer. I bet you always read the question in exams ;) – Sven Harris Jan 24 '19 at 20:07
Lol, how I wish. Code is succinct and performant, either way. Nice. ;) – WGS Jan 24 '19 at 20:10
