
I have two separate dataframes with ~100k rows each. One has a column ("list_A") in which each row holds a list of column names that meet criterion A; the other has a column ("list_B") in which each row holds a list of names that fail to meet criterion B (each is calculated from information specific to its respective dataframe). For each row, I'm trying to build the list of names that meet both criteria by removing the names in list_B from list_A, without using a loop. Is this possible?

For instance, pulling the column "list_A" may be something like this:

    [['X','Y','Z','A'],
     ['X','Y','Z','A'],
     ['Y','Z','A']...]

And "list_B" may be something like this:

    [['Z'],
     [],
     ['A']...]

And I'd like to end up with this:

    [['X','Y','A'],
     ['X','Y','Z','A'],
     ['Y','Z']...]

Is there a way to do this without a time-expensive for loop?

cbruno
  • This should help answer your question: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe – Brian Sunbury Jan 24 '19 at 19:36
  • Are these in a pandas DataFrame or a list of lists as shown? And do the dataframes have the same length (i.e. does row 1 in df1 match against row 1 in df2)? – Sven Harris Jan 24 '19 at 19:38
  • The lists are columns in a pandas DataFrame. Yes they have the same length and rows match. – cbruno Jan 24 '19 at 19:41
  • Does list_B have a max of 1 element? – Kenan Jan 24 '19 at 19:59
  • Yes, each entry is a single list of the column names in the DataFrame that meet criteria B – cbruno Jan 24 '19 at 20:00

2 Answers


Try this if order doesn't matter:

    df['list_A'] = df.apply(lambda x: list(set(x['list_A']) - set(x['list_B'])), axis=1)
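
If order does matter (the expected output in the question keeps the original order of list_A), the set difference above can be swapped for a list comprehension. The following is only a sketch: it assumes both columns sit in one DataFrame `df`, and the helper name `remove_keep_order` and the output column `both` are made up for illustration.

```python
import pandas as pd

# Sample data from the question, assumed to live in a single DataFrame
df = pd.DataFrame({
    "list_A": [['X', 'Y', 'Z', 'A'], ['X', 'Y', 'Z', 'A'], ['Y', 'Z', 'A']],
    "list_B": [['Z'], [], ['A']],
})

def remove_keep_order(row):
    # Hypothetical helper: drop the names in list_B, keep list_A's order
    excluded = set(row["list_B"])  # set gives fast membership tests
    return [name for name in row["list_A"] if name not in excluded]

# "both" is an illustrative column name, not one from the question
df["both"] = df.apply(remove_keep_order, axis=1)
print(df["both"].tolist())
```

Unlike the set version, this also keeps any duplicate names that appear in list_A.
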
Kenan

You can do it in the following way (perhaps more performant ways are possible, but lists within columns don't tend to lend themselves to high-speed vectorised operations):

    import pandas as pd

    df = pd.DataFrame({"a":[['X','Y','Z','A'],['X','Y','Z','A'],['Y','Z','A']], "b":[['Z'], [], ['A']]})

    # Row-wise set difference; sets do not preserve the original order of "a"
    df.apply(lambda x: list(set(x["a"]).difference(set(x["b"]))), axis=1)
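
As one possible alternative to `apply(axis=1)`, the two columns can be zipped directly, which also covers the case in the question where the lists live in two separate, row-aligned dataframes. This is a rough sketch under that assumption: `df_a`, `df_b`, `difference_in_order` and `list_both` are hypothetical names. It is still a Python-level loop under the hood, so it may or may not be faster on real data; it only avoids building a row object for every row.

```python
import pandas as pd

# Hypothetical stand-ins for the two row-aligned dataframes in the question
df_a = pd.DataFrame({"list_A": [['X', 'Y', 'Z', 'A'], ['X', 'Y', 'Z', 'A'], ['Y', 'Z', 'A']]})
df_b = pd.DataFrame({"list_B": [['Z'], [], ['A']]})

def difference_in_order(keep, drop):
    # Remove every name found in `drop` while preserving the order of `keep`
    drop = set(drop)
    return [name for name in keep if name not in drop]

# Pair the rows positionally; this assumes both frames have the same length
df_a["list_both"] = pd.Series(
    [difference_in_order(a, b) for a, b in zip(df_a["list_A"], df_b["list_B"])],
    index=df_a.index,
)
print(df_a["list_both"].tolist())
```
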
Sven Harris
  • To be exact, the first line should be `df = pd.DataFrame({"a":[['X','Y','Z','A'],['X','Y','Z','A'],['Y','Z','A']], "b":[['Z'], [], ['A']]})` to match desired output. Nevertheless, works all the same. – WGS Jan 24 '19 at 20:05
  • @JeromeMontino quite right, I've updated my answer. I bet you always read the question in exams ;) – Sven Harris Jan 24 '19 at 20:07
  • Lol, how I wish. Code is succinct and performant, either way. Nice. ;) – WGS Jan 24 '19 at 20:10