Compare nested list values within columns of a dataframe

Question

How can I compare the lists in two columns of a dataframe, check whether the elements of one list are in the other list, and create another column with the missing elements?

The dataframe looks something like this:

df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
               'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
               'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
               'D': ['d1', 'd2', 'd3']})

I want to compare the lists in columns B and C and write the elements of B that are missing from C to a new column E. The desired output is:

df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
               'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
               'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
               'D': ['d1', 'd2', 'd3'],
               'E': ['b2', ['b1', 'b2'], '']})

score 4 · Accepted Answer · answered Dec 20 '18 at 23:12

As with your previous related question, you can use a list comprehension. As a general rule, you shouldn't force multiple different output types, e.g. list or str, depending on the result. Therefore, I have used lists throughout in this solution.

df['E'] = [list(set(x) - set(y)) for x, y in zip(df['B'], df['C'])]

print(df)

    A             B         C   D         E
0  a1      [b1, b2]  [c1, b1]  d1      [b2]
1  a2  [b1, b2, b3]      [b3]  d2  [b1, b2]
2  a3          [b2]  [b2, b1]  d3        []

Nice approach if order doesn't matter ;} – rafaelc Dec 20 '18 at 23:35
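
Following up on the comment about order: a set difference does not guarantee the order of the result and drops duplicates. If the order of the elements in B should be kept, one possible variation is to filter B with a list comprehension instead of subtracting sets. This is only a sketch; the helper name ordered_difference is made up for illustration and is not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
                   'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
                   'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
                   'D': ['d1', 'd2', 'd3']})


def ordered_difference(left, right):
    """Hypothetical helper: items of `left` not in `right`, in `left`'s order."""
    right_set = set(right)  # build the lookup set once per row
    return [item for item in left if item not in right_set]


# Row by row: keep the elements of B that do not appear in C.
df['E'] = [ordered_difference(x, y) for x, y in zip(df['B'], df['C'])]
print(df['E'].tolist())
```

Every row still gets a list, so the column keeps a single element type as recommended in the answer.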

L. B. · Answer 2 · 2018-12-20T23:38:45.340

def Desintersection(i):
    Output = [b for b in df['B'][i] if b not in df['C'][i]]
    if(len(Output) == 0):
        return ''
    elif(len(Output) == 1):
        return Output[0]
    else:
        return Output


df['E'] = df.index.map(Desintersection)
df

edited Dec 20 '18 at 23:38

answered Dec 20 '18 at 22:57

L. B.

Pretty neat way! – pizza lover Dec 20 '18 at 23:04
No need to wrap with a lambda, just `df.index.map(Desintersection)` – rafaelc Dec 20 '18 at 23:33

score 1 · Answer 3 · answered Dec 20 '18 at 23:37

As in my previous answer:

(df.B.map(set)-df.C.map(set)).map(list)
Out[112]: 
0        [b2]
1    [b2, b1]
2          []
dtype: object

BENY

Haven't tested the timings, but will likely be very slow, even though it is pretty clean ;} – rafaelc Dec 20 '18 at 23:40
@RafaelC thank you. I think this is easier for people who are just getting to know pandas and already know the vectorized way – BENY Dec 21 '18 at 01:41
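
Since the timings were not tested in the comments, here is a rough sketch of how the two approaches could be compared with timeit. The enlarged frame (the sample data repeated 1000 times) and the function names are assumptions made for illustration; no measured results are claimed here, and the numbers will depend on the machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
                   'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
                   'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
                   'D': ['d1', 'd2', 'd3']})

# Assumption: repeat the sample rows to get a frame large enough to time.
big = pd.concat([df] * 1000, ignore_index=True)


def with_comprehension():
    # List comprehension from the accepted answer.
    return [list(set(x) - set(y)) for x, y in zip(big['B'], big['C'])]


def with_map():
    # Series.map version from this answer.
    return (big['B'].map(set) - big['C'].map(set)).map(list)


for func in (with_comprehension, with_map):
    seconds = timeit.timeit(func, number=10)
    print(f"{func.__name__}: {seconds:.4f} s for 10 runs")
```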

score 0 · Answer 4 · answered Dec 20 '18 at 23:45

I agree with @jpp that you shouldn't mix the types so much: when you try to apply the same function to the new E column, it will fail, because it expects each element to be a list.

This would work on E, as it converts single str values to [str] before comparison.

import pandas as pd

df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
                   'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
                   'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
                   'D': ['d1', 'd2', 'd3']})


def difference(df, A, B):
    elements_to_list = lambda x: [n if isinstance(n, list) else [n] for n in x]
    diff = [list(set(a).difference(set(b))) for a, b in zip(elements_to_list(df[A]), elements_to_list(df[B]))]
    diff = [d if d else "" for d in diff]  # replace empty lists with empty strings
    return [d if len(d) != 1 else d[0] for d in diff]  # return with single values extracted from the list


df['E'] = difference(df, "B", "C")
df['F'] = difference(df, "B", "E")
print(list(df['E']))
print(list(df['F']))

['b2', ['b2', 'b1'], '']
['b1', 'b3', 'b2']
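
To illustrate why a bare str in a list column is a problem, here is a small sketch using plain Python sets rather than the dataframe. In this sketch the string is not rejected outright; set() splits it into characters, so the difference quietly comes out wrong. Wrapping the string in a list first, as elements_to_list does above, avoids that. The variable names are made up for the example.

```python
b_values = ['b1', 'b2']

as_string = 'b2'    # mixed-type cell: a bare str
as_list = ['b2']    # the same value wrapped in a list

# set() iterates over a str character by character.
chars = sorted(set(as_string))
print(chars)

# Nothing is removed, because 'b2' itself is not in {'b', '2'}.
wrong = sorted(set(b_values) - set(as_string))
print(wrong)

# With the value wrapped in a list, 'b2' is removed as intended.
right = sorted(set(b_values) - set(as_list))
print(right)
```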
