1

How can I compare lists in two columns of a dataframe, check whether the elements of one list are in the other list, and create another column with the missing elements?

The dataframe looks something like this:

df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
               'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
               'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
               'D': ['d1', 'd2', 'd3']})

I want to compare the elements of column B against column C and output the values of B that are missing from C to column E. The desired output is:

df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
               'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
               'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
               'D': ['d1', 'd2', 'd3'],
               'E': ['b2', ['b1','b2'],'']})
coding_monkey

4 Answers

4

As in your previous related question, you can use a list comprehension. As a general rule, you shouldn't force a column to hold different output types, e.g. list or str, depending on the result. Therefore, this solution uses lists throughout.

df['E'] = [list(set(x) - set(y)) for x, y in zip(df['B'], df['C'])]

print(df)

    A             B         C   D         E
0  a1      [b1, b2]  [c1, b1]  d1      [b2]
1  a2  [b1, b2, b3]      [b3]  d2  [b1, b2]
2  a3          [b2]  [b2, b1]  d3        []
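
One caveat with the set-based version: sets are unordered, so the order of elements inside each result list is not guaranteed. If order matters, a minimal sketch along these lines keeps the order of column B. The helper name `ordered_difference` is made up for illustration, and plain lists stand in for the two columns so the snippet runs without pandas.

```python
def ordered_difference(left, right):
    """Return the items of `left` that are absent from `right`, in `left`'s order."""
    exclude = set(right)  # set lookup keeps each membership test cheap
    return [item for item in left if item not in exclude]


# Plain lists mirroring columns B and C from the question.
B = [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']]
C = [['c1', 'b1'], ['b3'], ['b2', 'b1']]

E = [ordered_difference(b, c) for b, c in zip(B, C)]
print(E)  # [['b2'], ['b1', 'b2'], []]

# With the DataFrame from the question, the same comprehension applies:
# df['E'] = [ordered_difference(b, c) for b, c in zip(df['B'], df['C'])]
```

Every result is still a list, so the column keeps a single type.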
jpp
2
def Desintersection(i):
    # Elements of B in row i that do not appear in C in the same row
    Output = [b for b in df['B'][i] if b not in df['C'][i]]
    if len(Output) == 0:
        return ''
    elif len(Output) == 1:
        return Output[0]
    else:
        return Output


df['E'] = df.index.map(Desintersection)


df


L. B.
1

As in my previous answer, you can map both columns to sets, subtract them, and map the result back to lists:

(df.B.map(set)-df.C.map(set)).map(list)
Out[112]: 
0        [b2]
1    [b2, b1]
2          []
dtype: object
BENY
  • Haven't tested the timings, but will likely be very slow, even though it is pretty clean ;} – rafaelc Dec 20 '18 at 23:40
  • @RafaelC thank you. I think this is easier for people who are just getting to know pandas and the vectorized way – BENY Dec 21 '18 at 01:41
0

I agree with @jpp that you shouldn't mix types like this: if you try to apply the same function to the new E column, it will fail because it expects each element to be a list.

The function below would also work on E, as it converts single str values to [str] before comparison.

import pandas as pd

df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
                   'B': [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']],
                   'C': [['c1', 'b1'], ['b3'], ['b2', 'b1']],
                   'D': ['d1', 'd2', 'd3']})


def difference(df, A, B):
    elements_to_list = lambda x: [n if isinstance(n, list) else [n] for n in x]
    diff = [list(set(a).difference(set(b))) for a, b in zip(elements_to_list(df[A]), elements_to_list(df[B]))]
    diff = [d if d else "" for d in diff]  # replace empty lists with empty strings
    return [d if len(d) != 1 else d[0] for d in diff]  # return with single values extracted from the list


df['E'] = difference(df, "B", "C")
df['F'] = difference(df, "B", "E")
print(list(df['E']))
print(list(df['F']))

['b2', ['b2', 'b1'], '']
['b1', 'b3', 'b2']
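
If you would rather keep lists throughout, as suggested above, a variant can normalize mixed input but always return lists. This is only a sketch under assumptions: `as_list` and `list_difference` are hypothetical names, an empty string is treated as "no elements", and plain lists stand in for the DataFrame columns.

```python
def as_list(value):
    """Pass lists through; wrap a scalar in a list; treat '' as empty."""
    if isinstance(value, list):
        return value
    return [] if value == '' else [value]


def list_difference(left_col, right_col):
    """Row by row, keep the items of the left value that are missing from the right."""
    result = []
    for left, right in zip(left_col, right_col):
        exclude = set(as_list(right))
        result.append([x for x in as_list(left) if x not in exclude])
    return result


B = [['b1', 'b2'], ['b1', 'b2', 'b3'], ['b2']]
E_mixed = ['b2', ['b2', 'b1'], '']  # the mixed-type column E printed above

F = list_difference(B, E_mixed)
print(F)  # [['b1'], ['b3'], ['b2']]
```

Because the output is always a list of lists, the function can be applied again to its own result without any special cases.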
ebro42