What I need:
I have a dataframe where the elements of a column are lists. There are no duplications of elements in a list. For example, a dataframe like the following:
import pandas as pd
>>d = {'col1': [[1, 2, 4, 8], [15, 16, 17], [18, 3], [2, 19], [10, 4]]}
>>df = pd.DataFrame(data=d)
col1
0 [1, 2, 4, 8]
1 [15, 16, 17]
2 [18, 3]
3 [2, 19]
4 [10, 4]
I would like to obtain a dataframe where, if at least a number contained in a list at row i
is also contained in a list at row j
, then the two list are merged (without duplication). But the values could also be shared by more than two lists, in that case I want all lists that share at least a value to be merged.
col1
0 [1, 2, 4, 8, 19, 10]
1 [15, 16, 17]
2 [18, 3]
The order of the rows of the output dataframe, nor the values inside a list is important.
What I tried:
I have found this answer, that shows how to tell if at least one item in list is contained in another list, e.g.
>>not set([1, 2, 4, 8]).isdisjoint([2, 19])
True
Returns True, since 2 is contained in both lists.
I have also found this useful answer that shows how to compare each row of a dataframe with each other. The answer applies a custom function to each row of the dataframe using a lambda.
df.apply(lambda row: func(row['col1']), axis=1)
However I'm not sure how to put this two things together, how to create the func
method. Also I don't know if this approach is even feasible since the resulting rows will probably be less than the ones of the original dataframe.
Thanks!