
Say you have two DataFrames whose columns contain lists:

[image showing df1 (Id, a = [1,2,3] / [4,5] / [6]), df2 (Id, b = [3] / [6,7] / [8]),
and the desired output: one row per pair of Ids whose lists share an element,
together with the union of the two lists]


What would the Python code for this look like?

pangu70
  • What is the logic which leads to the last df? Why do we have two rows? – aprospero Mar 06 '21 at 08:23
  • @aprospero the first column pairs up the Ids whose lists share a common element, and the second column is the union of those two lists – pangu70 Mar 06 '21 at 08:28
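The logic described in that comment can be sketched in plain Python before bringing pandas in (the dict names here are illustrative, not from the question): pair up the Ids whose lists share at least one element, and record the union of the two lists.

```python
# Id -> list, mirroring the two DataFrames in the question
a = {1: [1, 2, 3], 2: [4, 5], 3: [6]}
b = {1: [3], 2: [6, 7], 3: [8]}

pairs = []
for i, la in a.items():
    for j, lb in b.items():
        if set(la) & set(lb):  # the two lists share a common element
            pairs.append(([i, j], sorted(set(la) | set(lb))))

print(pairs)  # [([1, 1], [1, 2, 3]), ([3, 2], [6, 7])]
```

This matches the two rows in the desired output: Ids 1 and 1 share the element 3, Ids 3 and 2 share the element 6.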

2 Answers

import pandas as pd

df1 = pd.DataFrame(
    {
        'Id':[1,2,3],
        'a':[[1,2,3],[4, 5],[6]]
    }, 
)
df2 = pd.DataFrame(
    {
        'Id':[1,2,3],
        'b':[[3],[6, 7],[8]]
    }, 
)


df3 = pd.DataFrame(
    {
        # pair up the Ids (index + 1) of the rows whose lists intersect;
        # the lists are mapped to tuples because pandas treats a tuple as a
        # single scalar in comparisons, while a list raises
        # "ValueError: Lengths must match to compare"
        'Id': [
            [int(df1[pd.Series(map(tuple, df1['a'])) == tuple(i)].index.values) + 1,
             int(df2[pd.Series(map(tuple, df2['b'])) == tuple(j)].index.values) + 1]
            for j in df2['b'] for i in df1['a'] if set(i) & set(j)],
        # union of each intersecting pair of lists
        'a&b': [list(set(i) | set(j))
                for j in df2['b']
                for i in df1['a']
                if set(i) & set(j)],
    }
)

Output:

       Id        a&b
0  [1, 1]  [1, 2, 3]
1  [3, 2]     [6, 7]

I know my solution is messy, but when I tried to solve it another way, for example without the tuple conversion inside the comprehension, I kept hitting errors such as `ValueError: Lengths must match to compare`, which is a common pitfall when comparing a Series of lists.
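The error the answer mentions, and the reason the tuple conversion avoids it, can be shown in a few lines (a minimal sketch, not from the original post): pandas converts a plain list into an array for elementwise comparison, which fails when the lengths differ, whereas a tuple is compared as a single scalar value.

```python
import pandas as pd

s = pd.Series([[1, 2, 3], [4, 5], [6]])

# Comparing against a plain list makes pandas attempt an elementwise
# comparison, which it rejects when the lengths differ.
raised = False
try:
    s == [1, 2]
except ValueError:
    raised = True
print(raised)  # True: "Lengths must match to compare"

# A tuple, by contrast, is treated as one scalar and compared against
# each element, which is why the answer maps the lists to tuples first.
t = pd.Series([(1, 2, 3), (4, 5), (6,)])
print((t == (4, 5)).tolist())  # [False, True, False]
```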

Davinder Singh
  • `(df1['a'])) == i).index.values` logic is nicer than `int(df1[pd.Series(map(tuple, df1['a'])) == tuple(i)`, but the first one causes lots of errors and bugs – Davinder Singh Mar 06 '21 at 08:41
  • it says 'Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)'. I want to do it with DataFrames that have different lengths. – pangu70 Mar 06 '21 at 09:17
  • start with a Cartesian product between df and df1
  • next, keep rows where a and b share at least one element
  • use set.union() to build the combined list of a and b
  • build a list out of the two id columns
  • a bit of cleanup: dropna(), then keep just the columns you want with loc[]
import numpy as np
import pandas as pd

df = pd.DataFrame({"id":[1,2,3],
                   "a":[[1,2,3],[4,5],[6]]})

df1 = pd.DataFrame({"id":[1,2,3],
                    "b":[[3],[6,7],[8]]})

df2 = (df.assign(foo=1)
 .merge(df1.assign(foo=1), on="foo")
 .assign(**{"a&b":lambda dfa: np.where(dfa.apply(lambda r: any(x in r.a for x in r.b), axis=1),
                                     dfa.apply(lambda r: list(set(r.a).union(r.b)), axis=1),
                                     np.nan)})
 .dropna()
 .assign(id=lambda dfa: dfa.loc[:,["id_x","id_y"]].apply(list, axis=1))
 .loc[:,["id","a&b"]]
)

       id        a&b
0  [1, 1]  [1, 2, 3]
7  [3, 2]     [6, 7]
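As a side note: on pandas 1.2 or newer, the `foo=1` helper column is not needed, because `merge` supports `how="cross"` directly. A sketch of the same pipeline under that assumption:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "a": [[1, 2, 3], [4, 5], [6]]})
df1 = pd.DataFrame({"id": [1, 2, 3], "b": [[3], [6, 7], [8]]})

out = (df.merge(df1, how="cross")  # Cartesian product, no helper column
         # keep only rows where a and b share at least one element
         .loc[lambda d: d.apply(lambda r: bool(set(r.a) & set(r.b)), axis=1)]
         .assign(**{"id": lambda d: d[["id_x", "id_y"]].apply(list, axis=1),
                    "a&b": lambda d: d.apply(lambda r: sorted(set(r.a) | set(r.b)), axis=1)})
         .loc[:, ["id", "a&b"]])
print(out)
```

Filtering with a boolean mask before `assign` also avoids the `np.where`/`dropna` round trip, since no NaN rows are ever created.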

alternate approach

  • this eliminates the combination explosion of a Cartesian product
  • copy the list before it's exploded so it can be easily reconstructed
(df.assign(a_arr=df.a).explode("a")
 .merge(df1.assign(b_arr=df1.b).explode("b"), left_on="a", right_on="b")
 .assign(**{"id":lambda dfa: dfa.loc[:,["id_x","id_y"]].apply(list, axis=1),
           "a&b":lambda dfa: dfa.loc[:,["a_arr","b_arr"]].apply(lambda r: list(set(r.a_arr).union(r.b_arr)), axis=1)})
 .loc[:,["id","a&b"]]
)
Rob Raymond
  • It say 'Unable to allocate 26.2 GiB for an array with shape (3513454770,) and data type int64'. What happened? – pangu70 Mar 06 '21 at 09:19
  • it's the Cartesian product - it will generate `len(df) * len(df1)` rows. Is there anything in your DFs that would reduce the number of possible combinations? – Rob Raymond Mar 06 '21 at 09:33
  • I've updated with an alternate approach that does not do a Cartesian product. combinations will depend on size of lists – Rob Raymond Mar 06 '21 at 09:51