
Say you have two DataFrames whose columns contain lists:

[image showing df1 (Id, a = [1,2,3] / [4,5] / [6]), df2 (Id, b = [3] / [6,7] / [8]),
and the desired output: one row per pair of Ids whose lists share an element,
together with the union of the two lists]


What would the Python code for this look like?

pangu70
  • What is the logic which leads to the last df? Why do we have two rows? – aprospero Mar 06 '21 at 08:23
  • @aprospero the first column pairs up the Ids whose lists share a common element, and the second column is the union of those two lists – pangu70 Mar 06 '21 at 08:28
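The logic described in that comment can be sketched in plain Python before bringing pandas in (the dict names here are illustrative, not from the question): pair up the Ids whose lists share at least one element, and record the union of the two lists.

```python
# Id -> list, mirroring the two DataFrames in the question
a = {1: [1, 2, 3], 2: [4, 5], 3: [6]}
b = {1: [3], 2: [6, 7], 3: [8]}

pairs = []
for i, la in a.items():
    for j, lb in b.items():
        if set(la) & set(lb):  # the two lists share a common element
            pairs.append(([i, j], sorted(set(la) | set(lb))))

print(pairs)  # [([1, 1], [1, 2, 3]), ([3, 2], [6, 7])]
```

This matches the two rows in the desired output: Ids 1 and 1 share the element 3, Ids 3 and 2 share the element 6.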

2 Answers

import pandas as pd

df1 = pd.DataFrame(
    {
        'Id':[1,2,3],
        'a':[[1,2,3],[4, 5],[6]]
    }, 
)
df2 = pd.DataFrame(
    {
        'Id':[1,2,3],
        'b':[[3],[6, 7],[8]]
    }, 
)


df3 = pd.DataFrame(
    {
        # pair up the Ids (index + 1) of the rows whose lists intersect;
        # the lists are mapped to tuples because pandas treats a tuple as a
        # single scalar in comparisons, while a list raises
        # "ValueError: Lengths must match to compare"
        'Id': [
            [int(df1[pd.Series(map(tuple, df1['a'])) == tuple(i)].index.values) + 1,
             int(df2[pd.Series(map(tuple, df2['b'])) == tuple(j)].index.values) + 1]
            for j in df2['b'] for i in df1['a'] if set(i) & set(j)],
        # union of each intersecting pair of lists
        'a&b': [list(set(i) | set(j))
                for j in df2['b']
                for i in df1['a']
                if set(i) & set(j)],
    }
)

Output:

       Id        a&b
0  [1, 1]  [1, 2, 3]
1  [3, 2]     [6, 7]

I know my solution is messy, but when I tried to solve it another way, for example without the tuple conversion inside the comprehension, I kept hitting errors such as `ValueError: Lengths must match to compare`, which is a common pitfall when comparing a Series of lists.
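The error the answer mentions, and the reason the tuple conversion avoids it, can be shown in a few lines (a minimal sketch, not from the original post): pandas converts a plain list into an array for elementwise comparison, which fails when the lengths differ, whereas a tuple is compared as a single scalar value.

```python
import pandas as pd

s = pd.Series([[1, 2, 3], [4, 5], [6]])

# Comparing against a plain list makes pandas attempt an elementwise
# comparison, which it rejects when the lengths differ.
raised = False
try:
    s == [1, 2]
except ValueError:
    raised = True
print(raised)  # True: "Lengths must match to compare"

# A tuple, by contrast, is treated as one scalar and compared against
# each element, which is why the answer maps the lists to tuples first.
t = pd.Series([(1, 2, 3), (4, 5), (6,)])
print((t == (4, 5)).tolist())  # [False, True, False]
```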

Davinder Singh
  • `(df1['a'])) == i).index.values` logic is nicer than `int(df1[pd.Series(map(tuple, df1['a'])) == tuple(i)`, but the first one causes lots of errors and bugs – Davinder Singh Mar 06 '21 at 08:41
  • it says 'Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)'. I want to do it with DataFrames that have different lengths. – pangu70 Mar 06 '21 at 09:17
  • start with a Cartesian product between df and df1
  • next, keep rows where a and b share at least one element
  • use set.union() to build the combined list of a and b
  • build a list out of the two id columns
  • a bit of cleanup: dropna(), then keep just the columns you want with loc[]
import numpy as np
import pandas as pd

df = pd.DataFrame({"id":[1,2,3],
                   "a":[[1,2,3],[4,5],[6]]})

df1 = pd.DataFrame({"id":[1,2,3],
                    "b":[[3],[6,7],[8]]})

df2 = (df.assign(foo=1)
 .merge(df1.assign(foo=1), on="foo")
 .assign(**{"a&b":lambda dfa: np.where(dfa.apply(lambda r: any(x in r.a for x in r.b), axis=1),
                                     dfa.apply(lambda r: list(set(r.a).union(r.b)), axis=1),
                                     np.nan)})
 .dropna()
 .assign(id=lambda dfa: dfa.loc[:,["id_x","id_y"]].apply(list, axis=1))
 .loc[:,["id","a&b"]]
)

       id        a&b
0  [1, 1]  [1, 2, 3]
7  [3, 2]     [6, 7]
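As a side note: on pandas 1.2 or newer, the `foo=1` helper column is not needed, because `merge` supports `how="cross"` directly. A sketch of the same pipeline under that assumption:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "a": [[1, 2, 3], [4, 5], [6]]})
df1 = pd.DataFrame({"id": [1, 2, 3], "b": [[3], [6, 7], [8]]})

out = (df.merge(df1, how="cross")  # Cartesian product, no helper column
         # keep only rows where a and b share at least one element
         .loc[lambda d: d.apply(lambda r: bool(set(r.a) & set(r.b)), axis=1)]
         .assign(**{"id": lambda d: d[["id_x", "id_y"]].apply(list, axis=1),
                    "a&b": lambda d: d.apply(lambda r: sorted(set(r.a) | set(r.b)), axis=1)})
         .loc[:, ["id", "a&b"]])
print(out)
```

Filtering with a boolean mask before `assign` also avoids the `np.where`/`dropna` round trip, since no NaN rows are ever created.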

alternate approach

  • this eliminates the combination explosion of a Cartesian product
  • copy the list before it's exploded so it can be easily reconstructed
(df.assign(a_arr=df.a).explode("a")
 .merge(df1.assign(b_arr=df1.b).explode("b"), left_on="a", right_on="b")
 .assign(**{"id":lambda dfa: dfa.loc[:,["id_x","id_y"]].apply(list, axis=1),
           "a&b":lambda dfa: dfa.loc[:,["a_arr","b_arr"]].apply(lambda r: list(set(r.a_arr).union(r.b_arr)), axis=1)})
 .loc[:,["id","a&b"]]
)
Rob Raymond
  • It say 'Unable to allocate 26.2 GiB for an array with shape (3513454770,) and data type int64'. What happened? – pangu70 Mar 06 '21 at 09:19
  • it's the Cartesian product - it will generate `len(df) * len(df1)` rows. Is there anything in your DFs that would reduce the number of possible combinations? – Rob Raymond Mar 06 '21 at 09:33
  • I've updated with an alternate approach that does not do a Cartesian product. combinations will depend on size of lists – Rob Raymond Mar 06 '21 at 09:51