2

I have the following dataframe...

df1:
playerA   playerB  PlayerC PlayerD
kim         lee      b      f
jackson     kim      d      g
dan         lee      a      d

I want to generate a new data frame with all possible combinations of two columns. For example,

df_new:
Target   Source  
kim         lee
kim         kim
kim         lee
kim          b     
kim          d
kim          a
kim          f
kim          g
kim          d      
jackson      lee
jackson      kim
jackson      lee
jackson      b
.
.
.
.
lee         kim
lee         jackson
lee          dan
lee          b
lee          d
.
.
.

Thus, I tried this code t

import itertools
def comb(df1):
    return [df1.loc[:, list(x)].set_axis(['Target','Source'], axis=1)
            for x in itertools.combinations(df1.columns, 2)]

However, It only shows combinations between columns in the same row.

Is there any way that I could generate all the possible combination between columns? Thanks in advance!

3 Answers3

2

A way from itertools via permutations, product and chain.from_iterable:

from itertools import chain, permutations, product

df = pd.DataFrame(
         chain.from_iterable(product(df1[col_1], df1[col_2])
                             for col_1, col_2 in permutations(df1.columns, r=2)),
         columns=["Target", "Source"]
)

where we first get 2-permutations of all columns, then for each pair, form a product of their values. After doing this for all permutations, flatten them with chain.from_iterable and pass to the dataframe constructor.

I get a 108 x 2 dataframe:

      Target Source
0        kim    lee
1        kim    kim
2        kim    lee
3    jackson    lee
4    jackson    kim
..       ...    ...
103        g      d
104        g      a
105        d      b
106        d      d
107        d      a

(where 108 = 3*9*4: 3 = rows, 9 = rows * other columns, 4 = total columns).

Mustafa Aydın
  • 17,645
  • 4
  • 15
  • 38
  • It is really good. Could I ask one more questions? What is "chain"? and "for col_1, col_2 in permutations(df1.columns, r=2)" means? – DAEHYUN KIM May 27 '21 at 06:34
  • @DAEHYUNKIM `chain` is used for flattening a list of lists, you can see this [answer](https://stackoverflow.com/a/953097/9332187). In this case, for example a list like `[[("kim", "lee"), ("kim", "kim")], [("jackson", "lee"), ("jackson", "kim")]]` has 2 lists inside. After `chain`, it becomes `[("kim", "lee"), ("kim", "kim"), ("jackson", "lee"), ("jackson", "kim")]` (note the disappearing `[, ]` inside, it is a single list of 4 items now). – Mustafa Aydın May 27 '21 at 06:42
  • @DAEHYUNKIM As for the `permutations`, it takes columns' names and generate 2-permuations of them. It gives `[('playerA', 'playerB'), ('playerA', 'PlayerC'), ('playerA', 'PlayerD'), ('playerB', 'playerA'), ('playerB', 'PlayerC'), ('playerB', 'PlayerD'), ('PlayerC', 'playerA'), ('PlayerC', 'playerB'), ('PlayerC', 'PlayerD'), ('PlayerD', 'playerA'), ('PlayerD', 'playerB'), ('PlayerD', 'PlayerC')]`. Then with `for col_1, col_2`, we turn a `for` loop and `col_1` and `col_2` are assigned to these pairs in each turn. – Mustafa Aydın May 27 '21 at 06:43
  • 1
    Aydin It really helps me a lot. Thanks you – DAEHYUN KIM May 27 '21 at 08:13
2

Here's an approach using pandas.DataFrame.melt() and pandas.merge()

>>> df1
   playerA playerB PlayerC PlayerD
0      kim     lee       b       f
1  jackson     kim       d       g
2      dan     lee       a       d
>>> target = df1.melt(value_name='Source')[['Source']]
>>> df_new = pd.merge(target.rename(columns={'Source':'Target'}), target, how='cross')
>>> df_new
    Target   Source
0      kim      kim
1      kim  jackson
2      kim      dan
3      kim      lee
4      kim      kim
..     ...      ...
139      d        d
140      d        a
141      d        f
142      d        g
143      d        d

This approach doesn't consider same indices of Target and Source, but you can easily drop those rows using simple math as follow:

>>> indices_to_drop = [idx * len(target) + idx for idx in range(len(target)]
>>> indices_to_drop
[0, 13, 26, 39, 52, 65, 78, 91, 104, 117, 130, 143]
>>> df_new.drop(indices_to_drop).reset_index(drop=True)
    Target   Source
0      kim  jackson
1      kim      dan
2      kim      lee
3      kim      kim
4      kim      lee
..     ...      ...
127      d        b
128      d        d
129      d        a
130      d        f
131      d        g
Gusti Adli
  • 1,225
  • 4
  • 13
  • number of rows is too much, i believe (144 > 108). – Mustafa Aydın May 27 '21 at 06:29
  • Before dropping those indices, this solution also takes product of a column with itself, which is useful i think but may not be what OP wants. After dropping those indices, however, it is somewhere in between. – Mustafa Aydın May 27 '21 at 06:55
1

You can use list comprehension.

df = pd.DataFrame(columns=["Target", "Source"])
for col in df1:
    df = pd.concat(
        [
            df,
            pd.DataFrame(
                [
                    {"Target": val1, "Source": val2}
                    for val2 in df1.loc[:, df1.columns.difference([col])].values.flatten()
                    for val1 in df1[col]
                ]
            ),
        ],
    )

df
    Target  Source
0   kim b
1   jackson b
2   dan b
3   kim f
4   jackson f
... ... ...
22  g   dan
23  d   dan
24  f   lee
25  g   lee
26  d   lee
108 rows × 2 columns
nocibambi
  • 2,065
  • 1
  • 16
  • 22