
I am working in Numpy and Pandas.

I have a table of loans with the features country and sector.

# LOAN    SECTOR    COUNTRY
Loan 1    food      germany
Loan 2    telecom   italy
Loan 3    auto      japan
Loan 4    food      japan
Loan 5    telecom   germany
Loan 6    auto      italy

I need to drop the duplicates by sector and country, i.e. select the unique values of these two features, use them as columns, and build a table with a boolean 1/0 indicating whether the loan is active in that country or sector, as follows:

# LOAN    food  telecom  auto  germany  italy  japan
Loan 1       1        0     0        1      0      0
Loan 2       0        1     0        0      1      0
Loan 3       0        0     1        0      0      1
Loan 4       1        0     0        0      0      1
Loan 5       0        1     0        1      0      0
Loan 6       0        0     1        0      1      0

So, Loan 1 in the first table has food as sector and germany as country; in the second table it therefore gets 1 in the food and germany columns and 0 in all the other columns.

It looks like a job for pivot_table, but I don't understand how to get the 1/0 as values. What's the easiest way?

Thanks

Nassir Bin

3 Answers


You can use get_dummies and groupby.sum:

out = df[['# LOAN']].join(pd.get_dummies(df[['SECTOR', 'COUNTRY']].stack())
                            .groupby(level=0).sum())

NB. use .groupby(level=0).max().astype(int) if there is a chance that both columns can have the same value.

Output:

   # LOAN  auto  food  germany  italy  japan  telecom
0  Loan 1     0     1        1      0      0        0
1  Loan 2     0     0        0      1      0        1
2  Loan 3     1     0        0      0      1        0
3  Loan 4     0     1        0      0      1        0
4  Loan 5     0     0        1      0      0        1
5  Loan 6     1     0        0      1      0        0
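For a quick end-to-end check, the frame can be rebuilt from the question's first table (column names assumed to match it) and the one-liner applied:

```python
import pandas as pd

# Rebuild the frame from the question
df = pd.DataFrame({
    '# LOAN': ['Loan 1', 'Loan 2', 'Loan 3', 'Loan 4', 'Loan 5', 'Loan 6'],
    'SECTOR': ['food', 'telecom', 'auto', 'food', 'telecom', 'auto'],
    'COUNTRY': ['germany', 'italy', 'japan', 'japan', 'germany', 'italy'],
})

# stack() turns the two columns into one long Series of labels,
# get_dummies one-hot encodes it, and groupby(level=0).sum()
# collapses the two rows per loan back into a single row
out = df[['# LOAN']].join(pd.get_dummies(df[['SECTOR', 'COUNTRY']].stack())
                            .groupby(level=0).sum())
print(out)
```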

Alternatives:

With str.get_dummies

out = df[['# LOAN']].join(df[['SECTOR', 'COUNTRY']]
                          .agg('|'.join, axis=1)
                          .str.get_dummies()
                          )

Or with crosstab

cols = ['SECTOR', 'COUNTRY']

out = (pd.concat(pd.crosstab(df['# LOAN'], df[c]) for c in cols)
         .groupby(level=0).sum().reset_index()
       )
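For completeness, a runnable sketch of the crosstab route on the question's data (frame rebuilt from the question; `astype(int)` is added here because the concat step introduces NaNs and floats):

```python
import pandas as pd

df = pd.DataFrame({
    '# LOAN': ['Loan 1', 'Loan 2', 'Loan 3', 'Loan 4', 'Loan 5', 'Loan 6'],
    'SECTOR': ['food', 'telecom', 'auto', 'food', 'telecom', 'auto'],
    'COUNTRY': ['germany', 'italy', 'japan', 'japan', 'germany', 'italy'],
})

cols = ['SECTOR', 'COUNTRY']

# One crosstab (loan x category counts) per column, stacked on top of
# each other, then summed per loan; missing columns become NaN during
# concat, so astype(int) restores the 1/0 integer dtype
out = (pd.concat(pd.crosstab(df['# LOAN'], df[c]) for c in cols)
         .groupby(level=0).sum().astype(int).reset_index()
       )
print(out)
```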
mozway

In case you use sklearn, you can use OneHotEncoder and ColumnTransformer:

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
    [('OHE', OneHotEncoder(dtype=int), ['SECTOR', 'COUNTRY'])],
    remainder='passthrough', verbose_feature_names_out=False
)
out = pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())

Output:

>>> out
  SECTOR_auto SECTOR_food SECTOR_telecom COUNTRY_germany COUNTRY_italy COUNTRY_japan  # LOAN
0           0           1              0               1             0             0  Loan 1
1           0           0              1               0             1             0  Loan 2
2           1           0              0               0             0             1  Loan 3
3           0           1              0               0             0             1  Loan 4
4           0           0              1               1             0             0  Loan 5
5           1           0              0               0             1             0  Loan 6
Corralien
  • thanks! It works perfectly. I gave the solved flag to mozway because you can go straight with pandas without sklearn but I understand that for more complicated cases it's much better this approach. – Nassir Bin Jun 05 '23 at 21:11

What you're looking for is one-hot encoding. There's a great thread on how to get that from a pd.DataFrame() here: How can I one hot encode in Python?

Cybernetic's answer there is pretty thorough.

Edit: mozway's answer on this thread is exactly right: get_dummies is the pandas function for one-hot encoding, and I believe the same one is used in the linked thread.
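A minimal sketch of that idea, rebuilding a few rows from the question and calling pd.get_dummies directly on the categorical columns (this keeps the default SECTOR_/COUNTRY_ prefixes in the column names):

```python
import pandas as pd

df = pd.DataFrame({
    '# LOAN': ['Loan 1', 'Loan 2', 'Loan 3'],
    'SECTOR': ['food', 'telecom', 'auto'],
    'COUNTRY': ['germany', 'italy', 'japan'],
})

# One-hot encode both categorical columns in one call;
# dtype=int gives 1/0 instead of True/False
encoded = pd.get_dummies(df, columns=['SECTOR', 'COUNTRY'], dtype=int)
print(encoded)
```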

Sand