
I am working in Numpy and Pandas.

I have a table of loans with the features country and sector.

# LOAN    SECTOR    COUNTRY
Loan 1    food      germany
Loan 2    telecom   italy
Loan 3    auto      japan
Loan 4    food      japan
Loan 5    telecom   germany
Loan 6    auto      italy

I need to drop the duplicates by sector and country, i.e. select the unique values of these two features, use them as columns, and build a table with a boolean 1/0 indicating whether the loan is active in that country or sector, as follows:

# LOAN    food  telecom  auto  germany  italy  japan
Loan 1       1        0     0        1      0      0
Loan 2       0        1     0        0      1      0
Loan 3       0        0     1        0      0      1
Loan 4       1        0     0        0      0      1
Loan 5       0        1     0        1      0      0
Loan 6       0        0     1        0      1      0

So, Loan 1 in the first table has food as sector and germany as country; in the second table it therefore gets 1 in the food and germany columns and 0 in all the other columns.

It looks like a job for pivot_table, but I don't understand how to get the 1/0 as values. What's the easiest way?

Thanks

Nassir Bin

3 Answers


You can use get_dummies and groupby.sum:

out = df[['# LOAN']].join(pd.get_dummies(df[['SECTOR', 'COUNTRY']].stack())
                            .groupby(level=0).sum())

NB. use .groupby(level=0).max().astype(int) if there is a chance that both columns can have the same value.

Output:

   # LOAN  auto  food  germany  italy  japan  telecom
0  Loan 1     0     1        1      0      0        0
1  Loan 2     0     0        0      1      0        1
2  Loan 3     1     0        0      0      1        0
3  Loan 4     0     1        0      0      1        0
4  Loan 5     0     0        1      0      0        1
5  Loan 6     1     0        0      1      0        0
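For a quick end-to-end check, the frame can be rebuilt from the question's first table (column names assumed to match it) and the one-liner applied:

```python
import pandas as pd

# Rebuild the frame from the question
df = pd.DataFrame({
    '# LOAN': ['Loan 1', 'Loan 2', 'Loan 3', 'Loan 4', 'Loan 5', 'Loan 6'],
    'SECTOR': ['food', 'telecom', 'auto', 'food', 'telecom', 'auto'],
    'COUNTRY': ['germany', 'italy', 'japan', 'japan', 'germany', 'italy'],
})

# stack() turns the two columns into one long Series of labels,
# get_dummies one-hot encodes it, and groupby(level=0).sum()
# collapses the two rows per loan back into a single row
out = df[['# LOAN']].join(pd.get_dummies(df[['SECTOR', 'COUNTRY']].stack())
                            .groupby(level=0).sum())
print(out)
```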

Alternatives:

With str.get_dummies

out = df[['# LOAN']].join(df[['SECTOR', 'COUNTRY']]
                          .agg('|'.join, axis=1)
                          .str.get_dummies()
                          )

Or with crosstab

cols = ['SECTOR', 'COUNTRY']

out = (pd.concat(pd.crosstab(df['# LOAN'], df[c]) for c in cols)
         .groupby(level=0).sum().reset_index()
       )
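For completeness, a runnable sketch of the crosstab route on the question's data (frame rebuilt from the question; `astype(int)` is added here because the concat step introduces NaNs and floats):

```python
import pandas as pd

df = pd.DataFrame({
    '# LOAN': ['Loan 1', 'Loan 2', 'Loan 3', 'Loan 4', 'Loan 5', 'Loan 6'],
    'SECTOR': ['food', 'telecom', 'auto', 'food', 'telecom', 'auto'],
    'COUNTRY': ['germany', 'italy', 'japan', 'japan', 'germany', 'italy'],
})

cols = ['SECTOR', 'COUNTRY']

# One crosstab (loan x category counts) per column, stacked on top of
# each other, then summed per loan; missing columns become NaN during
# concat, so astype(int) restores the 1/0 integer dtype
out = (pd.concat(pd.crosstab(df['# LOAN'], df[c]) for c in cols)
         .groupby(level=0).sum().astype(int).reset_index()
       )
print(out)
```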
mozway

In case you use sklearn, you can use OneHotEncoder and ColumnTransformer:

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
    [('OHE', OneHotEncoder(dtype=int), ['SECTOR', 'COUNTRY'])],
    remainder='passthrough', verbose_feature_names_out=False
)
out = pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())

Output:

>>> out
  SECTOR_auto SECTOR_food SECTOR_telecom COUNTRY_germany COUNTRY_italy COUNTRY_japan  # LOAN
0           0           1              0               1             0             0  Loan 1
1           0           0              1               0             1             0  Loan 2
2           1           0              0               0             0             1  Loan 3
3           0           1              0               0             0             1  Loan 4
4           0           0              1               1             0             0  Loan 5
5           1           0              0               0             1             0  Loan 6
Corralien
  • thanks! It works perfectly. I gave the solved flag to mozway because you can go straight with pandas without sklearn but I understand that for more complicated cases it's much better this approach. – Nassir Bin Jun 05 '23 at 21:11

What you're looking for is one-hot encoding. There's a great thread on how to get that from a pd.DataFrame() here: How can I one hot encode in Python?

Cybernetic's answer there is pretty thorough.

Edit: mozway's answer on this thread is exactly right: get_dummies is the pandas function for one-hot encoding, and I believe the same one is used in the linked thread.
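A minimal sketch of that idea, rebuilding a few rows from the question and calling pd.get_dummies directly on the categorical columns (this keeps the default SECTOR_/COUNTRY_ prefixes in the column names):

```python
import pandas as pd

df = pd.DataFrame({
    '# LOAN': ['Loan 1', 'Loan 2', 'Loan 3'],
    'SECTOR': ['food', 'telecom', 'auto'],
    'COUNTRY': ['germany', 'italy', 'japan'],
})

# One-hot encode both categorical columns in one call;
# dtype=int gives 1/0 instead of True/False
encoded = pd.get_dummies(df, columns=['SECTOR', 'COUNTRY'], dtype=int)
print(encoded)
```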

Sand