
I have a pandas DataFrame that looks like the following:

team_id  skill_id  inventor_id
1        A         Jack
1        B         Jack
1        A         Jill
1        B         Jill
2        A         Jack
2        B         Jack
2        A         Joe
2        B         Joe
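For reference, the example frame above can be reproduced with (the variable name df is an assumption):

```python
import pandas as pd

# Reproducible version of the example frame above
df = pd.DataFrame({
    "team_id":     [1, 1, 1, 1, 2, 2, 2, 2],
    "skill_id":    ["A", "B", "A", "B", "A", "B", "A", "B"],
    "inventor_id": ["Jack", "Jack", "Jill", "Jill", "Jack", "Jack", "Joe", "Joe"],
})
```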

So inventors can repeat across teams. I want to turn this DataFrame into a matrix A of dummy indicators (I have included column names below for clarity; they wouldn't form part of the matrix). For this example, A =

Jack_A  Jack_B  Jill_A  Jill_B  Joe_A  Joe_B
1       0       1       0       0      0
0       1       0       1       0      0
1       0       0       0       1      0
0       1       0       0       0      1

So each row corresponds to one (team_id x skill_id) combination, and an entry of the matrix is equal to one if that (inventor_id x skill_id) pair is observed on the team.

I tried creating an array of numpy zeros and thought of using a double dictionary to map each (team_id x skill_id), (inventor_id x skill_id) combination to an A_ij entry. However, I believe this cannot be the most efficient method.

I need the method to be memory efficient as I have 220,000 (inventor x team x skill) observations. (So the dimension of the real df is (220,000, 3), not (8, 3) as in the example.)

Joe Emmens

2 Answers


In addition to @Ben.T's great answer, I figured out another method that keeps memory use low.

from sklearn.feature_extraction import DictVectorizer

# Build string identifiers for each row
inventor_data["team_id"] = inventor_data["team_id"].astype(str)
inventor_data["inv_skill_id"] = inventor_data["inventor_id"] + inventor_data["skill_id"]
inventor_data["team_skill_id"] = inventor_data["team_id"] + inventor_data["skill_id"]

# DictVectorizer requires one dictionary per row of the output matrix
teams = list(inventor_data.groupby('team_skill_id')['inv_skill_id'].agg(dict))

# Re-key each dict as {inv_skill_id: 1} so each (inventor x skill) column gets a 1
for team_id, team in enumerate(teams):
    teams[team_id] = {v: 1 for k, v in team.items()}

vectoriser = DictVectorizer(sparse=False)
X = vectoriser.fit_transform(teams)
Joe Emmens

IIUC, you can use crosstab:

print(
    pd.crosstab(
        index=[df['team_id'], df['skill_id']],
        columns=[df['inventor_id'], df['skill_id']]
    )#.to_numpy()
)
# inventor_id      Jack    Jill    Joe   
# skill_id            A  B    A  B   A  B
# team_id skill_id                       
# 1       A           1  0    1  0   0  0
#         B           0  1    0  1   0  0
# 2       A           1  0    0  0   1  0
#         B           0  1    0  0   0  1

and if you just want the matrix, then uncomment .to_numpy() in the above code.

Note: if you have some skills that are not shared between teams or inventors, you may need to reindex with all the possibilities, so do:

pd.crosstab(
    index=[df['team_id'],df['skill_id']],
    columns=[df['inventor_id'], df['skill_id']]
).reindex(
    index=pd.MultiIndex.from_product(
            [df['team_id'].unique(),df['skill_id'].unique()]),
    columns=pd.MultiIndex.from_product(
            [df['inventor_id'].unique(),df['skill_id'].unique()]),
    fill_value=0
)#.to_numpy()
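If memory is the binding constraint, one fully sparse alternative (a sketch, not part of the original answer; the variable names are assumptions) is to factorize the row and column keys and build a scipy.sparse.coo_matrix directly, so the dense matrix is never materialised:

```python
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame({
    "team_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "skill_id": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "inventor_id": ["Jack", "Jack", "Jill", "Jill", "Jack", "Jack", "Joe", "Joe"],
})

# Integer codes for each (team, skill) row key and (inventor, skill) column key
rows, row_labels = pd.factorize(df["team_id"].astype(str) + "_" + df["skill_id"])
cols, col_labels = pd.factorize(df["inventor_id"] + "_" + df["skill_id"])

# One nonzero entry per observation
A = sparse.coo_matrix(
    (np.ones(len(df)), (rows, cols)),
    shape=(len(row_labels), len(col_labels)),
)
```

With 220,000 observations this stores only 220,000 nonzeros regardless of the matrix dimensions.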
Ben.T
  • This is great, thank you! But I should have been clearer in my question that I need this to be memory efficient, in my data set my DataFrame contains 221,000 (inventor x team x skill) observations, so this blows up. How could I make this method more memory efficient, can I store it directly as a sparse matrix? – Joe Emmens Feb 15 '22 at 16:54
  • @JoeEmmens to my knowledge, crosstab does not have a sparse possibility, and not sure how to `reindex` in this case. can you give more info on the size. do you have 221K rows in your df? or is it all the possibilities inventor X skill X team? – Ben.T Feb 15 '22 at 17:03
  • The rows are uniquely identified by ["team_id", "inventor_id", "skill_id"] and I have 221K rows. I have 15K unique team_ids and 13K unique inventor ids, so the final matrix will have dimensions ((num_teams * num_skills), (num_invs * num_skills)) = ((15,000 * 8),(13000*8)) = (120000, 104000). – Joe Emmens Feb 15 '22 at 17:09
  • I figured out a method using sklearn's DictVectorizer instead. I think that should satisfy the need to keep memory use low. – Joe Emmens Feb 15 '22 at 17:39