
I have a pandas DataFrame that looks like the following:

team_id  skill_id  inventor_id
1        A         Jack
1        B         Jack
1        A         Jill
1        B         Jill
2        A         Jack
2        B         Jack
2        A         Joe
2        B         Joe
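For reference, the example frame above can be reproduced with (the variable name df is an assumption):

```python
import pandas as pd

# Reproducible version of the example frame above
df = pd.DataFrame({
    "team_id":     [1, 1, 1, 1, 2, 2, 2, 2],
    "skill_id":    ["A", "B", "A", "B", "A", "B", "A", "B"],
    "inventor_id": ["Jack", "Jack", "Jill", "Jill", "Jack", "Jack", "Joe", "Joe"],
})
```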

So inventors can repeat across teams. I want to turn this DataFrame into a matrix A of dummy indicators (I have included column names below for clarity; they wouldn't form part of the matrix). For this example, A =

Jack_A  Jack_B  Jill_A  Jill_B  Joe_A  Joe_B
1       0       1       0       0      0
0       1       0       1       0      0
1       0       0       0       1      0
0       1       0       0       0      1

So each row corresponds to one (team_id x skill_id) combination, and an entry of the matrix is equal to one if that (inventor_id x skill_id) pair is observed on the team.

I tried creating an array of numpy zeros and thought of using a double dictionary to map each (team_id x skill_id), (inventor_id x skill_id) combination to an A_ij entry. However, I believe this cannot be the most efficient method.

I need the method to be memory efficient as I have 220,000 (inventor x team x skill) observations. (So the dimension of the real df is (220,000, 3), not (8, 3) as in the example.)

Joe Emmens

2 Answers


In addition to @Ben.T's great answer, I figured out another method that keeps memory use low.

from sklearn.feature_extraction import DictVectorizer

# Build string identifiers for each row
inventor_data["team_id"] = inventor_data["team_id"].astype(str)
inventor_data["inv_skill_id"] = inventor_data["inventor_id"] + inventor_data["skill_id"]
inventor_data["team_skill_id"] = inventor_data["team_id"] + inventor_data["skill_id"]

# DictVectorizer requires one dictionary per row of the output matrix
teams = list(inventor_data.groupby('team_skill_id')['inv_skill_id'].agg(dict))

# Re-key each dict as {inv_skill_id: 1} so each (inventor x skill) column gets a 1
for team_id, team in enumerate(teams):
    teams[team_id] = {v: 1 for k, v in team.items()}

vectoriser = DictVectorizer(sparse=False)
X = vectoriser.fit_transform(teams)
Joe Emmens

IIUC, you can use crosstab:

print(
    pd.crosstab(
        index=[df['team_id'], df['skill_id']],
        columns=[df['inventor_id'], df['skill_id']]
    )#.to_numpy()
)
# inventor_id      Jack    Jill    Joe   
# skill_id            A  B    A  B   A  B
# team_id skill_id                       
# 1       A           1  0    1  0   0  0
#         B           0  1    0  1   0  0
# 2       A           1  0    0  0   1  0
#         B           0  1    0  0   0  1

and if you just want the matrix, then uncomment .to_numpy() in the above code.

Note: if you have some skills that are not shared between teams or inventors, you may need to reindex with all the possibilities, so do:

pd.crosstab(
    index=[df['team_id'],df['skill_id']],
    columns=[df['inventor_id'], df['skill_id']]
).reindex(
    index=pd.MultiIndex.from_product(
            [df['team_id'].unique(),df['skill_id'].unique()]),
    columns=pd.MultiIndex.from_product(
            [df['inventor_id'].unique(),df['skill_id'].unique()]),
    fill_value=0
)#.to_numpy()
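If memory is the binding constraint, one fully sparse alternative (a sketch, not part of the original answer; the variable names are assumptions) is to factorize the row and column keys and build a scipy.sparse.coo_matrix directly, so the dense matrix is never materialised:

```python
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame({
    "team_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "skill_id": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "inventor_id": ["Jack", "Jack", "Jill", "Jill", "Jack", "Jack", "Joe", "Joe"],
})

# Integer codes for each (team, skill) row key and (inventor, skill) column key
rows, row_labels = pd.factorize(df["team_id"].astype(str) + "_" + df["skill_id"])
cols, col_labels = pd.factorize(df["inventor_id"] + "_" + df["skill_id"])

# One nonzero entry per observation
A = sparse.coo_matrix(
    (np.ones(len(df)), (rows, cols)),
    shape=(len(row_labels), len(col_labels)),
)
```

With 220,000 observations this stores only 220,000 nonzeros regardless of the matrix dimensions.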
Ben.T
  • This is great, thank you! But I should have been clearer in my question that I need this to be memory efficient, in my data set my DataFrame contains 221,000 (inventor x team x skill) observations, so this blows up. How could I make this method more memory efficient, can I store it directly as a sparse matrix? – Joe Emmens Feb 15 '22 at 16:54
  • @JoeEmmens to my knowledge, crosstab does not have a sparse possibility, and not sure how to `reindex` in this case. can you give more info on the size. do you have 221K rows in your df? or is it all the possibilities inventor X skill X team? – Ben.T Feb 15 '22 at 17:03
  • The rows are uniquely identified by ["team_id", "inventor_id", "skill_id"] and I have 221K rows. I have 15K unique team_ids and 13K unique inventor ids, so the final matrix will have dimensions ((num_teams * num_skills), (num_invs * num_skills)) = ((15,000 * 8),(13000*8)) = (120000, 104000). – Joe Emmens Feb 15 '22 at 17:09
  • I figured out a method using sklearn's DictVectorizer instead. I think that should satisfy the need to keep memory use low. – Joe Emmens Feb 15 '22 at 17:39