I have a pandas DataFrame that looks like the following:
team_id | skill_id | inventor_id |
---|---|---|
1 | A | Jack |
1 | B | Jack |
1 | A | Jill |
1 | B | Jill |
2 | A | Jack |
2 | B | Jack |
2 | A | Joe |
2 | B | Joe |
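
To make the example reproducible, the table above can be built as follows. This is only a sketch: the variable name `df` and the choice of integer team ids and string skill/inventor ids are assumptions taken from how the table is printed.

```python
import pandas as pd

# Toy data copied from the table above; dtypes are assumed
# (integer team ids, string skill and inventor ids).
df = pd.DataFrame({
    "team_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "skill_id": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "inventor_id": ["Jack", "Jack", "Jill", "Jill",
                    "Jack", "Jack", "Joe", "Joe"],
})
```
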
Inventors can appear in more than one team. I want to turn this DataFrame into a matrix A of dummy indicators. For this example, A would be the following (the column names are included only for clarity; they would not be part of the matrix):
Jack_A | Jack_B | Jill_A | Jill_B | Joe_A | Joe_B |
---|---|---|---|---|---|
1 | 0 | 1 | 0 | 0 | 0 |
0 | 1 | 0 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 0 | 1 |
Each row corresponds to one (team_id x skill_id) combination and each column to one (inventor_id x skill_id) combination. An entry equals 1 when that inventor is observed with that skill in that team, and 0 otherwise.
I tried creating a NumPy array of zeros and thought of using a double dictionary to map each ((team_id x skill_id), (inventor_id x skill_id)) pair to an entry A_ij. However, I doubt this is the most efficient method.
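
A minimal sketch of that dictionary-based idea, on the toy data, might look like the following. It reads the "double dictionary" as two lookups, one for rows and one for columns, which is an assumption. The names `row_index` and `col_index` and the sorted ordering of the keys are illustrative choices, not requirements.

```python
import numpy as np
import pandas as pd

# Toy data from the example (assumed dtypes).
df = pd.DataFrame({
    "team_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "skill_id": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "inventor_id": ["Jack", "Jack", "Jill", "Jill",
                    "Jack", "Jack", "Joe", "Joe"],
})

# Rows are (team_id, skill_id) pairs; columns are (inventor_id, skill_id) pairs.
row_keys = sorted(set(zip(df["team_id"], df["skill_id"])))
col_keys = sorted(set(zip(df["inventor_id"], df["skill_id"])))

# Two dictionaries mapping each key to its row / column position.
row_index = {key: i for i, key in enumerate(row_keys)}
col_index = {key: j for j, key in enumerate(col_keys)}

# Dense array of zeros; one entry is set to 1 per observation.
A = np.zeros((len(row_keys), len(col_keys)), dtype=np.int8)
for team, skill, inventor in zip(df["team_id"], df["skill_id"], df["inventor_id"]):
    A[row_index[(team, skill)], col_index[(inventor, skill)]] = 1
```

On the toy data this reproduces the matrix shown above, but it allocates every entry of A even though each observation sets only one of them, which is the memory concern.
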
I need the method to be memory efficient, as I have 220,000 (inventor x team x skill) observations. (So the real df has shape (220,000, 3), not (8, 3) as in the example.)