0

I have a pandas DataFrame like the following:

df = pd.DataFrame({
    'Name': ['John', 'Mary', 'Mark', 'Sarah', 'John', 'Sarah', 'John'],
    'Email': ['john@example.com', 'mary@example.com', 'mark@example.com', 'sarah@example.com', 'john@example.com', 'sarah@example.com', 'john@example.com'],
    'Role_Name': ['Admin', 'User', 'Admin', 'User', 'Admin', 'Admin', 'User'],
    'Last_Login': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-01', '2022-01-05', '2022-01-06'],
    'Organization_Name': ['Org1', 'Org2', 'Org1', 'Org2', 'Org2', 'Org2', 'Org3'],
    'Created_By_Name': ['SuperUser', 'SuperUser', 'SuperUser2', 'SuperUser2', 'SuperUser', 'SuperUser', 'Eve'],
    'Modified_By_Name': ['SuperUser', 'SuperUser', 'SuperUser2', 'SuperUser2', 'SuperUser', 'SuperUser2', 'SuperUser2']
})

This DataFrame includes information about various users. Some users appear more than once with the same role, but with access to a different establishment.

I would like to modify this DataFrame to have a row for each unique combination of Email and Role_Name (for example there would be two rows for John - one for the Admin role, and one for the User role). Further to this, I would like the Organization_Name column to become One-Hot Encoded so that for each establishment there is a column with 1s (this individual has access to this establishment) and 0s (this individual does not have access to this establishment).

The final DataFrame should look like this:

df = pd.DataFrame({
    'Name': ['John', 'Mary', 'Mark', 'Sarah', 'Sarah', 'John'],
    'Email': ['john@example.com', 'mary@example.com', 'mark@example.com', 'sarah@example.com', 'sarah@example.com', 'john@example.com'],
    'Role_Name': ['Admin', 'User', 'Admin', 'User', 'Admin', 'User'],
    'Last_Login': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06'],
    'Created_By_Name': ['SuperUser', 'SuperUser', 'SuperUser2', 'SuperUser2', 'SuperUser', 'SuperUser2'],
    'Modified_By_Name': ['SuperUser', 'SuperUser', 'SuperUser2', 'SuperUser2', 'SuperUser2', 'SuperUser2'],
    'Org1': [1, 0, 1, 0, 0, 0],
    'Org2': [1, 1, 0, 1, 1, 0],
    'Org3': [0, 0, 0, 0, 0, 1]
})

Are there any suggestions on how to do this?

I have tried using the pivot_table functionality for this, but I am not succeeding.

This is different to the following: Pandas: Get Dummies As you can see from my example output, I require that a given row can have more than 1 Organization. For example, as mentioned in my original post, John should have 2 rows, not 3. Perhaps One-Hot is the wrong term.

Cyber
  • 132
  • 1
  • 7

1 Answers1

0

join a crosstab while removing the column with pop:

df = df.join(pd.crosstab(df.index, df.pop('Organization_Name')))

Or, if you only have one match per index, get_dummies

df = df.join(pd.get_dummies(df.index, df.pop('Organization_Name')))

Output:

    Name              Email Role_Name  Last_Login Created_By_Name Modified_By_Name  Org1  Org2  Org3
0   John   john@example.com     Admin  2022-01-01       SuperUser        SuperUser     1     0     0
1   Mary   mary@example.com      User  2022-01-02       SuperUser        SuperUser     0     1     0
2   Mark   mark@example.com     Admin  2022-01-03      SuperUser2       SuperUser2     1     0     0
3  Sarah  sarah@example.com      User  2022-01-04      SuperUser2       SuperUser2     0     1     0
4   John   john@example.com     Admin  2022-01-01       SuperUser        SuperUser     0     1     0
5  Sarah  sarah@example.com     Admin  2022-01-05       SuperUser       SuperUser2     0     1     0
6   John   john@example.com      User  2022-01-06             Eve       SuperUser2     0     0     1
mozway
  • 194,879
  • 13
  • 39
  • 75