I have a pandas DataFrame like the following:
df = pd.DataFrame({
'Name': ['John', 'Mary', 'Mark', 'Sarah', 'John', 'Sarah', 'John'],
'Email': ['john@example.com', 'mary@example.com', 'mark@example.com', 'sarah@example.com', 'john@example.com', 'sarah@example.com', 'john@example.com'],
'Role_Name': ['Admin', 'User', 'Admin', 'User', 'Admin', 'Admin', 'User'],
'Last_Login': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-01', '2022-01-05', '2022-01-06'],
'Organization_Name': ['Org1', 'Org2', 'Org1', 'Org2', 'Org2', 'Org2', 'Org3'],
'Created_By_Name': ['SuperUser', 'SuperUser', 'SuperUser2', 'SuperUser2', 'SuperUser', 'SuperUser', 'Eve'],
'Modified_By_Name': ['SuperUser', 'SuperUser', 'SuperUser2', 'SuperUser2', 'SuperUser', 'SuperUser2', 'SuperUser2']
})
This DataFrame includes information about various users. Some users appear more than once with the same role, but with access to a different establishment.
I would like to modify this DataFrame to have a row for each unique combination of Email and Role_Name (for example there would be two rows for John - one for the Admin role, and one for the User role). Further to this, I would like the Organization_Name column to become One-Hot Encoded so that for each establishment there is a column with 1s (this individual has access to this establishment) and 0s (this individual does not have access to this establishment).
The final DataFrame should look like this:
df = pd.DataFrame({
'Name': ['John', 'Mary', 'Mark', 'Sarah', 'Sarah', 'John'],
'Email': ['john@example.com', 'mary@example.com', 'mark@example.com', 'sarah@example.com', 'sarah@example.com', 'john@example.com'],
'Role_Name': ['Admin', 'User', 'Admin', 'User', 'Admin', 'User'],
'Last_Login': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06'],
'Created_By_Name': ['SuperUser', 'SuperUser', 'SuperUser2', 'SuperUser2', 'SuperUser', 'SuperUser2'],
'Modified_By_Name': ['SuperUser', 'SuperUser', 'SuperUser2', 'SuperUser2', 'SuperUser2', 'SuperUser2'],
'Org1': [1, 0, 1, 0, 0, 0],
'Org2': [1, 1, 0, 1, 1, 0],
'Org3': [0, 0, 0, 0, 0, 1]
})
Are there any suggestions on how to do this?
I have tried using the pivot_table
functionality for this, but I am not succeeding.
This is different to the following: Pandas: Get Dummies As you can see from my example output, I require that a given row can have more than 1 Organization. For example, as mentioned in my original post, John should have 2 rows, not 3. Perhaps One-Hot is the wrong term.