I am new to both Pandas and Python.
I’ve been struggling for a bit trying to collate three Panda dataFrames, but I keep getting stuck.
The dataFrames are:
users_df
RangeIndex: 22 entries, 0 to 21
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 22 non-null object
1 firstName 22 non-null object
2 lastName 22 non-null object
dtypes: object(3)
memory usage: 660.0+ bytes
user_records_df
RangeIndex: 46 entries, 0 to 45
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 userId 46 non-null object
1 activityId 46 non-null object
2 completedDate 46 non-null object
dtypes: object(3)
memory usage: 1.2+ KB
activities_df
RangeIndex: 13 entries, 0 to 12
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 activityId 13 non-null object
1 title 13 non-null object
dtypes: object(2)
memory usage: 340.0+ bytes
Ive managed to merge and group these but not 100% successfully. The column user_records_df.userId
is the first, the following columns are the activities_df.title
and the data is user_records_df.completedDate
So I now have a data frame where I can see a users activity completed dates
But what is missing is all the users from the users_df. Now I have a table of users with some activity completed dates.
Just to clarify.
- The user_records_df has 78 unique users (most with multiple entries, one for each activity done)
- The users_df has 154 unique users.
I have tried many combinations and have resulted to now guessing. As I said I am new to Python and Pandas (about 5 days)
This is the current code
merged_records_activity_df = pd.merge(user_records_df, activities_df, left_on='activityId', right_on='activityId', how='left')
merged_records_activity_df.info()
merged_records_activity_df.head(30)
pivot_table = (merged_records_activity_df.pivot_table(index='userId', columns='title', values='completedDate', aggfunc='first')).reset_index()
# Display the pivoted table
pivot_table.head(70)
merged_records_activity_df
RangeIndex: 46 entries, 0 to 45
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 userId 46 non-null object
1 activityId 46 non-null object
2 completedDate 46 non-null object
3 title 46 non-null object
Output for `pivot_table’ (not all the users from users_df are here, only those with records)
RangeIndex: 12 entries, 0 to 11
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 userId 12 non-null object
1 Welcome 11 non-null object
2 What to expect 11 non-null object
3 Preparing 9 non-null object
4 Workshop 8 non-null object
5 Behaviours 5 non-null object
6 Preparation 1 non-null object
7 Workshop 1 non-null object
title userId, Welcome, What to expect, Preparing, Workshop, Behaviours, Preparation, Workshop
0 009f8771-afb9-413d-8832-d08b01d496e5 NaN NaN NaN NaN NaN 2020-12-11 21:48:19.529 2020-12-11 21:48:30.893
1 0e7b00c1-ed87-44e3-8cf1-fc7096d260f4 2020-12-19 09:02:07.650 2021-01-31 23:15:07.465 2021-01-31 23:18:24.340 2021-01-31 23:18:49.695 2021-01-31 23:26:10.081 NaN NaN
2 0f823f67-3443-4755-935d-34b2be94c1c0 2020-12-05 00:56:10.136 2020-12-05 00:56:46.887 2020-12-07 09:47:18.689 2020-12-07 09:47:07.788 NaN NaN NaN
The request is for help getting a dataframe with all the users from user_df
, columns that are all the activity titles and then the values are the user_records_df completedDate
Example output.