Merging 3 dataframes and missing rows

Question

I am new to both Pandas and Python.

I’ve been struggling for a bit trying to collate three Panda dataFrames, but I keep getting stuck.

The dataFrames are:

users_df

RangeIndex: 22 entries, 0 to 21
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         22 non-null     object
 1   firstName  22 non-null     object
 2   lastName   22 non-null     object
dtypes: object(3)
memory usage: 660.0+ bytes

user_records_df

RangeIndex: 46 entries, 0 to 45
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   userId         46 non-null     object
 1   activityId     46 non-null     object
 2   completedDate  46 non-null     object
dtypes: object(3)
memory usage: 1.2+ KB

activities_df

RangeIndex: 13 entries, 0 to 12
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   activityId  13 non-null     object
 1   title       13 non-null     object
dtypes: object(2)
memory usage: 340.0+ bytes

Ive managed to merge and group these but not 100% successfully. The column user_records_df.userId is the first, the following columns are the activities_df.title and the data is user_records_df.completedDate

So I now have a data frame where I can see a users activity completed dates

But what is missing is all the users from the users_df. Now I have a table of users with some activity completed dates.

Just to clarify.

The user_records_df has 78 unique users (most with multiple entries, one for each activity done)
The users_df has 154 unique users.

I have tried many combinations and have resulted to now guessing. As I said I am new to Python and Pandas (about 5 days)

This is the current code

merged_records_activity_df = pd.merge(user_records_df, activities_df, left_on='activityId', right_on='activityId', how='left')

merged_records_activity_df.info()
merged_records_activity_df.head(30)

pivot_table = (merged_records_activity_df.pivot_table(index='userId', columns='title', values='completedDate', aggfunc='first')).reset_index()

# Display the pivoted table
pivot_table.head(70)

merged_records_activity_df

RangeIndex: 46 entries, 0 to 45
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   userId         46 non-null     object
 1   activityId     46 non-null     object
 2   completedDate  46 non-null     object
 3   title          46 non-null     object

Output for `pivot_table’ (not all the users from users_df are here, only those with records)

RangeIndex: 12 entries, 0 to 11
Data columns (total 8 columns):
 #   Column                                           Non-Null Count  Dtype 
---  ------                                           --------------  ----- 
 0   userId                                           12 non-null     object
 1   Welcome                                          11 non-null     object
 2   What to expect                                   11 non-null     object
 3   Preparing                                        9 non-null      object
 4   Workshop                                         8 non-null      object
 5   Behaviours                                       5 non-null      object
 6   Preparation                                      1 non-null      object
 7   Workshop                                         1 non-null      object


title    userId, Welcome, What to expect, Preparing, Workshop, Behaviours, Preparation, Workshop
    0   009f8771-afb9-413d-8832-d08b01d496e5    NaN NaN NaN NaN NaN 2020-12-11 21:48:19.529 2020-12-11 21:48:30.893
    
    
    1   0e7b00c1-ed87-44e3-8cf1-fc7096d260f4    2020-12-19 09:02:07.650 2021-01-31 23:15:07.465 2021-01-31 23:18:24.340 2021-01-31 23:18:49.695 2021-01-31 23:26:10.081 NaN NaN
    
    2   0f823f67-3443-4755-935d-34b2be94c1c0    2020-12-05 00:56:10.136 2020-12-05 00:56:46.887 2020-12-07 09:47:18.689 2020-12-07 09:47:07.788 NaN NaN NaN

The request is for help getting a dataframe with all the users from user_df, columns that are all the activity titles and then the values are the user_records_df completedDate

Example output.

[JFYI](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — Danila Ganchar, Aug 24 '23 at 10:46
[why](https://meta.stackoverflow.com/questions/285551/why-should-i-not-upload-images-of-code-data-errors) — sjri, Aug 24 '23 at 11:11

score 1 · Accepted Answer · answered Aug 24 '23 at 15:34

For future questions, please provide some sample data, such as the following, to help others to answer your question. To convert a sample of your data to a dictionary, you can use df.head().to_dict().

import pandas as pd

users_df = pd.DataFrame(
    {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
     'firstName': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E'},
     'lastName': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E'}})

user_records_df = pd.DataFrame(
    {'userId': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2,
                5: 3, 6: 3, 7: 3, 8: 3, 9: 3},
     'activityId': {0: 100, 1: 101, 2: 102, 3: 100, 4: 102,
                    5: 100, 6: 101, 7: 102, 8: 103, 9: 104},
     'completedDate': {0: '21/07/2023', 1: '21/08/2023',
                       2: '24/06/2023', 3: '11/07/2023',
                       4: '04/06/2023', 5: '09/06/2023',
                       6: '23/07/2023', 7: '23/06/2023',
                       8: '30/06/2023', 9: '08/08/2023'}})

activities_df = pd.DataFrame(
    {'activityId': {0: 100, 1: 101, 2: 102, 3: 103, 4: 104, 5: 105},
     'title': {0: 'Welcome', 1: 'What to expect',
               2: 'Preparing for launch', 3: 'Launch workshop',
               4: 'No. 4', 5: 'No. 5'}})

After your initial merge and creation of pivot table (without the .reset_index(), you have the following:

# merged dataframe
m_1 = pd.merge(user_records_df, activities_df, on="activityId", how="left")

# create pivot table
pt = m_1.pivot_table(columns="title", index="userId", values="completedDate", aggfunc="first")

You can then use pd.DataFrame.reindex, which will

"Conform DataFrame to new index with optional filling logic."

This means that you can change the index to a different pd.Series or list specified. In this case, you want to change the row index to match users_df["id"] (which, importantly, is unique). If this was not unique, you would need to use something like users_df["id"].unique() instead.

Passing axis=1 allows you to change the column names, so this can be set to activities_df["title"], ensuring that these are unique again.

Note that both index and column names can be passed together using index= and columns=.

# Either both together
pt.reindex(index=users_df["id"], columns=activities_df["title"])

# or one at a time (for explanation purposes)
# change row index to be users_df["id"]
pt = pt.reindex(users_df["id"], axis=0)
# change column index to be activities_df["title"]
pt = pt.reindex(activities_df["title"], axis=1)

pt.reset_index()

pt
#Out: 
#title  id     Welcome What to expect  ... Launch workshop       No. 4 No. 5
#0       1  21/07/2023     21/08/2023  ...             NaN         NaN   NaN
#1       2  11/07/2023            NaN  ...             NaN         NaN   NaN
#2       3  09/06/2023     23/07/2023  ...      30/06/2023  08/08/2023   NaN
#3       4         NaN            NaN  ...             NaN         NaN   NaN
#4       5         NaN            NaN  ...             NaN         NaN   NaN

Thanks for that. Even bigger thanks for the explanation. df.head().to_dict(), I’ll do next time — user3067684, Aug 27 '23 at 08:53

Merging 3 dataframes and missing rows

1 Answers1