I have two CSV's that I would like to merge, one looks like this (with many more customers)
The tables and csv's below have changed from the original versions for the example in the "follow up question." Stack Overflow keeps not accepting the edits to the Profiles and Events Tables, but the csv's below are correct.
Profiles Table (df_profiles
)
ID | Name | |
---|---|---|
12 | Mike | Mike@AOL.com |
78 | Jane | Jane@AOL.com |
And the other looks like this (with many more events)
Events Table (df_events
)
ID | Name | Event Type | Event Date |
---|---|---|---|
12 | Mike | upgrade | 10/02/20 |
78 | Jane | upgrade | 9/18/20 |
12 | Mike | cancel | 10/30/20 |
12 | Mike | upgrade | 11/25/20 |
78 | Jane | cancel | 11/14/20 |
In the end, I would like the final CSV export to look like this (with more customers and events)
ID | Name | Event Type 1 | Event Date 1 | Event Type 2 | Event Date 2 | Event Type 3 | Event Date 3 | Event Type 4 | Event Date 4 | |
---|---|---|---|---|---|---|---|---|---|---|
12 | Mike | Mike@AOL.com | upgrade | 10/2/2020 | cancel | 10/30/2020 | upgrade | 11/25/2020 | cancel | 12/03/2020 |
78 | Jane | Jane@AOL.com | upgrade | 9/18/2020 | upgrade | 10/30/2020 | cancel | 11/14/2020 |
This is what I have so far
import pandas as pd
df_profiles = pd.read_csv('profiles.csv')
df_events = pd.read_csv('events.csv')
df_merge = pd.merge(df_profiles[['ID', 'NAME', 'email']], df_events[['ID', 'Event Type', 'Event Date']], on='ID', how='outer')
df_duplicates = df_merge[df_merge.duplicated('ID')]
df_stitch = pd.merge(df_merge, df_duplicates[['ID', 'Event Type', 'Event Date']], on='ID')
This outputs the table like this
ID | Name | Event Type 1 | Event Date 1 | Event Type 2 | Event Date 2 | |
---|---|---|---|---|---|---|
12 | Mike | Mike@AOL.com | upgrade | 10/02/20 | cancel | 10/30/20 |
12 | Mike | Mike@AOL.com | upgrade | 10/02/20 | upgrade | 11/25/20 |
78 | Jane | Jane@AOL.com | upgrade | 9/18/20 | cancel | 11/14/20 |
So instead of adding it to the third event, it just creates another column. This is just an example and in reality many users have 10+ events, but some have only 1 or 2. I am sure the solution is just a certain loop, but I can't seem to get one that works. Any help would be greatly appreciated, thank you in advance.
File Contents are shown below:
events.csv
ID,NAME,Event Type,Event Date,email
12,Mike,upgrade,10/2/2020,Mike@AOL.com
78,Jane,upgrade,9/18/2020,Jane@AOL.com
12,Mike,cancel,10/30/2020,Mike@AOL.com
12,Mike,upgrade,11/25/2020,Mike@AOL.com
78,Jane,cancel,11/14/2020,Jane@AOL.com
12,Mike,cancel,12/03/2020,Mike@AOL.com
78,Jane,upgrade,10/30/2020,Jane@AOL.com
profiles.csv
ID,NAME,email
12,Mike,Mike@AOL.com
78,Jane,Jane@AOL.com
Follow Up Question
With the help of anky, I arrived at a solution that yielded what I was looking for, however, there is one formatting hurdle I would like to clear that will make my life easier when doing the rest of the this project.
The output table is grouping the event types all together, and sorting them by date within those groupings. For example, outputs look like this:
ID | Name | Event Type 1 | Event Date 1 | Event Type 2 | Event Date 2 | Event Type 3 | Event Date 3 | Event Type 4 | Event Date 4 | |
---|---|---|---|---|---|---|---|---|---|---|
12 | Mike | Mike@AOL.com | cancel | 10/30/20 | cancel | 12/03/20 | upgrade | 10/2/20 | upgrade | 11/25/20 |
78 | Jane | Jane@AOL.com | cancel | 11/14/20 | upgrade | 9/18/20 | upgrade | 10/30/20 |
Versus the intended output of (as shown above):
ID | Name | Event Type 1 | Event Date 1 | Event Type 2 | Event Date 2 | Event Type 3 | Event Date 3 | Event Type 4 | Event Date 4 | |
---|---|---|---|---|---|---|---|---|---|---|
12 | Mike | Mike@AOL.com | upgrade | 10/2/2020 | cancel | 10/30/2020 | upgrade | 11/25/2020 | cancel | 12/03/2020 |
78 | Jane | Jane@AOL.com | upgrade | 9/18/2020 | upgrade | 10/30/2020 | cancel | 11/14/2020 |
In order to make sure I was not messing up my date sorting, I went into the CSV itself and ordered the events correctly by date and I still got the output grouped by events rather than date. I am reposting my code below in case there I made an error.
import pandas as pd
df_profiles = pd.read_csv('profiles.csv')
df_events = pd.read_csv('events.csv')
df_events = df_events[['ID','Event Type','Event Date','NAME','email']]
df_profiles = df_profiles[['ID','NAME','email']]
events = df_events.groupby('ID')['Event Type'].apply(lambda x: x.ne(x.shift).cumsum())
out = (df_profiles.merge(df_events.drop(['NAME','email'],1).assign(Events=events),on='ID',how='outer')
.pivot_table(index=['ID','NAME','email'],columns='Events',aggfunc='first')
.sort_index(axis=1,level=1))
out.columns = out.columns.map('{0[0]} {0[1]}'.format)
out = out.reset_index()
out.to_csv('testingfinal.csv',index=False,encoding='utf-8-sig')