2

I have two CSV's that I would like to merge, one looks like this (with many more customers)

The tables and csv's below have changed from the original versions for the example in the "follow up question." Stack Overflow keeps not accepting the edits to the Profiles and Events Tables, but the csv's below are correct.

Profiles Table (df_profiles)

ID Name Email
12 Mike Mike@AOL.com
78 Jane Jane@AOL.com

And the other looks like this (with many more events)

Events Table (df_events)

ID Name Event Type Event Date
12 Mike upgrade 10/02/20
78 Jane upgrade 9/18/20
12 Mike cancel 10/30/20
12 Mike upgrade 11/25/20
78 Jane cancel 11/14/20

In the end, I would like the final CSV export to look like this (with more customers and events)

ID Name Email Event Type 1 Event Date 1 Event Type 2 Event Date 2 Event Type 3 Event Date 3 Event Type 4 Event Date 4
12 Mike Mike@AOL.com upgrade 10/2/2020 cancel 10/30/2020 upgrade 11/25/2020 cancel 12/03/2020
78 Jane Jane@AOL.com upgrade 9/18/2020 upgrade 10/30/2020 cancel 11/14/2020

This is what I have so far

    import pandas as pd

    df_profiles = pd.read_csv('profiles.csv')
    df_events = pd.read_csv('events.csv')

    df_merge = pd.merge(df_profiles[['ID', 'NAME', 'email']], df_events[['ID', 'Event Type', 'Event Date']], on='ID', how='outer')
    df_duplicates = df_merge[df_merge.duplicated('ID')]
    df_stitch = pd.merge(df_merge, df_duplicates[['ID', 'Event Type', 'Event Date']], on='ID')

This outputs the table like this

ID Name Email Event Type 1 Event Date 1 Event Type 2 Event Date 2
12 Mike Mike@AOL.com upgrade 10/02/20 cancel 10/30/20
12 Mike Mike@AOL.com upgrade 10/02/20 upgrade 11/25/20
78 Jane Jane@AOL.com upgrade 9/18/20 cancel 11/14/20

So instead of adding it to the third event, it just creates another column. This is just an example and in reality many users have 10+ events, but some have only 1 or 2. I am sure the solution is just a certain loop, but I can't seem to get one that works. Any help would be greatly appreciated, thank you in advance.

File Contents are shown below:

events.csv

ID,NAME,Event Type,Event Date,email
12,Mike,upgrade,10/2/2020,Mike@AOL.com
78,Jane,upgrade,9/18/2020,Jane@AOL.com
12,Mike,cancel,10/30/2020,Mike@AOL.com
12,Mike,upgrade,11/25/2020,Mike@AOL.com
78,Jane,cancel,11/14/2020,Jane@AOL.com
12,Mike,cancel,12/03/2020,Mike@AOL.com
78,Jane,upgrade,10/30/2020,Jane@AOL.com

profiles.csv

ID,NAME,email
12,Mike,Mike@AOL.com
78,Jane,Jane@AOL.com

Follow Up Question

With the help of anky, I arrived at a solution that yielded what I was looking for, however, there is one formatting hurdle I would like to clear that will make my life easier when doing the rest of the this project.

The output table is grouping the event types all together, and sorting them by date within those groupings. For example, outputs look like this:

ID Name Email Event Type 1 Event Date 1 Event Type 2 Event Date 2 Event Type 3 Event Date 3 Event Type 4 Event Date 4
12 Mike Mike@AOL.com cancel 10/30/20 cancel 12/03/20 upgrade 10/2/20 upgrade 11/25/20
78 Jane Jane@AOL.com cancel 11/14/20 upgrade 9/18/20 upgrade 10/30/20

Versus the intended output of (as shown above):

ID Name Email Event Type 1 Event Date 1 Event Type 2 Event Date 2 Event Type 3 Event Date 3 Event Type 4 Event Date 4
12 Mike Mike@AOL.com upgrade 10/2/2020 cancel 10/30/2020 upgrade 11/25/2020 cancel 12/03/2020
78 Jane Jane@AOL.com upgrade 9/18/2020 upgrade 10/30/2020 cancel 11/14/2020

In order to make sure I was not messing up my date sorting, I went into the CSV itself and ordered the events correctly by date and I still got the output grouped by events rather than date. I am reposting my code below in case there I made an error.

import pandas as pd

df_profiles = pd.read_csv('profiles.csv')
df_events = pd.read_csv('events.csv')
df_events = df_events[['ID','Event Type','Event Date','NAME','email']]
df_profiles = df_profiles[['ID','NAME','email']]

events = df_events.groupby('ID')['Event Type'].apply(lambda x: x.ne(x.shift).cumsum())

out = (df_profiles.merge(df_events.drop(['NAME','email'],1).assign(Events=events),on='ID',how='outer')
        .pivot_table(index=['ID','NAME','email'],columns='Events',aggfunc='first')
        .sort_index(axis=1,level=1))

out.columns = out.columns.map('{0[0]} {0[1]}'.format)
out = out.reset_index()

out.to_csv('testingfinal.csv',index=False,encoding='utf-8-sig')
jlambert
  • 23
  • 6
  • Please don't post images of code, data, or Tracebacks. Copy and paste it as text then format it as code (select it and type `ctrl-k`) ... [Discourage screenshots of code and/or errors](https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors)...[Why not upload images of code on SO when asking a question?](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question) ... [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – wwii Dec 29 '20 at 17:33
  • 1
    @wwii I do not see any screen shots. There are no `JPEG` or `PNG` files embedded in the post. Go view the source code. All of the tables were created using stack overflow mark-down language. – Toothpick Anemone Dec 29 '20 at 17:44
  • First SO post, so apologies if I messed up formatting. But there are no images in the post. I used `ctrl-k` and copy/pasted into the "insert code here". Any tips on how I should have formatted it instead @wwii? – jlambert Dec 29 '20 at 17:44
  • @jlambert do not listen to [@wwii](https://stackoverflow.com/users/2823755/wwii). Your stack overflow post is definitely in the top 25% in terms of good formatting. You also provided short examples of simple test cases, etc... I do have no idea why [wwii](https://stackoverflow.com/users/2823755/wwii) dislikes your question. – Toothpick Anemone Dec 29 '20 at 17:48
  • My bad, they looked like images. – wwii Dec 29 '20 at 18:50

1 Answers1

2

You can use the below to dynamically transpose event types and dates. This uses a helper column which groups on ID and crates an index, then we pivot using df.pivot_table.

events = df_events.groupby("ID")['Event Type'].apply(lambda x: x.ne(x.shift).cumsum())

out = (df_profiles.merge(df_events.drop("Name",1).assign(Events=events),on='ID')
        .pivot_table(index=['ID','Name','Email'],columns='Events',aggfunc='first')
        .sort_index(axis=1,level=1))

out.columns = out.columns.map('{0[0]} {0[1]}'.format)
out = out.reset_index()

print(out)
ID Name Email Event Date 1 Event Type 1 Event Date 2 Event Type 2 Event Date 3 Event Type 3
0 12 Mike Mike@AOL.com 10/02/20 upgrade 10/30/20 cancel 11/25/20 upgrade
1 78 Jane Jane@AOL.com 9/18/20 upgrade 11/14/20 cancel NaN NaN

EDIT:

As per edited question we could try converting the date field to datetime and fo an argsort , then use the same code:

u = df_events.loc[pd.to_datetime(df_events['Event Date'],format='%m/%d/%Y').argsort()]

events = u.groupby('ID')['Event Type'].apply(lambda x: x.ne(x.shift).cumsum())

out = (df_profiles.merge(u.drop(['NAME','email'],1).assign(Events=events),
         on='ID',how='outer')
        .pivot_table(index=['ID','NAME','email'],columns='Events',aggfunc='first')
        .sort_index(axis=1,level=1))

out.columns = out.columns.map('{0[0]} {0[1]}'.format)
out = out.reset_index()
print(out)
ID NAME email Event Date 1 Event Type 1 Event Date 2 Event Type 2 Event Date 3 Event Type 3 Event Date 4 Event Type 4
0 12 Mike Mike@AOL.com 10/2/2020 upgrade 10/30/2020 cancel 11/25/2020 upgrade 12/03/2020 cancel
1 78 Jane Jane@AOL.com 9/18/2020 upgrade 10/30/2020 upgrade 11/14/2020 cancel NaN NaN
anky
  • 74,114
  • 11
  • 41
  • 70
  • 1
    This is exactly what I was looking for, thank you so much. – jlambert Dec 29 '20 at 18:18
  • Sorry to ask another question in this thread, and if I should start a new one, let me know: But is there a way to make the Event Dates go in order? It keeps grouping it so all of the cancels are first then all of the upgrades. I tried sorting the dates in the events.csv first, but I still get this grouping. – jlambert Dec 29 '20 at 19:51
  • The dates are already in datetime. I'll edit the question to include this as well with examples. Whenever you get around to it is great. In the meantime I'll keep at it and update it if I find a solution. Thank you for all of your help. – jlambert Dec 29 '20 at 20:01
  • @jlambert My pleasure!! Glad I could help. :) Happy coding. Happy new year – anky Dec 30 '20 at 17:07