merging rows in csv where row[0] is a duplicate

Question

I have a csv file that has student_id,guardian_email,guardian_first_name,guardian_last_name

In many cases, the student has a mom and dad with info, so the student has more than one row in the csv file.

For example the original csv would look like this:

student_id,guardian_email,guardian_first_name,guardian_last_name
12345,momemail@google.com,Jane,Doe
12345,dademail@google.com,John,Doe
98765,coollady@yahoo.com,Mary,Poppins
99999,soccermom@bing.net,Laura,Croft
99999,blackbelt@karate.com,Chuck,Norris

using python, I want it to output this:

student_id,guardian_email,guardian_first_name,guardian_last_name,guardian_email2,guardian_first_name2,guardian_last_name2
12345,momemail@google.com,Jane,Doe,dademail@google.com,John,Doe
98765,coollady@yahoo.com,Mary,Poppins,,,
99999,soccermom@bing.net,Laura,Croft,blackbelt@karate.com,Chuck,Norris

Any help is greatly appreciated!

Anurag Dabas · Accepted Answer · 2021-07-31T15:42:04.687

0

use groupby()+cumcount() to track position and then pivot():

df['s']=df.groupby('student_id').cumcount()+1
df=df.pivot('student_id','s',['guardian_email','guardian_first_name','guardian_last_name'])
df.columns=[f"{x}_{y}" for x,y in df.columns]
df=df.sort_index(axis=1).reset_index()

OR

use groupby()+cumcount() to track position and then unstack()

df=df.assign(s=df.groupby('student_id').cumcount()+1).set_index(['student_id','s']).unstack()
df.columns=[f"{x}_{y}" for x,y in df.columns]
df=df.sort_index(axis=1).reset_index()

Now If you print df you will get your expected output

Update:

try:

def guardianemailfinal():
    path=r'C:\Users\sftp\PS\IMPORTED\pythonscripts\Major-Clarity\files\guardian_email.csv'
    df=pd.read_csv(path,sep=',')
    df['s']=df.groupby('student_id').cumcount()+1
    df=df.pivot('student_id','s',['guardian_email','guardian_first_name','guardian_last_name'])
    df.columns=[f"{x}_{y}" for x,y in df.columns]
    df=df.sort_index(axis=1).reset_index()
    df.to_csv(r'C:\Users\sftp\PS\IMPORTED\pythonscripts\Major-Clarity\files\output.csv',index=False,sep=',')
    return df

#Finally call the function:
df=guardianemailfinal()

Note: Now If you print df you will get the modified dataframe and check your path then you will get 'output.csv' file

edited Jul 31 '21 at 15:42

answered Jul 31 '21 at 06:12

Anurag Dabas

23,866
9
21
41

Thanks so much for your reply! I tried implementing this and I am getting an error: UnboundLocalError: df['s']=df.groupby('student_id').cumcount()+1 UnboundLocalError: local variable 'df' referenced before assignment......... Any ideas on what I am doing wrong? – Joseph Sanders Jul 31 '21 at 14:44
@JosephSanders Sir you are getting this error in both of the methods? or are you using the above methods inside the function? – Anurag Dabas Jul 31 '21 at 14:45
Im using the first example(pivot) inside the function that creates the csv file – Joseph Sanders Jul 31 '21 at 14:50
@JosephSanders so that's the reason sir for that use global or pass parameter to your function pls see this [unboundlocalerror-on-local-variable-when-reassigned-after-first-use](https://stackoverflow.com/questions/370357/unboundlocalerror-on-local-variable-when-reassigned-after-first-use) – Anurag Dabas Jul 31 '21 at 14:54
def guardianemailfinal(): global df with open(r'C:\Users\sftp\PS\IMPORTED\pythonscripts\Major-Clarity\files\guardian_email.csv') as csv_file: csv_reader = csv.reader(csv_file, delimiter=',') df['s']=df.groupby('student_id').cumcount()+1 df=df.pivot('student_id','s',['guardian_email','guardian_first_name','guardian_last_name']) df.columns=[f"{x}_{y}" for x,y in df.columns] df=df.sort_index(axis=1).reset_index() print (df) – Joseph Sanders Jul 31 '21 at 15:02
the above is what I have, I had to create a new function because it was erasing the data from the csv in the other function.... when I put global df at the beginning of the function it now says df is not defined. Apparantly I am setting this entire function up wrong. All i want to do is read the csv file, and then output either into a new csv or into the same csv, the info as originally shown and I am failing miserably here – Joseph Sanders Jul 31 '21 at 15:04
@JosephSanders sir updated answer...kindly check if it works for you or not! – Anurag Dabas Jul 31 '21 at 15:15
thanks for your help thus far. I did that and this time I get no errors, but the csv file does not get updated with anything at all. I used print(df) and what I see on the screen doesnt exactly look correct though. – Joseph Sanders Jul 31 '21 at 15:28
@JosephSanders so sir you want to update your csv in which mode?append or write?It is just returning you the df after reshaping your data if you want to save this data in csv file then pls tell me the mode I will update the answer **:)** – Anurag Dabas Jul 31 '21 at 15:30
Thanks! It would probably be easier to just output a new csv file named "output.csv" at location r'C:\Users\sftp\PS\IMPORTED\pythonscripts\Major-Clarity\files\' – Joseph Sanders Jul 31 '21 at 15:36
@JosephSanders Sir updated answer..kindly have a look **:)** – Anurag Dabas Jul 31 '21 at 15:43

merging rows in csv where row[0] is a duplicate

1 Answers1