2

I'm trying to merging the panel dataframes by idcode and do a sorting process by 'idcode' in my dataframes.

My data variable names are wave68, wave69...., wave71.

Overall, I have two problems:

  1. First, I want to sort the dataframes by using a loop, but I don't know how to assign the looping file names, i.e. wave+i?

  2. Second, I don't know how to make the loop algorithm to merge the dataframes correctly.

The final result that I want is wide form Panel data sorted by 'idcode' by wave68,69,70.. by using loop.

import pandas as pd
import numpy as np

wave68 = pd.read_csv('panel_data/wave68.csv')
wave69 = pd.read_csv('panel_data/wave69.csv')
wave70 = pd.read_csv('panel_data/wave70.csv')
wave71 = pd.read_csv('panel_data/wave71.csv')

df = [wave68,wave69,wave70,wave71]

def my_sorter(file_name,var):
    for i in file_name:        
        file_name[i].sort_values(by=[var])

wave68 = wave68.sort_values(by=['idcode'])
wave69 = wave69.sort_values(by=['idcode'])
wave70 = wave70.sort_values(by=['idcode'])

merged = pd.merge(wave68, wave69, on='idcode')
merged = pd.merge(merged, wave70, on='idcode')
merged = pd.merge(merged, wave71, on='idcode')

merged.head(20)
Parfait
  • 104,375
  • 17
  • 94
  • 125
Beom Kim
  • 23
  • 5

1 Answers1

0

For merging across multiple dfs, consider the chain merge with reduce. Alternatively, if using one column, run horizontal merge with pd.concat() setting axis=1 but prior to that set the index of each dataframe to the join column. Afterwards, you can then sort on your final dataframe only once.

Do note: unlike other data analytics tools, in pandas you do not need to sort pior to merging.

Merge with reduce (only matched idcodes across all dfs will be kept, otherwise use a left join)

from functools import reduce
...
dfList = [wave68, wave69, wave70, wave71]

finaldf = reduce(lambda left, right: pd.merge(left, right, on=['idcode']), dfList).sort_values(['idcode'])

Merge with pd.concat (any unmatched idcodes will retain NaNs)

dfList = [wave68, wave69, wave70, wave71]

finaldf = pd.concat([df.set_index('idcode') for df in dfList], axis=1).sort_index()
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • Thanks it is so helpful for me! Is there any method when i want to make very long dfList like wave68, ..., wave150.? – Beom Kim Dec 05 '17 at 06:39
  • Great! Glad to help. During your csv import loop, append each df to the list iteratively instead of afterwards. In fact don't ever save dfs separately as you avoid flooding your global environment. – Parfait Dec 05 '17 at 12:14