1

First question here and a long one - there are a couple of things I am struggling with regarding merging and formatting my dataframes. I have some half working solutions ones but I am unsure if they are the best possible based on what I want.

Here are the standard formats of the dataframes I am merging with pandas.

df1 =
      RT   %Area    RRT
0   4.83   5.257  0.509
1   6.76   0.424  0.712
2   7.27   0.495  0.766
3   7.70   0.257  0.811
4   7.79   0.122  0.821
5   9.49  92.763  1.000
6  11.40   0.681  1.201

df2= 
    RT   %Area    RRT
0   4.83   0.731  0.508
1   6.74   1.243  0.709
2   7.28   0.109  0.766
3   7.71   0.287  0.812
4   7.79   0.177  0.820
5   9.50  95.824  1.000
6  11.31   0.348  1.191
7  11.40   1.166  1.200
8  12.09   0.113  1.273

df3 = ...

Currently I am using a reduce operation on pd.merge_ordered() like below to merge my dataframes (3+). This kind of yields what I want and was from a previous question (pandas three-way joining multiple dataframes on columns). I am merging on RRT, and want the indexes with the same RRT values to be placed on the same row - and if the RRT values are unique for that dataset I want a NaN for missing data from other datasets.

#The for loop I use to generate the list of formatted dataframes prior to merging
dfs = []
for entry in os.scandir(directory):
    if (entry.path.endswith(".csv")) and entry.is_file():
        entry = pd.read_csv(entry.path, header=None)
        #Block of formatting code removed
        dfs.append(entry.round(2))

dfs = [df1ar,df2ar,df3ar]
df_final = reduce(lambda left,right: pd.merge_ordered(left,right,on='RRT'), dfs)
cols = ['RRT', 'RT_x', '%Area_x', 'RT_y', '%Area_y', 'RT', '%Area']
df_final = df_final[cols]
print(df_final)

      RRT   RT_x  %Area_x   RT_y  %Area_y     RT   %Area
0   0.508    NaN      NaN   4.83    0.731    NaN     NaN
1   0.509   4.83    5.257    NaN      NaN   4.83   5.257
2   0.709    NaN      NaN   6.74    1.243    NaN     NaN
3   0.712   6.76    0.424    NaN      NaN   6.76   0.424
4   0.766   7.27    0.495   7.28    0.109   7.27   0.495
5   0.811   7.70    0.257    NaN      NaN   7.70   0.257
6   0.812    NaN      NaN   7.71    0.287    NaN     NaN
7   0.820    NaN      NaN   7.79    0.177    NaN     NaN
8   0.821   7.79    0.122    NaN      NaN   7.79   0.122
9   1.000   9.49   92.763   9.50   95.824   9.49  92.763
10  1.191    NaN      NaN  11.31    0.348    NaN     NaN
11  1.200    NaN      NaN  11.40    1.166    NaN     NaN
12  1.201  11.40    0.681    NaN      NaN  11.40   0.681
13  1.273    NaN      NaN  12.09    0.113    NaN     NaN

This works, but:

  1. Can I can insert a multiindex based on the filename of the dataframe that the data came from from and place it above the corresponding columns? Like the suffix option but related back to filename and for more than two sets of data. Is this better done prior to merging? and if so how do I do it? (I've included the for loop I use for to create a list of tables prior to merging.

  2. Is this reduced merge_ordered the simplest way of doing this?

  3. Can I do a similar merge with pd.merge_asof() and use the tolerance value to fine tune the merging based on the similarities between the RRT values? That is, can it be done without cutting off data from the longer dataframes?

I've tried the above and searched for answers, but I'm struggling to find the most efficient way to do everything I want.

concat = pd.concat(dfs, axis=1, keys=['A','B','C'])
concat_final = concat.round(3)
print(concat_final)

  A                     B                     C               
      RT   %Area    RRT     RT   %Area    RRT     RT   %Area    RRT
0   4.83   5.257  0.509   4.83   0.731  0.508   4.83   5.257  0.509
1   6.76   0.424  0.712   6.74   1.243  0.709   6.76   0.424  0.712
2   7.27   0.495  0.766   7.28   0.109  0.766   7.27   0.495  0.766
3   7.70   0.257  0.811   7.71   0.287  0.812   7.70   0.257  0.811
4   7.79   0.122  0.821   7.79   0.177  0.820   7.79   0.122  0.821
5   9.49  92.763  1.000   9.50  95.824  1.000   9.49  92.763  1.000
6  11.40   0.681  1.201  11.31   0.348  1.191  11.40   0.681  1.201
7    NaN     NaN    NaN  11.40   1.166  1.200    NaN     NaN    NaN
8    NaN     NaN    NaN  12.09   0.113  1.273    NaN     NaN    NaN

I have also tried this - and I get the multiindex to denote which file (A,B,C, just as placeholders) it came from. However, it has obviously not merged based on the RRT value like I want.

  1. Can I apply an operation to change this into a similar format to the pd.merge_ordered() format above? Would groupby() work?

Thanks!

Joshoosh
  • 11
  • 2

0 Answers0