Suppose I'm given a large set of CSV files and I would like to read them into Python and combine them in a certain way. Some key characteristics of the files:

  • The index is like a time dimension
  • Columns are specific entities
  • The indexes of different files may overlap, but where they do, the values should be identical
  • Not all files have the same number of columns

In the end, I would like to have one large data frame collecting all entities over the whole time window.

Let me give an example:

import numpy as np
import pandas as pd

df_1 = pd.DataFrame(data=[[np.nan, 2], [np.nan, 4], [np.nan, 8]], index=[0, 1, 2], columns=["A", "B"])
df_2 = pd.DataFrame(data=[[4, 8, 2], [4, 2, 8], [6, 1, 1]], index=[2, 3, 4], columns=["A", "B", "C"])
df_3 = pd.DataFrame(data=[[2, np.nan, 4], [8, np.nan, 8], [1, 8, 10]], index=[2, 3, 4], columns=["C", "D", "E"])
df_end = pd.DataFrame(
    data=[[np.nan, 2, np.nan, np.nan, np.nan], [np.nan, 4, np.nan, np.nan, np.nan],
          [4, 8, 2, np.nan, 4], [4, 2, 8, np.nan, 8], [6, 1, 1, 8, 10]],
    index=[0, 1, 2, 3, 4], columns=["A", "B", "C", "D", "E"],
)

So as you can see, we have three data frames, and the end result I envision is outlined in df_end. In reality the data frames are much larger, but this toy example should make clear what I am trying to achieve.

I've tried to do something like this:

temp = pd.concat([df_1, df_2, df_3])
temp = temp[~temp.index.duplicated(keep="first")]  # keep the first row per index label

which didn't produce the desired result: the deduplication keeps whole rows, so index 2 retains df_1's row (with its NaNs) instead of picking up the values from df_2 and df_3.
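
On the toy data the attempt produces roughly this, with rows 0–2 taken wholesale from df_1 and rows 3–4 from df_2:

     A    B    C   D   E
0  NaN  2.0  NaN NaN NaN
1  NaN  4.0  NaN NaN NaN
2  NaN  8.0  NaN NaN NaN
3  4.0  2.0  8.0 NaN NaN
4  6.0  1.0  1.0 NaN NaN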

swissy

1 Answer

Use combine_first in order of preference: it keeps the caller's non-NaN values and fills the gaps (missing cells, rows, and columns) from the other frame:

out = df_1.combine_first(df_2).combine_first(df_3)

Generalization with functools.reduce:

from functools import reduce

dfs = [df_1, df_2, df_3]
out = reduce(lambda a, b: a.combine_first(b), dfs)
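
To scale this to the original use case, the same fold works over frames read straight from disk. A minimal sketch, assuming the files live in a hypothetical data/ directory and the first CSV column holds the time index:

from functools import reduce
from pathlib import Path

import pandas as pd

# read every CSV, using its first column as the shared time index
files = sorted(Path("data").glob("*.csv"))
dfs = [pd.read_csv(f, index_col=0) for f in files]

# fold all frames together; earlier files take precedence on overlapping cells
out = reduce(lambda a, b: a.combine_first(b), dfs)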

Output:

     A  B    C    D     E
0  NaN  2  NaN  NaN   NaN
1  NaN  4  NaN  NaN   NaN
2  4.0  8  2.0  NaN   4.0
3  4.0  2  8.0  NaN   8.0
4  6.0  1  1.0  8.0  10.0

Alternative with concat, stacking everything and keeping the first non-NaN value per index label and column:

out = pd.concat([df_1, df_2, df_3]).groupby(level=0).first()
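
Both approaches should give the same values on the toy data; a quick sanity check, allowing for the dtype differences (int vs. float) that NaN alignment can introduce:

out_cf = df_1.combine_first(df_2).combine_first(df_3)
out_gb = pd.concat([df_1, df_2, df_3]).groupby(level=0).first()

# values agree; dtypes may differ after NaN alignment, hence check_dtype=False
pd.testing.assert_frame_equal(out_cf, out_gb, check_dtype=False)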

mozway