Suppose I'm given a large set of CSV files that I would like to read into Python and combine in a certain way. Some key characteristics of the files:
- The index is like a time dimension
- The columns are specific entities
- The indices of different files may overlap, but the overlapping values should be the same
- Not all files have the same number of columns
In the end, I would like one huge DataFrame collecting all entities over the whole time window.
Let me give an example:
import numpy as np
import pandas as pd

df_1 = pd.DataFrame(data=[[np.nan, 2], [np.nan, 4], [np.nan, 8]], index=[0, 1, 2], columns=["A", "B"])
df_2 = pd.DataFrame(data=[[4, 8, 2], [4, 2, 8], [6, 1, 1]], index=[2, 3, 4], columns=["A", "B", "C"])
df_3 = pd.DataFrame(data=[[2, np.nan, 4], [8, np.nan, 8], [1, 8, 10]], index=[2, 3, 4], columns=["C", "D", "E"])
df_end = pd.DataFrame(
    data=[[np.nan, 2, np.nan, np.nan, np.nan], [np.nan, 4, np.nan, np.nan, np.nan],
          [4, 8, 2, np.nan, 4], [4, 2, 8, np.nan, 8], [6, 1, 1, 8, 10]],
    index=[0, 1, 2, 3, 4], columns=["A", "B", "C", "D", "E"])
So as you can see, we have three DataFrames, and I envision the end result outlined in df_end. In reality, the DataFrames are much larger, but the toy example should make clear what I'm trying to achieve.
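For completeness, this is roughly how I read the real files into a list of DataFrames; the "data/*.csv" pattern and index_col=0 are placeholders for the actual layout:

import glob

# pandas is already imported above as pd.
# Read every CSV in the directory; the first column holds the time index.
frames = [pd.read_csv(path, index_col=0) for path in sorted(glob.glob("data/*.csv"))]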
I've tried to do something like this:
temp = pd.concat([df_1,df_2,df_3])
temp[~temp.index.duplicated(keep="first")]
which didn't produce the desired result: concat stacks the frames vertically, and keeping only the first occurrence of each duplicated index drops the values the later frames contribute for the overlapping timestamps (e.g. the row kept for index 2 comes from df_1 and still has NaN in column A and nothing in C, D, E).
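The closest I've come is folding the frames with combine_first, which aligns on the union of index and columns and fills each NaN with the first non-missing value; it reproduces df_end for the toy example, but I'm not sure it's a good approach for many large files:

from functools import reduce

# Successively fill the gaps in the running result from the next frame.
result = reduce(lambda left, right: left.combine_first(right), [df_1, df_2, df_3])

# The alignment upcasts the integer columns to float, hence check_dtype=False.
pd.testing.assert_frame_equal(result, df_end, check_dtype=False)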