
I have a dozen DataFrames, each with a datetime index. Some DataFrames share column names and some don't; the same goes for the index values. Also, if two DataFrames share a column name, they never share index values. The DataFrames therefore don't all have the same shape, but all of them are indexed by datetime.

I would like to concatenate all the DataFrames into one, aligning on both the datetime index and the columns. If, for a given date, there is no value for a specific column, I want NaN there.

I am using pandas DataFrames in Python.

I tried (with axis=0 and with axis=1):

data = pd.concat(list_dataframes)

but it doesn't work: with axis=0 the same dates appear several times, and with axis=1 there is an error because the DataFrames don't cover the same datetime ranges.
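To make the axis=0 problem concrete, here is a minimal reproduction with made-up toy data (the names `df_a` and `df_b` are invented for illustration):

```python
import pandas as pd

# two toy frames covering the same dates but different columns
idx = pd.to_datetime(['2011-01-31', '2011-02-28', '2011-03-31'])
df_a = pd.DataFrame({'A': [1.0, 2.0, 3.0]}, index=idx)
df_b = pd.DataFrame({'B': [4.0, 5.0, 6.0]}, index=idx)

# axis=0 stacks the frames vertically, so every date shows up twice
stacked = pd.concat([df_a, df_b])
print(len(stacked))                      # 6 rows instead of 3
print(stacked.index.duplicated().any())  # True
```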

To sum up, I want a single DataFrame whose index is the union of all the datetime indices and whose columns are the union of all the columns across my DataFrames.

Thanks for your help.

A sample of one of the DataFrames: [screenshot omitted]

Juan

1 Answer


Sometimes it is less effort to guess. What I imagine OP is asking: there are a bunch of DataFrames in which some share common columns and others share common indices. Those that share columns need to be combined along axis 0, and the results of those concatenations then need to be concatenated along axis 1.

Setup

Assume these 9 DataFrames, with three sets of columns (['A', 'B', 'C'], ['D', 'E', 'F'], and ['G', 'H', 'I']) and three sets of indices (twelve month ends starting at 2010-12-31, 2011-12-31, and 2012-12-31 respectively).

import numpy as np
import pandas as pd

# three 12-month spans of month-end dates
tidx1 = pd.date_range('2010-12-31', freq='M', periods=12)
tidx2 = pd.date_range('2011-12-31', freq='M', periods=12)
tidx3 = pd.date_range('2012-12-31', freq='M', periods=12)

cols1 = [*'ABC']
cols2 = [*'DEF']
cols3 = [*'GHI']

# one DataFrame per (index, column-set) pair: 3 x 3 = 9 frames
np.random.seed([3, 1415])
dfs = [pd.DataFrame(np.random.rand(12, 3), i, c)
       for i in [tidx1, tidx2, tidx3] for c in [cols1, cols2, cols3]]

If I print the first row of each:

print(*[d.head(1) for d in dfs], sep='\n\n')

                   A         B         C
2010-12-31  0.444939  0.407554  0.460148

                   D         E         F
2010-12-31  0.764869  0.253200  0.548054

                   G         H         I
2010-12-31  0.067359  0.774688  0.009526

                   A         B         C
2011-12-31  0.698617  0.979221  0.037819

                   D         E         F
2011-12-31  0.813683  0.024600  0.978336

                   G         H         I
2011-12-31  0.038986  0.006073  0.254859

                   A         B         C
2012-12-31  0.123621  0.842938  0.580312

                   D         E         F
2012-12-31  0.663650  0.215170  0.382212

                   G         H         I
2012-12-31  0.068951  0.294619  0.285245

Solution

I first group the DataFrames by their set of columns, then use pd.concat twice: once along axis 0 within each group, then along axis 1 across the groups.

df_groups = {}

# group the DataFrames by their (frozen) set of column names
for df in dfs:
    key = frozenset(df.columns)
    df_groups.setdefault(key, []).append(df)

# stack each group along the index (axis 0),
# then join the stacked groups side by side (axis 1)
pd.concat(
    [pd.concat(df_sub) for df_sub in df_groups.values()],
    axis=1
)

Output


                   A         B         C         D         E         F         G         H         I
2010-12-31  0.444939  0.407554  0.460148  0.764869  0.253200  0.548054  0.067359  0.774688  0.009526
2011-01-31  0.465239  0.462691  0.016545  0.778883  0.651676  0.136097  0.149733  0.585309  0.811828
2011-02-28  0.850445  0.817744  0.777962  0.544838  0.035073  0.275079  0.639003  0.132974  0.883372
2011-03-31  0.757983  0.934829  0.831104  0.706685  0.713614  0.776050  0.882258  0.391942  0.099619
2011-04-30  0.879891  0.926879  0.721535  0.542329  0.836541  0.538186  0.071612  0.651631  0.015642
2011-05-31  0.117642  0.145906  0.199844  0.185523  0.652151  0.746060  0.348386  0.606346  0.401578
2011-06-30  0.437564  0.100702  0.278735  0.373741  0.603536  0.775801  0.847639  0.641232  0.250046
2011-07-31  0.609862  0.085823  0.836997  0.091238  0.504035  0.671320  0.503790  0.440365  0.243070
2011-08-31  0.739635  0.866059  0.691271  0.619939  0.301644  0.956463  0.461382  0.283622  0.109229
2011-09-30  0.377185  0.225146  0.435280  0.702457  0.367810  0.454935  0.314665  0.225839  0.859034
2011-10-31  0.700900  0.700946  0.796487  0.882029  0.880251  0.496250  0.010327  0.639184  0.205293
2011-11-30  0.018688  0.700566  0.900749  0.805688  0.038985  0.438329  0.129987  0.448391  0.366678
2011-12-31  0.698617  0.979221  0.037819  0.813683  0.024600  0.978336  0.038986  0.006073  0.254859
2012-01-31  0.560599  0.773646  0.723712  0.616345  0.445424  0.055701  0.682586  0.226415  0.991061
2012-02-29  0.029056  0.021850  0.112871  0.541340  0.735793  0.163323  0.990199  0.550963  0.121778
2012-03-31  0.375866  0.667671  0.636299  0.884157  0.592664  0.114829  0.744816  0.897881  0.139763
2012-04-30  0.474412  0.800556  0.364394  0.864268  0.117149  0.709288  0.567098  0.069015  0.780812
2012-05-31  0.728499  0.071969  0.305145  0.066660  0.700911  0.198066  0.957639  0.286822  0.951477
2012-06-30  0.467122  0.111486  0.537449  0.552404  0.105812  0.369154  0.015311  0.463177  0.084191
2012-07-31  0.624568  0.246598  0.887655  0.809073  0.470078  0.025123  0.807285  0.425915  0.314156
2012-08-31  0.119408  0.589169  0.384039  0.881872  0.378020  0.872950  0.673316  0.351912  0.039022
2012-09-30  0.857798  0.654579  0.185472  0.413052  0.354193  0.294681  0.495490  0.446053  0.195945
2012-10-31  0.737311  0.866795  0.080936  0.311414  0.627207  0.165256  0.904418  0.534207  0.140122
2012-11-30  0.297743  0.336371  0.400472  0.575506  0.206212  0.431040  0.878589  0.096615  0.415686
2012-12-31  0.123621  0.842938  0.580312  0.663650  0.215170  0.382212  0.068951  0.294619  0.285245
2013-01-31  0.709768  0.604121  0.593387  0.146662  0.019847  0.361099  0.001559  0.961979  0.415321
2013-02-28  0.960484  0.795752  0.483328  0.568315  0.015255  0.314391  0.224125  0.149543  0.873957
2013-03-31  0.135724  0.876926  0.653998  0.292689  0.115870  0.147942  0.976790  0.564139  0.588085
2013-04-30  0.228189  0.297403  0.007808  0.420111  0.262472  0.809144  0.731854  0.313508  0.262154
2013-05-31  0.134206  0.081202  0.122216  0.375930  0.112703  0.023932  0.880654  0.897069  0.876440
2013-06-30  0.908906  0.676719  0.032016  0.307829  0.923390  0.768263  0.084670  0.184191  0.374396
2013-07-31  0.776158  0.911811  0.826244  0.227400  0.325776  0.462828  0.678547  0.266153  0.387923
2013-08-31  0.993297  0.405039  0.246078  0.893535  0.295378  0.380805  0.372939  0.358029  0.791162
2013-09-30  0.185163  0.968741  0.224807  0.168866  0.709602  0.994510  0.800004  0.540143  0.671301
2013-10-31  0.723072  0.554268  0.293442  0.158512  0.145731  0.011970  0.078916  0.513555  0.068210
2013-11-30  0.654131  0.006453  0.190228  0.555713  0.302839  0.035536  0.436647  0.635284  0.351624
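As an aside (not part of the original approach): if the frames are guaranteed never to overlap cell-wise, the explicit grouping step can be skipped by folding DataFrame.combine_first over the list, since combine_first aligns on both index and columns and fills the gaps with NaN. A minimal sketch with toy data (the frames and values are made up):

```python
from functools import reduce

import pandas as pd

idx1 = pd.to_datetime(['2011-01-31', '2011-02-28'])
idx2 = pd.to_datetime(['2011-03-31', '2011-04-30'])

dfs = [
    pd.DataFrame({'A': [1.0, 2.0]}, index=idx1),
    pd.DataFrame({'B': [3.0, 4.0]}, index=idx1),
    pd.DataFrame({'A': [5.0, 6.0]}, index=idx2),
]

# combine_first aligns on both axes; cells absent everywhere stay NaN
out = reduce(lambda a, b: a.combine_first(b), dfs)
print(out)
```

Note the trade-off: if two frames ever do hold a value for the same cell, combine_first silently keeps the first non-NaN value, whereas the concat-based approach would raise on the duplicate labels.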
piRSquared
  • Thank you, it was what I was looking for. However, I get the following error: "Reindexing only valid with uniquely valued Index objects" when I have two columns with the same name but different indexes. I used your code from the "Solution" part, replacing dfs with list_dataframes, which is the list of all the DataFrames that I have. Do you know how to solve it? – Juan Apr 25 '19 at 15:09
  • 1
    That is an error specific to your data and refers to the indices or columns not being unique. For example, if one of the dataframes had two columns both named `'A'` or two index values both with date `'2011-10-31'`. I can't say for sure without your data. – piRSquared Apr 25 '19 at 15:12
  • Indeed, some dataframes had some duplicated indices. Thanks a lot. – Juan Apr 26 '19 at 06:48
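For later readers who hit the same "Reindexing only valid with uniquely valued Index objects" error: one common fix, assuming you want to keep the first row per timestamp, is to de-duplicate each frame's index before concatenating. A small sketch with made-up data:

```python
import pandas as pd

# a frame with a duplicated timestamp in its index
idx = pd.to_datetime(['2011-01-31', '2011-01-31', '2011-02-28'])
df = pd.DataFrame({'A': [1.0, 1.0, 2.0]}, index=idx)

# keep only the first row for each duplicated timestamp
deduped = df[~df.index.duplicated(keep='first')]
print(len(deduped))  # 2
```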