3

I am parsing data from excel files and the columns of the resulting DataFrame may or may not align to a base DataFramewhere I want to stack several parsed DataFrame.

Lets call the DataFrame I parse from data A, and the base DataFrame df_A.

I read an excel shee resulting in A=

Index                    AGUB  AGUG   MUEB   MUEB    SIL    SIL   SILB   SILB
2012-01-01 00:00:00      0.00     0   0.00  50.78   0.00   0.00   0.00   0.00
2012-01-01 01:00:00      0.00     0   0.00  53.15   0.00  53.15   0.00   0.00
2012-01-01 02:00:00      0.00     0   0.00   0.00  53.15  53.15  53.15  53.15
2012-01-01 03:00:00      0.00     0   0.00   0.00   0.00  55.16   0.00   0.00
2012-01-01 04:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 05:00:00     48.96     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 06:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 07:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 08:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 09:00:00     52.28     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 10:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 11:00:00     36.93     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 12:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 13:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00  50.00
2012-01-01 14:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00  34.01
2012-01-01 15:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 16:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 17:00:00     53.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 18:00:00      0.00    75   0.00  75.00   0.00  75.00   0.00   0.00
2012-01-01 19:00:00      0.00    70   0.00  70.00   0.00   0.00   0.00   0.00
2012-01-01 20:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 21:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 22:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 23:00:00      0.00     0  53.45  53.45   0.00   0.00   0.00   0.00

I create the base dataframe:

units = ['MUE', 'MUEB', 'SIL', 'SILB', 'AGUG', 'AGUB', 'MUEBP', 'MUELP']
df_A = pd.DataFrame(columns=units)
df_A = pd.concat([df_A, A], axis=0)

Usually with concat if A had less columns than df_A it'll be fine, but in this case the only difference in the columns is the order. the concatenation leads to the following error:

ValueError: Plan shapes are not aligned

I'd like to know how to concatenate the two dataframes with the column order given by df_A.

Santi Peñate-Vera
  • 1,053
  • 4
  • 33
  • 68

2 Answers2

5

I've tried this and it doesn't matter whether there are more columns in the source, or target defined DataFrame - either way, the result is a dataframe that consists of a union of all supplied columns (with empty columns specified in the target, but not populated by the source populated with NaN).

Where I have been able to reproduce your error is where the column names in either the source or target dataframe include a duplicate name (or empty column names).

In your example, various columns appear more than once in your source file. I don't think concat copes very well with these kinds of duplicate columns.

import pandas as pd
s1 = [0,1,2,3,4,5]
s2 = [0,0,0,0,1,1]
A = pd.DataFrame([s2,s1],columns=['A','B','C','D','E','F'])

Resulting in:

A B C D E F
-----------
0 0 0 0 1 1 
0 1 2 3 4 5 

Take a subset of columns and use them to create a new dataframe called B

B = A[['A','C','E']]
 

A C E
-----
0 0 1 
0 2 4 

Create a new empty target dataframe

col_names = ['D','A','C','B']
Z = pd.DataFrame(columns=col_names)
D A C B
-------

And concatenate the two:

Z = pd.concat([B,Z],axis=0)
A  C  D   E
0  0  NaN 1 
0  2  NaN 4 

Works fine!

But if I recreate the empty dataframe using columns as so:

col_names = ['D','A','C','D']
Z = pd.DataFrame(columns=col_names)
    D A C D

And try to concatenate:

col_names = ['D','A','C','D']
Z = pd.DataFrame(columns=col_names)

Then I get the error you describe.

Jeremy Thompson
  • 61,933
  • 36
  • 195
  • 321
Thomas Kimber
  • 10,601
  • 3
  • 25
  • 42
3

It's because of the duplicate columns in the data (SIL). See: Pandas concat gives error ValueError: Plan shapes are not aligned

Community
  • 1
  • 1
Def_Os
  • 5,301
  • 5
  • 34
  • 63