79

In pandas, I am attempting to concatenate a set of dataframes and I am getting this error:

ValueError: Plan shapes are not aligned

My understanding of .concat() is that it will join where columns are the same, but for those that it can't find it will fill with NA. This doesn't seem to be the case here.

Here's the concat statement:

dfs = [npo_jun_df, npo_jul_df,npo_may_df,npo_apr_df,npo_feb_df]
alpha = pd.concat(dfs)
Eric Leschinski
  • 146,994
  • 96
  • 417
  • 335
Lt.Fr0st
  • 999
  • 1
  • 7
  • 10
  • 4
    You're simply going to have to post some data and minimal example that throws this error. Bonus points for mocking up some output that you'd like to see. – Paul H Oct 07 '14 at 00:33
  • 2
    You have a trailing comma in your list: `dfs = [npo_jun_df, npo_jul_df,npo_may_df,npo_apr_df,npo_feb_df,<------ ]` remove this and try again – EdChum Oct 07 '14 at 07:31
  • 3
    Actually it should not make a difference if you have a trailing comma in your list it should still work, and even if your columns clashed it should still work. You need to post data and code that we can use to reproduce your error, note that I am running pandas version 0.14.1, what version of pandas, numpy and python are you running – EdChum Oct 07 '14 at 08:02
  • 2
    I recently had the same error, it turns out that I had a duplicate column name in the dataframe `df_e` when joining using an append statement `df=df_t.append(df_e)`. Before the statement worked fine, then I accidentally added a duplicated column and it gave them the same error statement as above. – jason Mar 06 '15 at 12:18
  • 1
    If others are getting this error (most likely due to duplicated column names) try: `df.columns.duplicated()` for each of the data frames. `pd.concat` can't handle duplicated column names. – Ram Narasimhan Feb 25 '20 at 15:50

7 Answers7

157

In case it helps, I have also hit this error when I tried to concatenate two data frames (and as of the time of writing this is the only related hit I can find on google other than the source code).

I don't know whether this answer would have solved the OP's problem (since he/she didn't post enough information), but for me, this was caused when I tried to concat dataframe df1 with columns ['A', 'B', 'B', 'C'] (see the duplicate column headings?) with dataframe df2 with columns ['A', 'B']. Understandably the duplication caused pandas to throw a wobbly. Change df1 to ['A', 'B', 'C'] (i.e. drop one of the duplicate columns) and everything works fine.

Niels Hameleers
  • 1,201
  • 10
  • 11
user3805082
  • 2,076
  • 1
  • 14
  • 9
  • 23
    For help with eliminating duplicate column names, try `df = df.loc[:,~df.columns.duplicated()]` from Gene Burinsky's answer to this question [link](http://stackoverflow.com/questions/14984119/python-pandas-remove-duplicate-columns) – Paul J Feb 01 '17 at 19:15
  • actually when we read_excel to a dataframe... pandas can automatically take care of duplicate names.. and adds a suffix ".1" ".2" after duplicated column name... this problem only happens.. when we ignore the auto column naming.. and write our own column headers. – ihightower Apr 08 '17 at 18:50
  • please also see below link on how to rename automatically.... http://stackoverflow.com/questions/24685012/pandas-dataframe-renaming-multiple-identically-named-columns (this worked for me with this issue) – ihightower Apr 08 '17 at 19:06
12

I recently got this message, too, and I found like user @jason and @user3805082 above that I had duplicate columns in several of the hundreds of dataframes I was trying to concat, each with dozens of enigmatic varnames. Manually searching for duplicates was not practical.

In case anyone else has the same problem, I wrote the following function which might help out.

def duplicated_varnames(df):
    """Return a dict of all variable names that 
    are duplicated in a given dataframe."""
    repeat_dict = {}
    var_list = list(df) # list of varnames as strings
    for varname in var_list:
        # make a list of all instances of that varname
        test_list = [v for v in var_list if v == varname] 
        # if more than one instance, report duplications in repeat_dict
        if len(test_list) > 1: 
            repeat_dict[varname] = len(test_list)
    return repeat_dict

Then you can iterate over that dict to report how many duplicates there are, delete the duplicated variables, or rename them in some systematic way.

Community
  • 1
  • 1
William Welsh
  • 351
  • 2
  • 11
3

Wrote a small function to concatenate duplicated column names. Function cares about sorting if original dataframe is unsorted, the output will be a sorted one.

def concat_duplicate_columns(df):
    dupli = {}
    # populate dictionary with column names and count for duplicates 
    for column in df.columns:
        dupli[column] = dupli[column] + 1 if column in dupli.keys() else 1
    # rename duplicated keys with °°° number suffix
    for key, val in dict(dupli).items():
        del dupli[key]
        if val > 1:
            for i in range(val):
                dupli[key+'°°°'+str(i)] = val
        else: dupli[key] = 1
    # rename columns so that we can now access abmigous column names
    # sorting in dict is the same as in original table
    df.columns = dupli.keys()
    # for each duplicated column name
    for i in set(re.sub('°°°(.*)','',j) for j in dupli.keys() if '°°°' in j):
        i = str(i)
        # for each duplicate of a column name
        for k in range(dupli[i+'°°°0']-1):
            # concatenate values in duplicated columns
            df[i+'°°°0'] = df[i+'°°°0'].astype(str) + df[i+'°°°'+str(k+1)].astype(str)
            # Drop duplicated columns from which we have aquired data
            df = df.drop(i+'°°°'+str(k+1), 1)
    # resort column names for proper mapping
    df = df.reindex_axis(sorted(df.columns), axis = 1)
    # rename columns
    df.columns = sorted(set(re.sub('°°°(.*)','',i) for i in dupli.keys()))
    return df
DiMithras
  • 605
  • 6
  • 14
2

You need to have the same header names for all the df you want to concat.

Do it for example with :

headername = list(df)

Data = Data.filter(headername)

2

How to reproduce above error from pandas.concat(...):

ValueError: Plan shapes are not aligned

The Python (3.6.8) code:

import pandas as pd
df = pd.DataFrame({"foo": [3] })
print(df)
df2 = pd.concat([df, df], axis="columns")
print(df2)
df3 = pd.concat([df2, df], sort=False) #ValueError: Plan shapes are not aligned

which prints:

   foo
0    3

   foo  foo
0    3    3
ValueError: Plan shapes are not aligned

Explanation of error

If the first pandas dataframe (here df2) has a duplicate named column and is sent to pd.concat and the second dataframe isn't of the same dimension as the first, then you get this error.

Solution

Make sure there are no duplicate named columns:

df_onefoo = pd.DataFrame({"foo": [3] })
print(df_onefoo)
df_onebar = pd.DataFrame({"bar": [3] })
print(df_onebar)
df2 = pd.concat([df_onefoo, df_onebar], axis="columns")
print(df2)
df3 = pd.concat([df2, df_onefoo], sort=False)
print(df2)

prints:

   foo
0    3

   bar
0    3

   foo  bar
0    3    3

   foo  bar
0    3    3

Pandas concat could have been more helpful with that error message. It's a straight up bubbleup-implementation-itis, which is textbook python.

Eric Leschinski
  • 146,994
  • 96
  • 417
  • 335
1

I was receiving the ValueError: Plan shapes are not aligned when adding dataframes together. I was trying to loop over Excel sheets and after cleaning concacting them together.

The error was being raised as their were multiple none columns which I dropped with the code below:

df = df.loc[:, df.columns.notnull()] # found on stackoverflow

IG2013
  • 149
  • 1
  • 5
0

Error is result of having duplicate columns. Use following function in order to remove duplicate function without impacting data.

def duplicated_varnames(df):
    repeat_dict = {}
    var_list = list(df) # list of varnames as strings
    for varname in var_list:
        test_list = [v for v in var_list if v == varname] 
        if len(test_list) > 1: 
            repeat_dict[varname] = len(test_list)
        if len(repeat_dict)>0:
            df = df.loc[:,~df.columns.duplicated()]
    return df
Ruli
  • 2,592
  • 12
  • 30
  • 40