33

I am trying to concatenate dataframes built from the following two CSV files:

df_a: https://www.dropbox.com/s/slcu7o7yyottujl/df_current.csv?dl=0

df_b: https://www.dropbox.com/s/laveuldraurdpu1/df_climatology.csv?dl=0

Both of these have the same number and names of columns. However, when I do this:

pandas.concat([df_a, df_b])

I get the error:

AssertionError: Number of manager items must equal union of block items
# manager items: 20, # tot_items: 21

How can I fix this?

user308827
  • 1
    Just tried with your data and `pandas==0.17.1` and `concat` works fine. – Stefan Feb 01 '16 at 18:54
  • Hmm, not sure what is happening... I still get the error, and I am using pandas == 0.17.1 as well. – user308827 Feb 01 '16 at 18:59
  • I'm using pandas 0.17.1, Python 2.7.11 on Ubuntu 14.04, and for me it is working fine also. – agold Feb 01 '16 at 19:13
  • I checked the column names with `print df_a.columns == df_b.columns` and the output is: `[ True True True True True True True True True True True True True True False False True False True False False]` – jezrael Feb 01 '16 at 19:17
  • thanks @jezrael, the column names are not in the same order, but they are all present. – user308827 Feb 01 '16 at 19:21

4 Answers

44

I believe that this error occurs if the following two conditions are met:

  1. The data frames have different columns (i.e. `df1.columns.equals(df2.columns)` is False).
  2. The columns contain a repeated value.

Basically, if you concat dataframes with columns [A,B,C] and [B,C,D], pandas can work out how to make one series for each distinct column name. But if I then try to join a third dataframe with columns [B,B,C], it does not know which column to append to and ends up with fewer distinct columns than it thinks it needs.

If your dataframes have identical columns, in the same order, then it will work anyway. So you can join [B,B,C] to [B,B,C], but not to [C,B,B]; when the columns are identical it probably just uses the integer positions or something similar.
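The two conditions above can be checked before calling `concat`. This is only a minimal sketch: `concat_looks_risky` is a hypothetical helper name, not part of pandas, and it tests the conditions as described in this answer rather than pandas' actual internal logic.

```python
import pandas as pd


def concat_looks_risky(df1, df2):
    """Return True when both conditions from the answer hold (hypothetical helper)."""
    # Condition 1: the column labels are not the same labels in the same order.
    columns_differ = not df1.columns.equals(df2.columns)
    # Condition 2: at least one frame has a repeated column name.
    has_repeats = df1.columns.has_duplicates or df2.columns.has_duplicates
    return bool(columns_differ and has_repeats)


same_a = pd.DataFrame([[1, 2, 3]], columns=["B", "B", "C"])
same_b = pd.DataFrame([[4, 5, 6]], columns=["B", "B", "C"])
reordered = pd.DataFrame([[7, 8, 9]], columns=["C", "B", "B"])

print(concat_looks_risky(same_a, same_b))     # identical columns, duplicates present
print(concat_looks_risky(same_a, reordered))  # different order plus duplicates
```

If the helper returns True, deduplicating or renaming the repeated columns first is probably the safer route.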

Daniel Holmes
phil_20686
  • Best explanation I've seen on this issue. Thanks. – Jonathan Nappee Nov 15 '18 at 15:10
  • I was having a problem in the spatial extension geopandas where the `.overlay()` operation was failing due to an error very similar to the original post. It seems that if you have the same column name in both geodataframes, it will enumerate them in the output dataframe ONLY ONCE. On the third overlay operation, it will throw this error. So if you are making a chain-overlay, make sure the column names are different for each geodataframe in the chain. – wfgeo Aug 16 '19 at 11:31
  • Thanks! & FYI, to find duplicate columns: `duplicates = df.columns.duplicated(keep=False)` and then `[x[0] for x in tuple(zip(df.columns, duplicates)) if x[1]]` – Wouter Feb 08 '20 at 09:00
  • Repeated Columns! Of course, thanks a lot for the clear answer ! – FiercestJim Dec 16 '20 at 20:54
9

The answers here did not solve my issue, but this answer did.

The issue was duplicated columns in one or both DataFrames.

Here's a fix for duplicated columns (as per the answer above):

df = df.loc[:,~df.columns.duplicated()]
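As a rough illustration of that fix, here is a sketch on two made-up frames (the data and the names `clean1`, `clean2`, and `result` are assumptions for the example, not the asker's files). It drops every repeated column name except the first occurrence in both frames and then concatenates.

```python
import pandas as pd

# Toy frames with a repeated "id" column; the data is invented for illustration.
df1 = pd.DataFrame([["a", "b", "x", 1], ["a", "b", "y", 2]],
                   columns=["A", "B", "id", "id"])
df2 = pd.DataFrame([["b", "c", "z", 3], ["b", "c", "w", 4]],
                   columns=["B", "C", "id", "id"])

# Keep only the first occurrence of each column name in both frames.
clean1 = df1.loc[:, ~df1.columns.duplicated()]
clean2 = df2.loc[:, ~df2.columns.duplicated()]

# With unique column names, concat can align the columns by label.
result = pd.concat([clean1, clean2], ignore_index=True, sort=False)
print(result.columns.tolist())
print(result.shape)
```

Note that this discards the data in the later duplicates, so it is only appropriate when those columns are genuinely redundant.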
Ukrainian-serge
6

You can get around this issue with a 'manual' concatenation. In this case your list is:

list_of_dfs = [df_a, df_b]

And instead of running

giant_concat_df = pd.concat(list_of_dfs, axis=0)

you can turn all of the dataframes into lists of dictionaries and then make a new data frame from these lists (merged with chain):

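from itertools import chain
import pandas as pd

# Turn each dataframe into a sequence of per-row dicts, then chain them into one list.
list_of_dicts = [cur_df.T.to_dict().values() for cur_df in list_of_dfs]
giant_concat_df = pd.DataFrame(list(chain(*list_of_dicts)))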
kmader
  • 1
    Please be aware that this solution will take significantly longer to complete and will also consume a significant amount of memory on large data frames. – Karatheodory May 28 '19 at 12:03
2

Unfortunately, the source files are no longer available, so I can't check my solution against your case. In my case the error occurred when:

  1. Data frames have two columns with the same name (I had ID and id columns, which I then converted to lower case, so they became the same)
  2. Value types of the same-named columns are different

Here is an example which gives me the error in question:

df1 = pd.DataFrame(data=[
    ['a', 'b', 'id', 1],
    ['a', 'b', 'id', 2]
], columns=['A', 'B', 'id', 'id'])

df2 = pd.DataFrame(data=[
    ['b', 'c', 'id', 1],
    ['b', 'c', 'id', 2]
], columns=['B', 'C', 'id', 'id'])
pd.concat([df1, df2])
>>> AssertionError: Number of manager items must equal union of block items
 # manager items: 4, # tot_items: 5

Removing / renaming one of the columns makes this code work.
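If both columns need to be kept, renaming is the gentler option. The following is a minimal sketch: `make_unique_columns` is a hypothetical helper (not a pandas function) that appends a numeric suffix to every repeated name, and it assumes the suffixed names such as `id_1` do not already exist in the frame.

```python
import pandas as pd


def make_unique_columns(df):
    """Return a copy of df whose repeated column names get numeric suffixes (hypothetical helper)."""
    seen = {}
    new_names = []
    for name in df.columns:
        count = seen.get(name, 0)
        # The first occurrence keeps its name; later ones become name_1, name_2, ...
        new_names.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    renamed = df.copy()
    renamed.columns = new_names
    return renamed


df1 = pd.DataFrame([["a", "b", "id", 1], ["a", "b", "id", 2]],
                   columns=["A", "B", "id", "id"])
df2 = pd.DataFrame([["b", "c", "id", 1], ["b", "c", "id", 2]],
                   columns=["B", "C", "id", "id"])

combined = pd.concat([make_unique_columns(df1), make_unique_columns(df2)],
                     ignore_index=True, sort=False)
print(combined.columns.tolist())
```

Because every column name is unique after renaming, `concat` can align the two frames by label and fill the columns missing from either side with NaN.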

Karatheodory