124

I'm trying to merge a (pandas 0.14.1) dataframe and a series. The series should form a new column, with some NAs (since the index values of the series are a subset of the index values of the dataframe).

This works for a toy example, but not with my data (detailed below).

Example:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(6, 4), columns=['A', 'B', 'C', 'D'], index=pd.date_range('1/1/2011', periods=6, freq='D'))
df1

                   A         B         C         D
2011-01-01 -0.487926  0.439190  0.194810  0.333896
2011-01-02  1.708024  0.237587 -0.958100  1.418285
2011-01-03 -1.228805  1.266068 -1.755050 -1.476395
2011-01-04 -0.554705  1.342504  0.245934  0.955521
2011-01-05 -0.351260 -0.798270  0.820535 -0.597322
2011-01-06  0.132924  0.501027 -1.139487  1.107873

s1 = pd.Series(np.random.randn(3), name='foo', index=pd.date_range('1/1/2011', periods=3, freq='2D'))
s1

2011-01-01   -1.660578
2011-01-03   -0.209688
2011-01-05    0.546146
Freq: 2D, Name: foo, dtype: float64

pd.concat([df1, s1],axis=1)

                   A         B         C         D       foo
2011-01-01 -0.487926  0.439190  0.194810  0.333896 -1.660578
2011-01-02  1.708024  0.237587 -0.958100  1.418285       NaN
2011-01-03 -1.228805  1.266068 -1.755050 -1.476395 -0.209688
2011-01-04 -0.554705  1.342504  0.245934  0.955521       NaN
2011-01-05 -0.351260 -0.798270  0.820535 -0.597322  0.546146
2011-01-06  0.132924  0.501027 -1.139487  1.107873       NaN

The situation with my real data (see below) seems basically identical: concatenating a dataframe with a series whose DatetimeIndex values are a subset of the dataframe's. But it raises the ValueError in the title (shape of passed values is (5, 286), indices imply (5, 276)). Why doesn't it work?

In[187]: df.head()
Out[188]:
                         high       low     loc_h     loc_l
time
2014-01-01 17:00:00  1.376235  1.375945  1.376235  1.375945
2014-01-01 17:01:00  1.376005  1.375775       NaN       NaN
2014-01-01 17:02:00  1.375795  1.375445       NaN  1.375445
2014-01-01 17:03:00  1.375625  1.375515       NaN       NaN
2014-01-01 17:04:00  1.375585  1.375585       NaN       NaN
In [186]: df.index
Out[186]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 17:00:00, ..., 2014-01-01 21:30:00]
Length: 271, Freq: None, Timezone: None

In [189]: hl.head()
Out[189]:
2014-01-01 17:00:00    1.376090
2014-01-01 17:02:00    1.375445
2014-01-01 17:05:00    1.376195
2014-01-01 17:10:00    1.375385
2014-01-01 17:12:00    1.376115
dtype: float64

In [187]:hl.index
Out[187]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 17:00:00, ..., 2014-01-01 21:30:00]
Length: 89, Freq: None, Timezone: None

In: pd.concat([df, hl], axis=1)
Out: [stack trace] ValueError: Shape of passed values is (5, 286), indices imply (5, 276)
birone
  • 2
    Have you tried `append` instead of `concat`? If I understand the `ValueError` correctly, it's saying there are 286 rows of data, but the indices of the data frame are expecting 276 rows. Try checking `len(df.index)` and `len(hl.index)`. – alacy Dec 31 '14 at 13:03
  • df.append(hl) fails with TypeError: 'NoneType' object is not iterable. But then I tried join - thanks! :) – birone Dec 31 '14 at 13:16
  • No problem. Make sure to mark your answer as correct so future SO users can find your solution quickly if they have a similar problem. – alacy Dec 31 '14 at 13:37
  • Will do... when it lets me. – birone Dec 31 '14 at 13:44
  • 4
    The error message could be a lot more helpful, like maybe saying "you probably have some duplicate indices"... – wordsforthewise Jul 13 '18 at 17:02

7 Answers

85

I had a similar problem (join worked, but concat failed).

Check for duplicate index values in df1 and s1 (e.g., df1.index.is_unique).

Removing the rows with duplicate index values (e.g., df = df[~df.index.duplicated()]; note that df.drop_duplicates() compares row values, not index labels) or using one of the methods here https://stackoverflow.com/a/34297689/7163376 should resolve it.
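A minimal sketch of that check, using made-up data (the values, the repeated 17:01 timestamp, and the names df, hl, and df_unique are assumptions for illustration, not the asker's data). It tests each input's index for uniqueness, lists the repeated labels, keeps the first row per label, and only then concatenates:

```python
import pandas as pd

# Hypothetical data: the DataFrame index repeats the 17:01 timestamp.
idx = pd.to_datetime([
    "2014-01-01 17:00", "2014-01-01 17:01",
    "2014-01-01 17:01", "2014-01-01 17:02",
])
df = pd.DataFrame({"high": [1.0, 2.0, 2.5, 3.0]}, index=idx)
hl = pd.Series(
    [10.0, 30.0],
    index=pd.to_datetime(["2014-01-01 17:00", "2014-01-01 17:02"]),
    name="hl",
)

# Diagnose: is each index unique, and which labels occur more than once?
print(df.index.is_unique, hl.index.is_unique)
print(df.index[df.index.duplicated(keep=False)].unique())

# Keep the first row for each label, then concatenate column-wise.
df_unique = df[~df.index.duplicated(keep="first")]
result = pd.concat([df_unique, hl], axis=1)
print(result)
```

Keeping the first row is only one possible policy; if the duplicated rows carry different values, aggregating them (for example with a groupby on the index) may be more appropriate.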

Tim Stack
lmart999
  • 4
    That worked thanks! I'm doing it like this: df = pd.concat([df1, df2], axis=1, join_axes=[df1.index]). If I have dups in df2 then I get this error. Makes sense as it doesn't know how to map multiple duplicate indexes across both DFs. – sparrow Mar 28 '17 at 18:33
  • 3
    To drop duplicate indices, use `df = df.loc[df.index.drop_duplicates()]`. C.f. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.drop_duplicates.html – BallpointBen Apr 18 '18 at 15:25
  • 3
    The suggestion to check for duplicate index values in both indices is likely what will help many people reading this question – dsugasa Apr 02 '20 at 12:25
  • To drop duplicate indices, the best option could be `df = df[~df.index.duplicated(keep='first')]`; see https://stackoverflow.com/questions/13035764/remove-rows-with-duplicate-indices-pandas-dataframe-and-timeseries – ztl Nov 25 '20 at 09:36
43

My problem was mismatched indices; the following code solved it.

df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
df = pd.concat([df1, df2], axis=1)
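As a rough illustration of why this helps (the frames and column names below are invented for the example): reset_index(drop=True) throws away the existing labels and gives each frame a fresh 0..n-1 index, which is always unique, so concat pairs rows by position instead of by label.

```python
import pandas as pd

# Hypothetical frames: df1 has a duplicated label, df2 has unrelated labels.
df1 = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 0, 1])
df2 = pd.DataFrame({"b": [4, 5, 6]}, index=[5, 6, 7])

# drop=True discards the old labels; each frame gets a RangeIndex 0..n-1.
left = df1.reset_index(drop=True)
right = df2.reset_index(drop=True)

# Rows are now paired purely by position.
combined = pd.concat([left, right], axis=1)
print(combined)
```

This only makes sense when both inputs have the same length and row order. It would not give the asker's desired result, where the series covers a subset of the dataframe's timestamps and the gaps should become NaN.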
fses91
  • 1
    I ended up with this problem and reset_index() solved it. What was the problem in the original index and how did reset_index() solve it? – rubpa Dec 02 '20 at 06:45
7

To drop duplicate indices, use df = df.loc[df.index.drop_duplicates()]. C.f. pandas.pydata.org/pandas-docs/stable/generated/… – BallpointBen Apr 18 at 15:25

This is wrong, but I can't reply directly to BallpointBen's comment due to low reputation. The reason it's wrong is that df.index.drop_duplicates() returns an Index of the unique labels, but when you index back into the dataframe using those unique labels, it still returns all records. I think this is because indexing with a duplicated label returns every row that carries that label.

Instead, use df.index.duplicated(), which returns a boolean array (add the ~ to get the not-duplicated records):

df = df.loc[~df.index.duplicated()]
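The difference can be seen on a tiny made-up frame (the column x and the variable names are assumptions for illustration). Selecting by the unique labels should still return both rows labelled 0, while the boolean mask keeps one row per label:

```python
import pandas as pd

# Hypothetical frame with the label 0 appearing twice.
df = pd.DataFrame({"x": [1, 2, 3]}, index=[0, 0, 1])

# Label-based selection: each label pulls in every row that carries it.
unique_labels = df.index.drop_duplicates()
via_labels = df.loc[unique_labels]

# Mask-based selection: duplicated() marks repeats, ~ keeps first occurrences.
via_mask = df.loc[~df.index.duplicated()]

print(len(via_labels), len(via_mask))
```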
ASGM
Jeremy Matt
5

Aus_lacy's post gave me the idea of trying related methods, of which join does work:

In [196]:

hl.name = 'hl'
Out[196]:
'hl'
In [199]:

df.join(hl).head(4)
Out[199]:
                         high       low     loc_h     loc_l        hl
2014-01-01 17:00:00  1.376235  1.375945  1.376235  1.375945  1.376090
2014-01-01 17:01:00  1.376005  1.375775       NaN       NaN       NaN
2014-01-01 17:02:00  1.375795  1.375445       NaN  1.375445  1.375445
2014-01-01 17:03:00  1.375625  1.375515       NaN       NaN       NaN

Some insight into why concat works on the example but not this data would be nice though!

birone
3

Your indexes probably contain duplicated values.

import pandas as pd

T1_INDEX = [
    0,
    1,  # <= !!! if I write e.g.: "0" here then it fails
    0.2,
]
T1_COLUMNS = [
    'A', 'B', 'C', 'D'
]
T1 = [
    [1.0, 1.1, 1.2, 1.3],
    [2.0, 2.1, 2.2, 2.3],
    [3.0, 3.1, 3.2, 3.3],
]

T2_INDEX = [
    1.2,
    2.11,
]

T2_COLUMNS = [
    'D', 'E', 'F',
]
T2 = [
    [54.0, 5324.1, 3234.2],
    [55.0, 14.5324, 2324.2],
    # [3.0, 3.1, 3.2],
]
df1 = pd.DataFrame(T1, columns=T1_COLUMNS, index=T1_INDEX)
df2 = pd.DataFrame(T2, columns=T2_COLUMNS, index=T2_INDEX)


print(pd.concat([pd.DataFrame({})] + [df2, df1], axis=1))
kfr
1

Try sorting the index after concatenating:

result=pd.concat([df1,df2]).sort_index()
Skatox
-2

Maybe it is simple: if you are combining DataFrames, make sure that both matrices or vectors you're trying to combine have the same row names (index).

I had the same issue. I changed the row index names so that they matched each other. In my case, a matrix (principal components) and a vector (target) needed the same row indices.

Before, when it was not working, the matrix had the normal row indices (0, 1, 2, 3) while the vector had the row indices (ID0, ID1, ID2, ID3). I then changed the vector's row indices to (0, 1, 2, 3) and it worked for me.

[Image: screenshot with the matching row indices circled in blue on the left side]
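A small sketch of that relabelling, with invented values (components, target, and the column names are hypothetical stand-ins for the matrix and vector described above). It assumes both objects have the same number of rows in the same order:

```python
import pandas as pd

# Hypothetical matrix with the default 0..3 index.
components = pd.DataFrame({"pc1": [0.1, 0.2, 0.3, 0.4],
                           "pc2": [1.0, 2.0, 3.0, 4.0]})

# Hypothetical vector whose labels do not match the matrix.
target = pd.Series([0, 1, 0, 1],
                   index=["ID0", "ID1", "ID2", "ID3"], name="target")

# Give the vector the matrix's labels so concat can align row by row.
target_aligned = target.copy()
target_aligned.index = components.index

combined = pd.concat([components, target_aligned], axis=1)
print(combined)
```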

Ahmad