488

I am getting a ValueError: cannot reindex from a duplicate axis when I try to set an index to a certain value. I tried to reproduce this with a simple example, but I could not do it.

Here is my session inside an ipdb trace. I have a DataFrame with a string index, integer columns, and float values. However, when I try to create a 'sums' row holding the sum of each column, I get the ValueError: cannot reindex from a duplicate axis error. I created a small DataFrame with the same characteristics, but was not able to reproduce the problem. What could I be missing?

I don't really understand what ValueError: cannot reindex from a duplicate axis means. What does this error message mean? Maybe that will help me diagnose the problem, and this is the most answerable part of my question.

ipdb> type(affinity_matrix)
<class 'pandas.core.frame.DataFrame'>
ipdb> affinity_matrix.shape
(333, 10)
ipdb> affinity_matrix.columns
Int64Index([9315684, 9315597, 9316591, 9320520, 9321163, 9320615, 9321187, 9319487, 9319467, 9320484], dtype='int64')
ipdb> affinity_matrix.index
Index([u'001', u'002', u'003', u'004', u'005', u'008', u'009', u'010', u'011', u'014', u'015', u'016', u'018', u'020', u'021', u'022', u'024', u'025', u'026', u'027', u'028', u'029', u'030', u'032', u'033', u'034', u'035', u'036', u'039', u'040', u'041', u'042', u'043', u'044', u'045', u'047', u'047', u'048', u'050', u'053', u'054', u'055', u'056', u'057', u'058', u'059', u'060', u'061', u'062', u'063', u'065', u'067', u'068', u'069', u'070', u'071', u'072', u'073', u'074', u'075', u'076', u'077', u'078', u'080', u'082', u'083', u'084', u'085', u'086', u'089', u'090', u'091', u'092', u'093', u'094', u'095', u'096', u'097', u'098', u'100', u'101', u'103', u'104', u'105', u'106', u'107', u'108', u'109', u'110', u'111', u'112', u'113', u'114', u'115', u'116', u'117', u'118', u'119', u'121', u'122', ...], dtype='object')

ipdb> affinity_matrix.values.dtype
dtype('float64')
ipdb> 'sums' in affinity_matrix.index
False

Here is the error:

ipdb> affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0)
*** ValueError: cannot reindex from a duplicate axis

I tried to reproduce this with a simple example, but I failed:

In [32]: import pandas as pd

In [33]: import numpy as np

In [34]: a = np.arange(35).reshape(5,7)

In [35]: df = pd.DataFrame(a, ['x', 'y', 'u', 'z', 'w'], range(10, 17))

In [36]: df.values.dtype
Out[36]: dtype('int64')

In [37]: df.loc['sums'] = df.sum(axis=0)

In [38]: df
Out[38]: 
      10  11  12  13  14  15   16
x      0   1   2   3   4   5    6
y      7   8   9  10  11  12   13
u     14  15  16  17  18  19   20
z     21  22  23  24  25  26   27
w     28  29  30  31  32  33   34
sums  70  75  80  85  90  95  100
  • Is there any chance that you obfuscated the real column names of your affinity matrix? (i.e. replaced the real values with something else to hide sensitive information) – Korem Dec 01 '14 at 20:12
  • @Korem, I don't think this is true, but even if this is true, why would this cause the above error? – Akavall Dec 01 '14 at 21:10
  • I usually see this when the index assigned to has duplicate values. Since in your case you're assigning a row, I expected a duplicate in the column names. That's why I asked. – Korem Dec 01 '14 at 21:11
  • @Korem, Indeed my actual data had duplicate index values, and I was able to reproduce the error in the small example when duplicate index values were present. You fully answered my question. Thank You. Do you mind putting it as an answer? – Akavall Dec 01 '14 at 21:17
  • If you are trying to assign, merge, etc. and getting this error, a reset index will do: ```df = df.assign(y=df2["y"].reset_index(drop=True))``` – Alex Punnen Apr 28 '22 at 12:08
  • Pandas: This bogus error should be changed to ... ValueError: cannot reindex an axis with a duplicate value – gseattle Sep 19 '22 at 16:26

18 Answers

320

This error usually arises when you join / assign to a column when the index has duplicate values. Since you are assigning to a row, I suspect that there is a duplicate value in affinity_matrix.columns, perhaps not shown in your question.
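
For illustration, here is a minimal sketch that mirrors the toy example from the question but duplicates one row label. This reproduction is an assumption based on the OP's follow-up comment, and newer pandas versions phrase the message as "cannot reindex on an axis with duplicate labels":

import numpy as np
import pandas as pd

# Same toy frame as in the question, except the row label 'x' appears twice
a = np.arange(35).reshape(5, 7)
df = pd.DataFrame(a, index=['x', 'y', 'x', 'z', 'w'], columns=range(10, 17))

# Adding a new row forces pandas to reindex over the non-unique row axis, raising
# ValueError: cannot reindex from a duplicate axis
df.loc['sums'] = df.sum(axis=0)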

Korem
  • To be more accurate, in my case a duplicate value was in `affinity_matrix.index`, but I think this is the same concept. – Akavall Dec 02 '14 at 06:36
  • For those who come to this later: `index` means both `row` and `column names`. I spent 20 minutes on the row index, but it turned out I had duplicated column names that caused this error. – Jia Gao Oct 06 '18 at 18:29
  • To add to this, I came across this error when I tried to reindex a dataframe on a list of columns. Oddly enough, my duplicate was in my original dataframe, so be sure to check both! – n8-da-gr8 Nov 06 '19 at 11:33
  • Thanks @JasonGoal, I had duplicates in index itself. Dropped on index source (before building DF) with `drop_duplicates`. – Denis Dec 27 '21 at 22:36
  • I came across this error because I appended dataframes together, then tried copying one column after modifying the others. The solution was to `reset_index(drop=True)` after appending the dataframes. – DarkHark Jan 12 '23 at 16:31
  • This sort of assumes dropping based on the duplicate index is the right thing to do. The index might be, for instance, timestamps. – user48956 May 17 '23 at 07:20
253

As others have said, you've probably got duplicate values in your original index. To find them do this:

df[df.index.duplicated()]
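
To also see which labels are involved, and to check the columns as well (either axis can be the culprit), here is a quick sketch on a small made-up frame:

import pandas as pd

# Hypothetical frame with a repeated row label and a repeated column name
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'a'], columns=['x', 'y', 'x'])

# keep=False flags every occurrence rather than all-but-the-first
print(df.index[df.index.duplicated(keep=False)])      # Index(['a', 'a'], dtype='object')
print(df.columns[df.columns.duplicated(keep=False)])  # Index(['x', 'x'], dtype='object')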

Matthew
68

Indices with duplicate values often arise if you create a DataFrame by concatenating other DataFrames. If you don't care about preserving the values of your index and just want them to be unique, set ignore_index=True when you concatenate the data (see the sketch below).

Alternatively, to overwrite your current index with a new one, instead of using df.reindex(), set:

df.index = new_index
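
For illustration, here is a small sketch of both options using made-up frames; new_index is simply any unique sequence of labels with the right length:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

# Option 1: let concat build a fresh, unique 0..n-1 index
df = pd.concat([df1, df2], ignore_index=True)

# Option 2: concatenate as-is (index becomes 0, 1, 0, 1) and overwrite it afterwards
df = pd.concat([df1, df2])
new_index = range(len(df))   # any unique labels of the right length will do
df.index = new_index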
Rebeku
    I used ignore_index=True to get my code to work with concatenated dataframes – Gabi Lee Jul 08 '18 at 11:25
  • Indeed, `ignore_index=False` is the default; if the option is to change `append`'s behavior at all, it will have to be because you set it to `True`. – Jeffrey Benjamin Brown Jun 20 '19 at 20:06
  • I spent 10 hours trying to figure out my error and your answer helped me. I was concatenating two dataframes and looking at df.tail() to see the last index. It turned out the index was being duplicated. – Isac Moura Jun 25 '20 at 22:32
  • I think this should be the accepted answer as it not only provides a reason for the error but also a workable solution. – Jio Apr 20 '21 at 08:25
  • what is `new_index`? – dcsan Jun 05 '21 at 08:17
51

Simple Fix

Run this before grouping

df = df.reset_index()

Thanks to this github comment for the solution.

Connor
43

For people who are still struggling with this error, it can also happen if you accidentally create two columns with the same name. Remove the duplicate columns like so:

df = df.loc[:,~df.columns.duplicated()]
Parseltongue
  • Above will delete **all** columns with duplicates; to keep one column use the keep parameter: `df.loc[:,~df.columns.duplicated(keep='first')]` https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.duplicated.html – closedloop Feb 04 '21 at 12:01
  • Thank you... That was helpful for me today. – HassanSh__3571619 Mar 10 '21 at 05:44
  • `keep='first'` or `'last'` does not delete all duplicated values! It keeps one (depending on what you specified in `keep`) and deletes the rest. To delete all duplicates, you should use `keep=False` (the boolean, not the string `'False'`). – Phoenix Jun 16 '22 at 20:56
22

Simply avoid the error by adding .values at the end.

affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0).values
Hadij
  • This is _exactly_ what I needed! Just trying to create a new column, but I had an index with duplicates in it. Using `.values` did the trick. – Paul Wildenhain Jan 02 '20 at 22:03
  • Finally, I found the only answer which actually works! The other answers state the problem but don't give an actual answer as to how to fix it. – Lecdi Apr 12 '22 at 19:15
11

I came across this error today when I wanted to add a new column like this

df_temp['REMARK_TYPE'] = df.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)

I wanted to process the REMARK column of df_temp to return 1 or 0, but I typed the wrong variable, df. It returned an error like this:

----> 1 df_temp['REMARK_TYPE'] = df.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)

/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in __setitem__(self, key, value)
   2417         else:
   2418             # set column
-> 2419             self._set_item(key, value)
   2420 
   2421     def _setitem_slice(self, key, value):

/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in _set_item(self, key, value)
   2483 
   2484         self._ensure_valid_index(value)
-> 2485         value = self._sanitize_column(key, value)
   2486         NDFrame._set_item(self, key, value)
   2487 

/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in _sanitize_column(self, key, value, broadcast)
   2633 
   2634         if isinstance(value, Series):
-> 2635             value = reindexer(value)
   2636 
   2637         elif isinstance(value, DataFrame):

/usr/lib64/python2.7/site-packages/pandas/core/frame.pyc in reindexer(value)
   2625                     # duplicate axis
   2626                     if not value.index.is_unique:
-> 2627                         raise e
   2628 
   2629                     # other

ValueError: cannot reindex from a duplicate axis

As you can see, the right code should be

df_temp['REMARK_TYPE'] = df_temp.REMARK.apply(lambda v: 1 if str(v)!='nan' else 0)

Because df and df_temp have a different number of rows, the assignment had to reindex, and it returned ValueError: cannot reindex from a duplicate axis.

I hope this helps other people debug their code.

GoingMyWay
6

In my case, this error popped up not because of duplicate values, but because I attempted to join a shorter Series to a DataFrame: both had the same index, but the Series had fewer rows (missing the top few). The following worked for my purposes:

df.head()
                          SensA
date                           
2018-04-03 13:54:47.274   -0.45
2018-04-03 13:55:46.484   -0.42
2018-04-03 13:56:56.235   -0.37
2018-04-03 13:57:57.207   -0.34
2018-04-03 13:59:34.636   -0.33

series.head()
date
2018-04-03 14:09:36.577    62.2
2018-04-03 14:10:28.138    63.5
2018-04-03 14:11:27.400    63.1
2018-04-03 14:12:39.623    62.6
2018-04-03 14:13:27.310    62.5
Name: SensA_rrT, dtype: float64

df = series.to_frame().combine_first(df)

df.head(10)
                          SensA  SensA_rrT
date                           
2018-04-03 13:54:47.274   -0.45        NaN
2018-04-03 13:55:46.484   -0.42        NaN
2018-04-03 13:56:56.235   -0.37        NaN
2018-04-03 13:57:57.207   -0.34        NaN
2018-04-03 13:59:34.636   -0.33        NaN
2018-04-03 14:00:34.565   -0.33        NaN
2018-04-03 14:01:19.994   -0.37        NaN
2018-04-03 14:02:29.636   -0.34        NaN
2018-04-03 14:03:31.599   -0.32        NaN
2018-04-03 14:04:30.779   -0.33        NaN
2018-04-03 14:05:31.733   -0.35        NaN
2018-04-03 14:06:33.290   -0.38        NaN
2018-04-03 14:07:37.459   -0.39        NaN
2018-04-03 14:08:36.361   -0.36        NaN
2018-04-03 14:09:36.577   -0.37       62.2
tehfink
  • Thank you! I had become accustomed to filtering and later merging DataFrames and Series' like so: `df_larger_dataframe['values'] = df_filtered_dataframe['filtered_values']` and it hasn't been working lately on TimeSeries - your code solved it! – tw0000 Jun 26 '18 at 16:48
3

I wasted a couple of hours on the same issue. In my case, I had to call reset_index() on a DataFrame before using an apply function. Before merging, or looking up from another indexed dataset, you need to reset the index, as a dataset can have only one index.
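
As a minimal sketch of that pattern (the frames and column names here are made up, not from the original answer):

import pandas as pd

# Hypothetical: 'right' picked up duplicate index labels upstream (e.g. from a concat)
left = pd.DataFrame({'value': [1, -2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({'extra': [10, 20, 30]}, index=[0, 0, 2])

# left['extra'] = right['extra']   # raises: cannot reindex from a duplicate axis

# Rebuild a clean 0..n-1 index before looking up / merging across frames
right = right.reset_index(drop=True)
left['extra'] = right['extra']     # now aligns cleanly on the fresh RangeIndex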

rishi jain
1

I got this error when I tried adding a column from a different table. Indeed I got duplicate index values along the way. But it turned out I was just doing it wrong: I actually needed to df.join the other table.
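
A rough sketch of what that join might look like (df, other, and the column names are placeholders, not the actual tables from this answer):

import pandas as pd

# Hypothetical tables that share meaningful labels in their index
df = pd.DataFrame({'value': [1, 2, 3]}, index=['a', 'b', 'c'])
other = pd.DataFrame({'extra': [10, 30]}, index=['a', 'c'])

# Align on the index explicitly instead of assigning the raw Series
df = df.join(other, how='left')   # rows with no match get NaN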

This pointer might help someone in a similar situation.

Michel de Ruiter
  • Thank you! I was having a hard time trying to add a column as you mentioned from the same table, but with different row/column combinations. I realized the index was duplicated but just wanted that to be ignored in appending a new column... your answer made me realize `df.join` was the way to go. – El- Mar 17 '21 at 14:12
1

Just add .to_numpy() to the end of the series you want to concatenate.
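
For example, a hypothetical sketch; note that .to_numpy() strips the index so pandas assigns by position, which means the length and row order must already match:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
s = pd.Series([10, 20, 30], index=[0, 0, 1])   # assigning s directly would hit the duplicate-axis error

# .to_numpy() discards the Series index, so the values are assigned positionally
df['b'] = s.to_numpy()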

1

In my case it was caused by a mismatch in dimensions: I accidentally used a column from a different df during the mul operation.
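
A hypothetical illustration of that kind of slip (all names made up):

import pandas as pd

df = pd.DataFrame({'qty': [1, 2, 3], 'price': [5.0, 6.0, 7.0]})     # index 0, 1, 2
other = pd.DataFrame({'price': [9.0, 9.5, 8.0]}, index=[0, 0, 1])   # different df, duplicated index

# df['total'] = df['qty'].mul(other['price'])   # the misaligned result typically cannot be
#                                               # reindexed back onto df and raises the error
df['total'] = df['qty'].mul(df['price'])        # intended: both operands from the same frame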

Neil
0

This can also be a cause of the error: it may happen if you are trying to insert a DataFrame-type column into a DataFrame. I solved my problem like this; you can try:

df['my_new'] = pd.Series(my_new.values)
Rohit gupta
0

If you get this error after merging two DataFrames, removing the suffixes, and then trying to write to Excel: your problem is that there are columns you are not merging on that are common to both source DataFrames. Pandas needs a way to say which one came from where, so it adds the suffixes, the defaults being '_x' on the left and '_y' on the right.

If you have a preference on which source data frame to keep the columns from, then you can set the suffixes and filter accordingly, for example if you want to keep the clashing columns from the left:

# Label the two sides, with no suffix on the side you want to keep
df = pd.merge(
    df, 
    tempdf[what_i_care_about], 
    on=['myid', 'myorder'], 
    how='outer',
    suffixes=('', '_delete_suffix')  # Left gets no suffix, right gets something identifiable
)
# Discard the columns that acquired a suffix
df = df[[c for c in df.columns if not c.endswith('_delete_suffix')]]

Alternatively, you can drop one of each of the clashing columns prior to merging, then Pandas has no need to assign a suffix.

camille
0

It happened to me when I appended two DataFrames into a third one (df3 = df1.append(df2)), so the output was:

df1
    A   B
0   1   a
1   2   b
2   3   c

df2
    A   B
0   4   d
1   5   e
2   6   f

df3
    A   B
0   1   a
1   2   b
2   3   c
0   4   d
1   5   e
2   6   f

The simplest way to fix the index is to use the df.reset_index(drop=bool, inplace=bool) method, as Connor said. Set the drop argument to True to avoid the old index being added as a column, and inplace to True to make the index reset permanent.

Here is the official reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html

Alternatively, you can use the .set_index(keys=list, inplace=bool) method, like this:

new_index_list = list(range(0, len(df3)))
df3['new_index'] = new_index_list 
df3.set_index(keys='new_index', inplace=True)

Official reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html

0

Make sure your index does not have any duplicates. I simply did df.reset_index(drop=True, inplace=True) and I don't get the error anymore. If you want to keep the old index as a column, just set drop to False.

Ali Ahmed
0

df = df.reset_index(drop=True) worked for me

pleo
  • Please consider adding some explanation. Code-only answers are discouraged. Why does it work? How is it different from some of the already upvoted 8-year-old answers? – chrslg Dec 07 '22 at 11:53
0

I was trying to create a histogram using seaborn.

sns.histplot(data=df, x='Blood Chemistry 1', hue='Outcome', discrete=False, multiple='stack')

I got ValueError: cannot reindex from a duplicate axis. To solve it, I had to select only the rows where x has no missing values:

data = df[~df['Blood Chemistry 1'].isnull()]