I am working on a data pipeline in Airflow and keep running into "ValueError: cannot reindex from a duplicate axis", which I have been stuck on for days.

Here is the function that fails:

def fill_missing_dates(df):
    df['TUNING_EVNT_START_DT'] = pd.to_datetime(df['TUNING_EVNT_START_DT'])
    dates = df.set_index('TUNING_EVNT_START_DT').resample('D').asfreq().index
    masdiv = df['MASDIV'].unique()
    station = df['STATION'].unique()
    idx = pd.MultiIndex.from_product((dates, masdiv, station), names=['TUNING_EVNT_START_DT', 'MASDIV', 'STATION'])
    df = df.set_index(['TUNING_EVNT_START_DT', 'MASDIV', 'STATION']).reindex(idx, fill_value=0).reset_index()

    return df

Here is the error output from AWS Cloudwatch logs:

16:31:40
    dates = df.set_index('TUNING_EVNT_START_DT').resample('D').asfreq().index
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/resample.py", line 821, in asfreq
    return self._upsample("asfreq", fill_value=fill_value)
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/resample.py", line 1125, in _upsample
    res_index, method=method, limit=limit, fill_value=fill_value
  File "/usr/local/lib64/python3.7/site-packages/pandas/util/_decorators.py", line 221, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3976, in reindex
    return super().reindex(**kwargs)
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/generic.py", line 4514, in reindex
    axes, level, limit, tolerance, method, fill_value, copy
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3864, in _reindex_axes
    index, method, copy, level, fill_value, limit, tolerance
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3886, in _reindex_index
    allow_dups=False,
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/generic.py", line 4577, in _reindex_with_indexers
    copy=copy,
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/internals/managers.py", line 1251, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "/usr/local/lib64/python3.7/site-packages/pandas/core/indexes/base.py", line 3362, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/scripts/anomaly_detection_model.py", line 275, in <module>
    runner(path_prefix, model_name, execution_id, table)
  File "/tmp/scripts/anomaly_detection_model.py", line 230, in runner
    df = multiprocessing(PROCESSORS, df)
  File "/tmp/scripts/anomaly_detection_model.py", line 121, in multiprocessing
    x = pool.map(iforest, (df.loc[df['MASDIV'] == masdiv] for masdiv in args))
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
ValueError: cannot reindex from a duplicate axis

I've added some logging to inspect the DataFrame at that step, but I can't see anything in the output that points to the issue:

20/02/07 18:40:34 - INFO - __main__ - Where it breaks: df.index(): RangeIndex(start=0, stop=93, step=1)
20/02/07 18:40:34 - INFO - __main__ - Where it breaks: df.columns: Index(['TUNING_EVNT_START_DT', 'MASDIV', 'STATION', 'DOW', 'MOY',
       'TRANSACTIONS', 'DOW_INT', 'MOY_INT', 'DT_NBR'],
      dtype='object')

I have tried everything in these posts, but to no avail:

Pandas error: cannot reindex from a duplicate axis

What does `ValueError: cannot reindex from a duplicate axis` mean?

I am also not entirely sure I understand why this is occurring. Any suggestions are much appreciated.

sanjayr

1 Answer

Without example data I cannot reproduce your error. However, based on the function's name "fill_missing_dates" I think this alternative solution may accomplish what you are trying to achieve.

import pandas as pd

df = pd.DataFrame({
    'date': ["2020-01-01 00:01:00", "2020-01-01 00:02:00", "2020-01-01 01:00:00", "2020-01-01 02:00:00",
             "2020-01-01 00:04:00", "2020-01-01 00:05:00",
             "2020-01-03 00:01:00", "2020-01-03 00:02:00", "2020-01-03 01:00:00", "2020-01-03 02:00:00",
             "2020-01-03 00:04:00", "2020-01-03 00:05:00",
            ],
    'station': ["a","a","a","a","b", "b", "a", "a", "a", "a", "b", "b"],
    'data': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
})

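def resampler(x):
    # Sum each group's rows per calendar day; days with no rows sum to 0.
    return x.set_index('date').resample('D').sum()

df['date'] = pd.to_datetime(df['date'])
multipass = pd.MultiIndex.from_frame(df[["date", "station"]])
df = df.set_index(["date", "station"])
df = df.reindex(multipass)
# Move 'date' back to a column, group by the remaining index level (station),
# and resample each station separately.
df.reset_index(level=0).groupby(level=0).apply(resampler)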

The result fills in missing dates with 0's:

                    data
station date
a       2020-01-01    10
        2020-01-02     0
        2020-01-03    34
b       2020-01-01    11
        2020-01-02     0
        2020-01-03    23
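
Separately from the approach above, the traceback in the question points at the resample('D').asfreq() line rather than at the final reindex. asfreq() reindexes internally, so it needs a unique index, and the date column is presumably not unique once several MASDIV/STATION rows share a day. Below is a minimal sketch of the original function that builds the calendar with pd.date_range instead. It assumes TUNING_EVNT_START_DT holds whole days with no time-of-day component and that each (date, MASDIV, STATION) combination appears at most once; the sample frame and the explicit duplicate check are illustrative additions, not taken from the question's data.

```python
import pandas as pd


def fill_missing_dates(df):
    """Return one row per (day, MASDIV, STATION); rows that were absent are filled with 0."""
    df = df.copy()
    df['TUNING_EVNT_START_DT'] = pd.to_datetime(df['TUNING_EVNT_START_DT'])

    # Build the daily calendar from the min/max dates. Unlike
    # set_index(...).resample('D').asfreq(), this does not require the
    # date column to be unique.
    dates = pd.date_range(df['TUNING_EVNT_START_DT'].min(),
                          df['TUNING_EVNT_START_DT'].max(), freq='D')
    masdiv = df['MASDIV'].unique()
    station = df['STATION'].unique()

    keys = ['TUNING_EVNT_START_DT', 'MASDIV', 'STATION']
    idx = pd.MultiIndex.from_product((dates, masdiv, station), names=keys)

    indexed = df.set_index(keys)
    # reindex has the same uniqueness requirement, so fail with a clearer
    # message if the three key columns together are not unique.
    if not indexed.index.is_unique:
        raise ValueError('duplicate (date, MASDIV, STATION) rows; aggregate them first')

    return indexed.reindex(idx, fill_value=0).reset_index()


# Hypothetical sample: two stations share 2020-01-01, and 2020-01-02 is missing.
sample = pd.DataFrame({
    'TUNING_EVNT_START_DT': ['2020-01-01', '2020-01-01', '2020-01-03'],
    'MASDIV': ['m1', 'm1', 'm1'],
    'STATION': ['a', 'b', 'a'],
    'TRANSACTIONS': [5, 7, 9],
})
filled = fill_missing_dates(sample)
print(filled)
```

If the duplicate check fires on the real data, aggregating first (for example with groupby on the three key columns and sum) would be one way to make the key unique before reindexing.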
sbraden