I am working on a data pipeline in Airflow, and keep running into this ValueError: cannot reindex from a duplicate axis
that I've been beating my head against for days.
Here is the function that is messing up:
def fill_missing_dates(df):
df['TUNING_EVNT_START_DT'] = pd.to_datetime(df['TUNING_EVNT_START_DT'])
dates = df.set_index('TUNING_EVNT_START_DT').resample('D').asfreq().index
masdiv = df['MASDIV'].unique()
station = df['STATION'].unique()
idx = pd.MultiIndex.from_product((dates, masdiv, station), names=['TUNING_EVNT_START_DT', 'MASDIV', 'STATION'])
df = df.set_index(['TUNING_EVNT_START_DT', 'MASDIV', 'STATION']).reindex(idx, fill_value=0).reset_index()
return df
Here is the error output from AWS Cloudwatch logs:
16:31:40
dates = df.set_index('TUNING_EVNT_START_DT').resample('D').asfreq().index
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/resample.py", line 821, in asfreq
16:31:40
return self._upsample("asfreq", fill_value=fill_value)
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/resample.py", line 1125, in _upsample
16:31:40
res_index, method=method, limit=limit, fill_value=fill_value
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/util/_decorators.py", line 221, in wrapper
16:31:40
return func(*args, **kwargs)
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3976, in reindex
16:31:40
return super().reindex(**kwargs)
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/generic.py", line 4514, in reindex
16:31:40
axes, level, limit, tolerance, method, fill_value, copy
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3864, in _reindex_axes
16:31:40
index, method, copy, level, fill_value, limit, tolerance
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3886, in _reindex_index
16:31:40
allow_dups=False,
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/generic.py", line 4577, in _reindex_with_indexers
16:31:40
copy=copy,
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/internals/managers.py", line 1251, in reindex_indexer
16:31:40
self.axes[axis]._can_reindex(indexer)
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/indexes/base.py", line 3362, in _can_reindex
16:31:40
raise ValueError("cannot reindex from a duplicate axis")
16:31:40
ValueError: cannot reindex from a duplicate axis
16:31:40
"""
16:31:40
The above exception was the direct cause of the following exception:
16:31:40
Traceback (most recent call last):
16:31:40
File "/tmp/scripts/anomaly_detection_model.py", line 275, in <module>
16:31:40
runner(path_prefix, model_name, execution_id, table)
16:31:40
File "/tmp/scripts/anomaly_detection_model.py", line 230, in runner
16:31:40
df = multiprocessing(PROCESSORS, df)
16:31:40
File "/tmp/scripts/anomaly_detection_model.py", line 121, in multiprocessing
16:31:40
x = pool.map(iforest, (df.loc[df['MASDIV'] == masdiv] for masdiv in args))
16:31:40
File "/usr/lib64/python3.7/multiprocessing/pool.py", line 268, in map
16:31:40
return self._map_async(func, iterable, mapstar, chunksize).get()
16:31:40
File "/usr/lib64/python3.7/multiprocessing/pool.py", line 657, in get
16:31:40
raise self._value
16:31:40
ValueError: cannot reindex from a duplicate axis
I've ran some logger's to get an idea about the output of the dataframe at that step, but I'm not seeing what the issue points at:
18:40:34
20/02/07 18:40:34 - INFO - __main__ - Where it breaks: df.index(): RangeIndex(start=0, stop=93, step=1)
18:40:34
20/02/07 18:40:34 - INFO - __main__ - Where it breaks: df.columns: Index(['TUNING_EVNT_START_DT', 'MASDIV', 'STATION', 'DOW', 'MOY',
18:40:34
'TRANSACTIONS', 'DOW_INT', 'MOY_INT', 'DT_NBR'],
18:40:34
dtype='object')
I have tried everything in these posts, but to no avail:
Pandas error: cannot reindex from a duplicate axis
What does `ValueError: cannot reindex from a duplicate axis` mean?
I am not entirely sure I understand why this is occuring either. Any suggestions are much appreciated.