Pandas dataframe "ValueError: cannot reindex from a duplicate axis" duplicate indices brute force solution?

Question

import pandas as pd

df_avocado = pd.read_csv("avocado.csv")
df_avocado.set_index("Date", inplace=True)

Issue is here:

'''
Determines all unique regions (e.g. "Alabama", "Alaska", "Arkansas") in the dataframe "df_avocado".
Finds all data points belonging to each unique region.
Dumps those data points into a temporary dataframe "df_region".
Calculates the 25-period simple moving average (25sma) of every df_region.
Dumps the 25sma into "df_avocado_region_25ma" so I can compare the 25sma of every region.
'''

df_avocado_region_25ma = pd.DataFrame()
for region in df_avocado["region"].unique():
    df_region = df_avocado.copy()[df_avocado["region"] == region]
    df_avocado_region_25ma[f"{region}_25ma"] = df_region["AveragePrice"].rolling(25).mean()

Jupyter gives "ValueError: cannot reindex from a duplicate axis" when assigning each region's rolling mean as a new column of df_avocado_region_25ma.

I looked into what the ValueError means. Quoting from the question "What does `ValueError: cannot reindex from a duplicate axis` mean?": "this error usually rises when you join / assign to a column when the index has duplicate values".
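To check that explanation, here is a minimal sketch that seems to trigger the same failure without the avocado data. The labels and prices are made up, and the names `first`, `second`, and `frame` are only for illustration. The exact wording of the message appears to depend on the pandas version (newer releases say "cannot reindex on an axis with duplicate labels"), so the sketch only checks that a ValueError is raised.

```python
import pandas as pd

# Two Series whose indices both contain repeated labels, but are not identical.
first = pd.Series([1.0, 1.2, 1.1], index=["2018-01-07", "2018-01-07", "2018-01-14"])
second = pd.Series([2.0, 2.1, 2.3], index=["2018-01-07", "2018-01-14", "2018-01-14"])

frame = pd.DataFrame()
# An empty frame simply adopts the index of the first Series, duplicates included.
frame["first"] = first

try:
    # pandas has to align `second` to frame.index, which is ambiguous
    # because the labels of `second` are not unique.
    frame["second"] = second
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The first assignment succeeds, which matches what I see in the loop: the error only shows up once a later column has to be aligned to an index that is already in place.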

This makes sense, as the "Date" column (which I set as the index) has a lot of overlapping values. However, I don't care that there are duplicate indices (they provide a high/low for the 25sma), and I don't want to overwrite a previous index (I would prefer to include every data point). Is there any way to brute force it and add all of the points in?

Dataset: www.kaggle.com/neuromusic/avocado-prices

Full code:

import pandas as pd

df_avocado = pd.read_csv("avocado.csv")
wanted_columns = ["Date", "AveragePrice", "region"]
df_avocado = df_avocado[wanted_columns]
df_avocado["Date"] = pd.to_datetime(df_avocado["Date"])
df_avocado.set_index("Date", inplace=True)
df_avocado.sort_index(inplace=True)

df_avocado_region_25ma = pd.DataFrame()
for region in df_avocado["region"].unique():
    df_region = df_avocado.copy()[df_avocado["region"] == region]
    df_avocado_region_25ma[f"{region}_25ma"] = df_region["AveragePrice"].rolling(25).mean()
df_avocado_region_25ma.plot()
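One direction that might work, sketched on made-up data rather than the real CSV: keep "Date" as an ordinary column, compute the rolling mean inside each region with groupby, and number repeated dates so that every row has a unique key before reshaping. The column names `rolling_mean` and `occurrence`, the variable `wide`, the toy prices, and the window of 2 are all assumptions for illustration; the real code would use 25.

```python
import pandas as pd

# Made-up stand-in for the avocado data: two rows per date and region.
dates = pd.to_datetime(["2018-01-07", "2018-01-07", "2018-01-14", "2018-01-14"])
df = pd.DataFrame(
    {
        "Date": list(dates) * 2,
        "region": ["Albany"] * 4 + ["Boston"] * 4,
        "AveragePrice": [1.0, 1.4, 1.2, 1.6, 2.0, 2.4, 2.2, 2.6],
    }
).sort_values("Date", kind="stable")

window = 2  # small window for the toy data; the question uses 25

# Rolling mean within each region. The frame keeps its unique default row
# labels, so transform can align the result back to the rows without error.
df["rolling_mean"] = df.groupby("region")["AveragePrice"].transform(
    lambda prices: prices.rolling(window).mean()
)

# Number repeated dates within each region so (Date, occurrence) is unique.
df["occurrence"] = df.groupby(["region", "Date"]).cumcount()

# One column per region, one row per (Date, occurrence); no data point is dropped.
wide = df.pivot(index=["Date", "occurrence"], columns="region", values="rolling_mean")
print(wide)
```

Every original row survives in `wide`, and because its index is unique, adding or comparing region columns no longer requires pandas to resolve duplicate labels.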
