I'm trying to implement my own scaling function for a pandas DataFrame. The function iterates over the columns, computes each column's mean and standard deviation, and then standardizes every value in that column. It is a method of my CustomScaler class:
    import numpy as np
    from collections import Counter

    class CustomScaler:
        def __init__(self, df):
            self.df = df
            self.mean = 0
            self.std = 0

        def _standardize(self, value):
            # Uses the mean/std of whichever column transform() is currently on.
            return (value - self.mean) / self.std

        def transform(self):
            col_length = self.df.columns.values
            for i, column in enumerate(self.df.columns.values):
                self.mean = np.mean(self.df[column])
                self.std = np.std(self.df[column])
                print("Original mean of the column is {}".format(self.mean))
                print("Original standard deviation of the column is {}".format(self.std))
                try:
                    self.df[column] = self.df[column].apply(self._standardize)
                except ValueError as e:
                    print(e)
                    print("Error on column {}".format(column))
                    print(self.df[column].index)
                    duplicates = [item for item, count in Counter(self.df[column].index).items() if count > 1]
                    print(duplicates)  # print duplicate indices for debugging
                    break
            return self.df.values
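For comparison, here is a minimal sketch of the same standardization done on whole columns at once instead of value by value. The function name standardize_columns and the example data are hypothetical, and ddof=0 is an assumption chosen to match NumPy's default for np.std.

```python
import pandas as pd


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame with each column rescaled to mean 0 and std 1."""
    means = df.mean()        # one mean per column
    stds = df.std(ddof=0)    # population std, assumed to match np.std's default
    # Subtraction and division align on column labels, so every column
    # is scaled by its own statistics.
    return (df - means) / stds


# Hypothetical data, for illustration only.
example = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 40.0]})
scaled = standardize_columns(example)
```

This version never assigns back into the frame one column at a time, so it is mainly useful as a cross-check of the numbers the class prints.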
However, I am receiving a ValueError: cannot reindex from a duplicate axis error on one of the columns (let's call it PROBLEM_COLUMN). There are clearly duplicate index values in the dataset: print(duplicates) shows a list of at least 20-30 indices.
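The duplicate check described above can also be sketched with the index's own methods rather than Counter. The small frame below is made up for illustration and is not the dataset from this question.

```python
import pandas as pd

# Hypothetical frame whose index repeats the label 1.
df = pd.DataFrame({"x": [1, 2, 3, 4]}, index=[0, 1, 1, 2])

# keep=False marks every occurrence of a repeated label, not only the later ones.
dup_mask = df.index.duplicated(keep=False)
dup_labels = df.index[dup_mask].unique().tolist()

print(df.index.is_unique)  # False when any label repeats
print(dup_labels)          # the labels that occur more than once
```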
My question is: why is the error thrown for only ONE of the columns? If I'm using apply() to iterate through each row of a column, shouldn't this error be thrown on the very first column?
After reading other SO posts about similar issues, it's clear that the index contains duplicate values. I tried to reproduce this error by printing self.df[column].index for each column, and the values appear to be identical, except for my PROBLEM_COLUMN.
Is it possible for one column in Pandas to have a different set of indices than other columns?
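One way to probe this is to compare what df[column] actually returns for each column label. The sketch below uses a made-up frame with a repeated column name purely as an assumption for illustration. Whether the real dataset has repeated column names is not known from this post. The helper describe_selection is hypothetical.

```python
import pandas as pd


def describe_selection(df: pd.DataFrame, column):
    """Report the type returned by df[column] and whether its index equals df.index."""
    selected = df[column]
    return type(selected).__name__, selected.index.equals(df.index)


# Hypothetical frame: duplicate row labels AND a repeated column name "b".
frame = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    columns=["a", "b", "b"],
    index=[7, 7],
)

kind_a, same_a = describe_selection(frame, "a")  # unique column label
kind_b, same_b = describe_selection(frame, "b")  # repeated column label

# Column labels that appear more than once.
dup_cols = frame.columns[frame.columns.duplicated()].tolist()
```

In this toy frame both selections carry the same row index as the frame itself, but the repeated label returns a DataFrame rather than a Series, so the same check on the real data would show whether PROBLEM_COLUMN differs in type rather than in index.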