I'm attempting to implement my own scaling function for a pandas dataset. The function iterates through each of the columns, calculates the standard deviation and mean, and subsequently standardizes each value. It is encapsulated as a method within my CustomScaler class:

    from collections import Counter

    import numpy as np


    class CustomScaler:

        def __init__(self, df):
            self.df = df
            self.mean = 0
            self.std = 0

        def _standardize(self, value):
            return (value - self.mean) / self.std

        def transform(self):
            for column in self.df.columns.values:
                # per-column statistics used by _standardize
                self.mean = np.mean(self.df[column])
                self.std = np.std(self.df[column])
                print("Original mean of the column is {}".format(self.mean))
                print("Original standard deviation of the column is {}".format(self.std))

                try:
                    self.df[column] = self.df[column].apply(self._standardize)
                except ValueError as e:
                    print(e)
                    print("Error on column {}".format(column))
                    print(self.df[column].index)

                    # index labels that occur more than once
                    duplicates = [item for item, count in Counter(self.df[column].index).items() if count > 1]

                    print(duplicates)  # print duplicate indices for debugging

                    break
            return self.df.values

However, I am receiving `ValueError: cannot reindex from a duplicate axis` on one of the columns (let's call it PROBLEM_COLUMN). There are clearly duplicate index values in the dataset: `print(duplicates)` shows a list of at least 20-30 indices.

My question is: why is the error being thrown for only ONE of the columns? If I'm using apply() to iterate through each row of a column, shouldn't this error be thrown on the very first column?

After reading other SO posts about similar issues, it's clear that there are duplicate values. I tried to track this down by printing self.df[column].index for each column, and the indices appear to be identical, except for PROBLEM_COLUMN.

Is it possible for one column in Pandas to have a different set of indices than other columns?
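
For reference, a minimal diagnostic sketch. Every column selected from a DataFrame as a Series carries the frame's own index, so a per-column difference is more likely to come from the column labels: when a label occurs more than once, `df[label]` returns a DataFrame rather than a Series. Whether that is what happens with PROBLEM_COLUMN is only an assumption (dummified variables can produce repeated names). The frame and its labels below are made up for illustration and are not the original data.

```python
import pandas as pd

# Hypothetical frame: one repeated index label and one repeated column label.
df = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    index=[0, 1, 1],
    columns=["a", "b", "b"],
)

# Index labels that occur more than once (the index is shared by all columns).
dup_index = df.index[df.index.duplicated()].unique().tolist()

# Column labels that occur more than once.
dup_columns = df.columns[df.columns.duplicated()].unique().tolist()

# A unique label selects a Series; a repeated label selects a DataFrame.
type_a = type(df["a"]).__name__
type_b = type(df["b"]).__name__

# The Series for a unique label has exactly the frame's index.
same_index = df["a"].index.equals(df.index)

print(dup_index, dup_columns, type_a, type_b, same_index)
```

If `dup_columns` is non-empty on the real data, the loop in `transform` would be handing a DataFrame to `apply` for those labels, which might account for one column behaving differently from the rest.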

Yu Chen
  • Why not just use a Scaler class from a library? Here's one: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html – Vivek Kumar Dec 07 '17 at 05:24
  • @VivekKumar StandardScaler throws a MemoryError since my feature space is too large when I call its .fit() function – Yu Chen Dec 07 '17 at 05:26
  • Try it with `StandardScaler(copy=False)` – Vivek Kumar Dec 07 '17 at 05:27
  • @VivekKumar same error. – Yu Chen Dec 07 '17 at 05:28
  • Your current implementation also loads the whole dataset into memory as a DataFrame before using it. Please also try passing `df.values` to `StandardScaler(copy=False)` – Vivek Kumar Dec 07 '17 at 05:29
  • @VivekKumar I’m passing into the constructor my X feature space with my dummified variables included. I believe it is indeed a df. What is the improvement by passing in df.values as opposed to a data frame? – Yu Chen Dec 07 '17 at 05:32
  • `df.values` will be a numpy array on which the copy=False will work. – Vivek Kumar Dec 07 '17 at 05:36
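
The in-place idea from the comments can also be sketched with NumPy alone, assuming the features already fit in memory as a single 2-D float array. `standardize_inplace` is a hypothetical helper, not part of scikit-learn, and leaving constant columns at zero is a choice made here for illustration.

```python
import numpy as np


def standardize_inplace(arr):
    """Standardize each column of a 2-D float array in place.

    Works one column at a time, so no second full-size array is allocated.
    """
    for j in range(arr.shape[1]):
        col = arr[:, j]      # a view into arr, not a copy
        mean = col.mean()
        std = col.std()      # population standard deviation (ddof=0), like np.std
        col -= mean          # in-place update writes through to arr
        if std != 0:         # assumption: constant columns are left centered at 0
            col /= std
    return arr


# Tiny made-up example: one varying column, one constant column.
X = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
result = standardize_inplace(X)
print(X)
```

Because the updates write through the column views, `X` itself is modified and the returned array is the same object. The array must have a float dtype; an integer array would raise on the in-place division.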
