
I'm trying to build a machine learning dataset from a mix of qualitative and quantitative data stored in a CSV file with 4.4 million instances. I'm basically calling get_dummies on the qualitative columns and concatenating the results with the unmodified quantitative columns into a larger dataframe, which I then write to a CSV file. The only problem is that the new CSV file is too big to even read back; based on its file size, I'd guess it holds 35+ million instances. I've checked the dimensions of the individual "columns" that I concatenated together and they're all 4.4 million instances long and at most 14 categories wide. Even the final dataframe that I write out is only 4.4 million instances long and about 400 columns wide, but the CSV file it produces is 35+ million lines long. Does anybody know why this happens?

    import pandas as pd

    # foo_int, foo_float and foo_str are assumed sample values used only for the type checks
    foo_int, foo_float, foo_str = 0, 0.0, ''

    df_new = pd.DataFrame()
    for name in list(df_data.columns):
        df_class = df_data[[name]]
        # handles quantitative columns: keep them unchanged
        if type(df_class.values[0][0]) == type(foo_int) or type(df_class.values[0][0]) == type(foo_float):
            df_new = pd.concat([df_new, df_class], axis=1, sort=False)
            print('Done with ' + name)
            print(df_class.shape)
        # handles qualitative columns: one-hot encode them first
        elif type(df_class.values[0][0]) == type(foo_str):
            df_class = pd.get_dummies(df_class)
            df_new = pd.concat([df_new, df_class], axis=1, sort=False)
            print('Done with ' + name)
            print(df_class.shape)
    You should also share the code you're using to save to `.csv`. – cwalvoort Mar 15 '20 at 22:11
  • [Never call `DataFrame.append` or `pd.concat` inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) – Parfait Mar 15 '20 at 22:47
  • By the way, ideally what is your code supposed to do? What is desired result? Does this code generate the CSV? Please clearly elaborate your input data, actual code process, and current and desired result. – Parfait Mar 15 '20 at 22:51
  • ```df_new.to_csv('dataset.csv', index=False) ``` The code in my original post is building the new data frame and this line right here writes it to a csv file. The random increase in size only happens when I use the whole original dataframe in its entirety (all 4.4 million instances) but not when I only use 100,000 instances. – AsianTemptation Mar 15 '20 at 23:48
  • At this data size, you should consider using multiprocessing or pyspark – DennisLi Mar 16 '20 at 01:51
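
A minimal sketch of the pattern Parfait's linked answer recommends: instead of growing `df_new` with `pd.concat` inside the loop (which re-copies the accumulated frame on every iteration), collect the per-column pieces in a list and concatenate once at the end. The names `df_data` and `dataset.csv` come from the question; the dtype check via `pd.api.types.is_numeric_dtype` is an assumed stand-in for the original `foo_*` type comparisons.

    import pandas as pd

    def build_dataset(df_data: pd.DataFrame) -> pd.DataFrame:
        pieces = []  # collect per-column frames instead of concatenating inside the loop
        for name in df_data.columns:
            col = df_data[[name]]
            if pd.api.types.is_numeric_dtype(df_data[name]):
                # quantitative column: keep as-is
                pieces.append(col)
            else:
                # qualitative column: one-hot encode
                pieces.append(pd.get_dummies(col))
        # a single concat at the end avoids the quadratic copying
        return pd.concat(pieces, axis=1)

    # df_new = build_dataset(df_data)
    # df_new.to_csv('dataset.csv', index=False)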

0 Answers