
I'm trying to build a machine learning dataset from a mix of qualitative and quantitative data stored in a CSV file with 4.4 million instances. I'm basically calling get_dummies on the qualitative columns and concatenating the results with the unmodified quantitative columns into a larger dataframe, which I then write to a CSV file. The only problem is that the new CSV file is too big to even read back; based on its file size, I'd guess it holds 35+ million instances. I've checked the dimensions of the individual "columns" that I concatenated together and they're all 4.4 million instances long and at most 14 categories wide. Even the final dataframe that I write out is only 4.4 million instances long and about 400 columns wide, but the CSV file it produces is 35+ million lines long. Does anybody know why this happens?

    import pandas as pd

    # foo_int, foo_float and foo_str are assumed sample values used only for the type checks
    foo_int, foo_float, foo_str = 0, 0.0, ''

    df_new = pd.DataFrame()
    for name in list(df_data.columns):
        df_class = df_data[[name]]
        # handles quantitative columns: keep them unchanged
        if type(df_class.values[0][0]) == type(foo_int) or type(df_class.values[0][0]) == type(foo_float):
            df_new = pd.concat([df_new, df_class], axis=1, sort=False)
            print('Done with ' + name)
            print(df_class.shape)
        # handles qualitative columns: one-hot encode them first
        elif type(df_class.values[0][0]) == type(foo_str):
            df_class = pd.get_dummies(df_class)
            df_new = pd.concat([df_new, df_class], axis=1, sort=False)
            print('Done with ' + name)
            print(df_class.shape)
    You should also share the code you're using to save to `.csv`. – cwalvoort Mar 15 '20 at 22:11
  • [Never call `DataFrame.append` or `pd.concat` inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) – Parfait Mar 15 '20 at 22:47
  • By the way, ideally what is your code supposed to do? What is desired result? Does this code generate the CSV? Please clearly elaborate your input data, actual code process, and current and desired result. – Parfait Mar 15 '20 at 22:51
  • ```df_new.to_csv('dataset.csv', index=False) ``` The code in my original post is building the new data frame and this line right here writes it to a csv file. The random increase in size only happens when I use the whole original dataframe in its entirety (all 4.4 million instances) but not when I only use 100,000 instances. – AsianTemptation Mar 15 '20 at 23:48
  • At this data size, you should consider using multiprocessing or pyspark – DennisLi Mar 16 '20 at 01:51
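
A minimal sketch of the pattern Parfait's linked answer recommends: instead of growing `df_new` with `pd.concat` inside the loop (which re-copies the accumulated frame on every iteration), collect the per-column pieces in a list and concatenate once at the end. The names `df_data` and `dataset.csv` come from the question; the dtype check via `pd.api.types.is_numeric_dtype` is an assumed stand-in for the original `foo_*` type comparisons.

    import pandas as pd

    def build_dataset(df_data: pd.DataFrame) -> pd.DataFrame:
        pieces = []  # collect per-column frames instead of concatenating inside the loop
        for name in df_data.columns:
            col = df_data[[name]]
            if pd.api.types.is_numeric_dtype(df_data[name]):
                # quantitative column: keep as-is
                pieces.append(col)
            else:
                # qualitative column: one-hot encode
                pieces.append(pd.get_dummies(col))
        # a single concat at the end avoids the quadratic copying
        return pd.concat(pieces, axis=1)

    # df_new = build_dataset(df_data)
    # df_new.to_csv('dataset.csv', index=False)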

0 Answers