I'm trying to build a machine learning dataset from a mix of qualitative and quantitative data stored in a csv file with 4.4 million instances. I'm basically using get_dummies on the qualitative columns and concatenating them with the unmodified quantitative columns into a larger dataframe that I then write out to a new csv file.

The only problem is that the new csv file is too big to even read back in. Based on file size, I'd guess it holds 35+ million instances. I've checked the dimensions of the individual "columns" that I concatenated together, and they're all 4.4 million instances long and at most 14 categories wide. Even the final dataframe that I write out is only 4.4 million instances long and about 400 categories wide, yet the resulting csv file is 35+ million lines long. Does anybody know why this happens?
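Since the file is too big to open, one way to check the actual number of physical lines without loading it into memory would be something like this (a minimal sketch; 'new_data.csv' stands in for the real output filename):

# Count the physical lines in the written file without loading it into memory.
# 'new_data.csv' is a placeholder for the actual output filename.
with open('new_data.csv') as f:
    line_count = sum(1 for _ in f)
print(line_count)

Note that the physical line count only equals the row count if no quoted string field contains an embedded newline. Here is the loop that builds the combined dataframe: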
import pandas as pd

# Sentinel values used only for comparing column value types
foo_int, foo_float, foo_str = 0, 0.0, ''

df_new = pd.DataFrame()
for name in df_data.columns:
    df_class = df_data[[name]]
    value = df_class.values[0][0]
    # quantitative columns are carried over unchanged
    if type(value) == type(foo_int) or type(value) == type(foo_float):
        df_new = pd.concat([df_new, df_class], axis=1, sort=False)
        print('Done with ' + name)
        print(df_class.shape)
    # qualitative columns are one-hot encoded with get_dummies
    elif type(value) == type(foo_str):
        df_class = pd.get_dummies(df_class)
        df_new = pd.concat([df_new, df_class], axis=1, sort=False)
        print('Done with ' + name)
        print(df_class.shape)
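For comparison, the same encoding can be expressed in a single call (a sketch, assuming df_data is the original dataframe and that every qualitative column has object dtype; 'new_data.csv' is again a placeholder filename):

# One-hot encode the object (string) columns; numeric columns pass through unchanged.
qual_cols = df_data.select_dtypes(include='object').columns.tolist()
df_new = pd.get_dummies(df_data, columns=qual_cols)
print(df_new.shape)
df_new.to_csv('new_data.csv', index=False)

Printing df_new.shape immediately before writing makes it easy to compare the in-memory row count against the line count of the file afterwards.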