
I am trying to use Python to turn binary files into pandas DataFrames for easy subsetting and data analysis. My package works, but only for small files ('small' meaning ~500 MB). A workable example of the final bit of the code is shown below:

import pandas as pd

list_of_dicts = [{'a': 1, 'b': 2, 'c': 3},{'a': 1, 'b': 2, 'c': 3},{'a': 1, 'b': 2, 'c': 3}]
output = pd.DataFrame(list_of_dicts)   # Memory error occurs here for large files

I can reduce the size of the DataFrame by about 40-50% using `.astype('float32')`, but I need the dtype to be set to float32 before the DataFrame is built, not afterwards, since the memory error occurs during the creation of the DataFrame. Is there a way of changing the default dtype of `pd.DataFrame()` to use float32 instead of float64 and int64?
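For reference, the post-hoc downcast I described looks like this (the data here is a stand-in for the real parsed rows):

```python
import pandas as pd

list_of_dicts = [{'a': 1.0, 'b': 2.0, 'c': 3.0},
                 {'a': 1.0, 'b': 2.0, 'c': 3.0},
                 {'a': 1.0, 'b': 2.0, 'c': 3.0}]

# Columns are first built as float64, then downcast afterwards.
# This roughly halves memory, but only after the float64 frame
# has already been allocated, which is too late for huge inputs.
df = pd.DataFrame(list_of_dicts).astype('float32')
```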

Sal
  • If you are hitting the memory limit, you should consider processing less data or not using pandas... Any further processing could raise the error. – Serge Ballesta Mar 16 '20 at 15:23
  • https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas – AMC Mar 16 '20 at 21:09
  • How exactly are you reading the data? – AMC Mar 16 '20 at 21:09
  • Currently I'm opening the file with `read`, storing a line of information in a dict, appending the dict to a list, then looping to the next line. This gives a list of dicts where each key:value pair in the dict corresponds to a column name and value, and each member of the list corresponds to a row of data. I then call `pd.DataFrame()` on it and specify the columns I want. The reason I do this is that I need to be able to subset and modify columns of data, as well as extract subsets of columns for fitting and making graphs. – Sal Mar 17 '20 at 07:16
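The loop described in the last comment might look roughly like this. Note this is a sketch under assumptions: the real file is binary with a format-specific parser, whereas here the input is assumed to be plain text with three whitespace-separated numeric fields, and the column names `a`, `b`, `c` are illustrative:

```python
import pandas as pd

def load_rows(path):
    """Parse a file line by line into a list of dicts, then build a DataFrame."""
    rows = []
    with open(path) as f:
        for line in f:
            # Parsing is format-specific; three float columns are assumed here.
            a, b, c = line.split()
            rows.append({'a': float(a), 'b': float(b), 'c': float(c)})
    # Each dict becomes one row; keys become column names.
    return pd.DataFrame(rows, columns=['a', 'b', 'c'])
```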

0 Answers