I have a data set containing 2 billion rows in 9 columns; one column contains integers and the others contain strings. The CSV file is around 80 GB in total. I'm trying to load the data into a DataFrame using read_csv, but the file is too big to read into memory (I get a MemoryError). I have around 150 GB of RAM available, so it should be no problem. After doing some digging here on the forum, I found these two possible solutions:
- Here they give a solution to read the file chunk by chunk, but this process takes a very long time, and it still gives me a memory error because the resulting DataFrame takes up more space than the available 150 GB of RAM:
import pandas as pd
df = pd.read_csv('path_to_file', iterator=True, chunksize=100000,
                 dtype={'int_column': 'int64', 'string_column': 'string'})  # placeholder column names
dataframe = pd.concat(df, ignore_index=True)
- Here they give a solution of specifying the data type for each column with the dtype argument. Specifying the dtypes still gives me a memory error (with the integer column as int and the other columns as string):
df = pd.read_csv('path_to_file',
                 dtype={'int_column': 'int64', 'string_column': 'string'})  # placeholder column names
I also have an HDF file built from the same data, but this one only contains the integer columns. Reading this HDF file (the same size as the CSV file) in both ways specified above still gives me a memory error (exceeding the 150 GB of RAM).
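For completeness, this is roughly how I'm reading the HDF file; the file name and the key 'data' are placeholders for my actual ones:

# whole-file read (runs out of memory)
df = pd.read_hdf('path_to_hdf_file.h5', key='data')

# chunked read (only works if the file was written in 'table' format)
chunks = pd.read_hdf('path_to_hdf_file.h5', key='data', iterator=True, chunksize=100000)
df = pd.concat(chunks, ignore_index=True)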
Is there a quick and memory-efficient way of loading this data into a DataFrame so I can process it?
Thanks for the help!