I have a data set containing 2 billion rows in 2 columns: one contains integers and the other contains strings. The CSV file is around 80 GB in total. I'm trying to load the data into a DataFrame using read_csv, but the file is too big to read into memory (I get a MemoryError). I have around 150 GB of RAM available, so it should be no problem. After doing some digging here on the forum I found these 2 possible solutions:

  1. Read the file chunk by chunk. This process takes a very long time and still gives me a memory error, because the resulting DataFrame takes more space than the available 150 GB of RAM.

df = pd.read_csv('path_to_file', iterator=True, chunksize=100000, dtype={'int_column': int, 'string_column': str})
dataframe = pd.concat(df, ignore_index=True)
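For reference, the concat step is what defeats the chunking: the chunks stream in at bounded memory, but pd.concat materialises everything at once. A minimal sketch of processing each chunk as it arrives instead (the toy CSV and the column names are made up for illustration):

```python
import io
import pandas as pd

# Toy stand-in for the 80 GB file; column names are hypothetical.
csv_data = io.StringIO("user_id,text\n1,foo\n2,bar\n3,foo\n")

# Stream the file chunk by chunk and aggregate as we go, so peak
# memory stays at roughly one chunk rather than the whole file.
total_rows = 0
counts = {}
for chunk in pd.read_csv(csv_data, chunksize=2,
                         dtype={'user_id': 'int16', 'text': str}):
    total_rows += len(chunk)
    for value, n in chunk['text'].value_counts().items():
        counts[value] = counts.get(value, 0) + int(n)

print(total_rows)  # 3
print(counts)      # {'foo': 2, 'bar': 1}
```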

  2. Specify the data type for each column using dtype. Specifying them (the integer column as int and the other column as string) still gives me a memory error.

df = pd.read_csv('path_to_file', dtype={'int_column': int, 'string_column': str})
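On the dtype point: plain int maps to 8-byte int64, and strings stored as object keep a full Python object per cell. A small sketch of the difference downcasting and category dtype can make, measured on toy data (sizes below are for the toy Series, not the real file):

```python
import numpy as np
import pandas as pd

# Integer column: the default int64 vs a downcast int16.
ints = np.arange(1000)                # values all fit in int16
s64 = pd.Series(ints, dtype=np.int64)
s16 = pd.Series(ints, dtype=np.int16)
print(s64.memory_usage(index=False, deep=True))  # 8000 bytes
print(s16.memory_usage(index=False, deep=True))  # 2000 bytes

# String column: object dtype vs category, which stores each distinct
# string once plus a small integer code per row.
strings = pd.Series(['foo', 'bar'] * 500)
as_cat = strings.astype('category')
assert as_cat.memory_usage(deep=True) < strings.memory_usage(deep=True)
```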

I also have an HDF5 file built from the same data, but this one only contains the integers. Reading this HDF5 file (equal in size to the CSV file) in both of the ways specified above still gives me a memory error (exceeding 150 GB of RAM).
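A back-of-envelope calculation (row count from above; the per-string cost is a rough CPython assumption, not a measurement) suggests why 150 GB may genuinely not be enough once the string column is stored as object dtype:

```python
n_rows = 2_000_000_000  # 2 billion rows, as stated above

int64_gb = n_rows * 8 / 1e9   # default int64: 8 bytes per value
int16_gb = n_rows * 2 / 1e9   # downcast int16: 2 bytes per value

# An object column stores a pointer (8 B) plus a Python str object;
# ~60 B per short string is a rough estimate, not a measurement.
object_gb = n_rows * (8 + 60) / 1e9

print(int64_gb)   # 16.0
print(int16_gb)   # 4.0
print(object_gb)  # 136.0
```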

Is there a quick and memory-efficient way of loading this data into a DataFrame so I can process it?

Thanks for the help!

  • What are you trying to do after you read it in? If you're having memory errors then any transform will result in a memory error also even after you manage to read it in. – A.Kot Jun 23 '17 at 12:41
  • Well my thoughts are that the total dataframe shouldn't be twice as big, or am I seeing that part wrong? – Wiedenkje Jun 23 '17 at 12:43
  • can you use pyspark? http://spark.apache.org/docs/2.1.0/api/python/index.html – Goodword Jun 23 '17 at 13:41
  • This won't help with the strings, but how big are the integers? Would they fit into `np.int8` or `np.int16`? Specifying just `int` will result in a dtype of `np.int32`, which could be more space than you need. – EFT Jun 23 '17 at 14:22
  • I will look into pyspark, thanks for the suggestion. The integers are limited in size (they refer to users in a different dictionary) so they will fit in np.int16. I will try it with that. – Wiedenkje Jun 23 '17 at 14:29
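Following up on the `np.int16` comment: `pd.to_numeric` with `downcast='integer'` picks the smallest integer type that fits the data, which is a safe way to apply that idea (the sample values here are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical user ids; the largest value decides the downcast target.
ids = pd.Series([1, 200, 30000])
small = pd.to_numeric(ids, downcast='integer')
print(small.dtype)             # int16
print(np.iinfo(np.int16).max)  # 32767, the ceiling for int16
```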

0 Answers