
I tried importing a 4 GB CSV file using pd.read_csv but received an out-of-memory error. I then tried dask.dataframe, but couldn't convert it to a pandas dataframe (same memory error):

import pandas as pd
import dask.dataframe as dd

df = dd.read_csv("file.csv")  # placeholder path to the 4 GB CSV
df = df.compute()             # materializes the full dataframe in memory

I then tried the chunksize parameter, but got the same memory error:

import pandas as pd

chunks = pd.read_csv("file.csv", chunksize=1000000, low_memory=False)  # placeholder path; returns an iterator of chunks
df = pd.concat(chunks)  # concatenating still builds the full dataframe in memory

I also tried collecting the chunks in a list, with the same error:

import pandas as pd

chunks = []  # renamed so the built-in list is not shadowed
for chunk in pd.read_csv("file.csv", chunksize=1000000, low_memory=False):  # placeholder path
    chunks.append(chunk)
df = pd.concat(chunks)

Attempts:

  1. Tried with a 1.5 GB file - imported successfully
  2. Tried with the 4 GB file - failed (memory error)
  3. Tried with a smaller chunksize (2000 or 50000) - failed (memory error for the 4 GB file)

Please let me know how to proceed.

I use Python 3.7 and have 8 GB of RAM.

I also tried Attempt 3 on a server with 128 GB RAM, but still got a memory error.

I cannot assign dtypes because the CSV file to be imported can contain different columns at different times.

  • reading a chunk and then storing it in a list with `list.append(chunk)` doesn't make any sense, because the list is what ends up taking your memory. You need to process each chunk as you load it (aggregate it, filter it, or whatever; see the sketch after this thread) before loading the next one. The 128 GB server should, imho, work. My guess is that something is limiting the memory of your process (Docker?) – redacted Jun 04 '19 at 09:15
  • I don't want to aggregate or filter the chunks; I just want to append all the chunks and build a complete `dataframe` out of them. After that I will use this entire `dataframe` for filtering and aggregation through a `GUI toolkit` – user1404 Jun 04 '19 at 09:27
  • well, you do not have enough memory to do it. Imagine you have a bookshelf with capacity for 100 books (your RAM) and you want to fit in 200 books. If you `read_csv()` without chunksize, you take all 200 books at once and try to place them there; they don't fit. If you set `chunksize`, you take 10 books at a time and put them there, but after 10 rounds there is no room left on the bookshelf, so you still run out of memory. – redacted Jun 04 '19 at 09:30
  • To put it simply, I want to build a dataframe out of a heavy CSV file without involving any processing. I would like to skip the processing phase because the CSV file will not contain the same number of columns every time. – user1404 Jun 04 '19 at 09:30
  • @user1404 yes, and as others have stated, you don't have enough memory to materialize the entire dataframe. – juanpa.arrivillaga Jun 04 '19 at 09:31
  • @user1404 What do you want to do with the dataframe in the end? Load it and then exit? – redacted Jun 04 '19 at 09:32
  • "I also tried Attempt 3 on a server with 128 GB RAM, but still got a memory error." I find that surprising. Are you using a 32-bit version of Python, by chance? – juanpa.arrivillaga Jun 04 '19 at 09:32
  • Thank you very much, Robin Nemeth! I clearly understand now. I should stop wasting my time on this. – user1404 Jun 04 '19 at 09:33
  • @juanpa.arrivillaga yes, I'm using the 32-bit version of Python 3.7 – user1404 Jun 04 '19 at 09:34
  • Then that's your problem: you won't be able to access more than a 32-bit address space, so around 4 GB *maximum*, although many operating systems limit it to less. For example, the limit is 2 GB on Windows. – juanpa.arrivillaga Jun 04 '19 at 09:36
  • @RobinNemeth Once I generate the full dataframe, I intend to display it using a GUI toolkit where the end user can perform operations (like sum, mean, filter, change values, change dtype). – user1404 Jun 04 '19 at 09:36
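
A minimal sketch of the per-chunk processing suggested in the comments above: each chunk is reduced to a small summary as it is read, so the full dataframe is never held in memory. The file path and the numeric column name "value" are placeholders, not from the original post.

import pandas as pd

chunk_sums = []
chunk_counts = []

# Reduce each chunk to a small result before reading the next one.
for chunk in pd.read_csv("file.csv", chunksize=1000000, low_memory=False):
    chunk_sums.append(chunk["value"].sum())   # "value" is a hypothetical numeric column
    chunk_counts.append(len(chunk))

# Combine the per-chunk summaries into one overall statistic.
overall_mean = sum(chunk_sums) / sum(chunk_counts)
print(overall_mean)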

2 Answers


This has already been answered here: How to read a 6 GB csv file with pandas

I also tried the above method with a 2 GB file and it worked.

Also, try keeping the chunk size even smaller.

Could you also share the configuration of your system? That would be quite useful.

  • Welcome to SO, Nischal! I would advise you to read the comments under the OP. We have figured out that lowering the chunk size wouldn't accomplish anything and that the issue with the 128 GB RAM server is due to the OP using 32-bit Python. – redacted Jun 04 '19 at 13:13

I just want to record what I tried after getting enough suggestions. Thanks to Robin Nemeth and juanpa!

  1. As juanpa pointed out, I was able to read the CSV file (4 GB) on the server with 128 GB RAM when I used a 64-bit Python executable.

  2. As Robin pointed out, even with a 64-bit executable I'm not able to read the CSV file (4 GB) on my local machine with 8 GB RAM.

So, no matter what we try, the machine's RAM matters, since the dataframe is held entirely in memory.
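
To confirm which kind of interpreter is running, here is a quick check of the Python bitness (a minimal sketch using only the standard library):

import struct
import sys

# Pointer size in bits: 32 on a 32-bit interpreter, 64 on a 64-bit one.
print(struct.calcsize("P") * 8)

# Equivalent check: True only on a 64-bit interpreter.
print(sys.maxsize > 2**32)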
