
I need to analyze data, but the file is 9 GB. When I try to open it, Python is interrupted and returns a MemoryError.

data = pd.read_csv("path.csv")

Is there any way to solve this problem, or should I drop this file?

  • Do you need to open the entire file? You can pass the `chunksize` param to `read_csv`, which will return a chunk at a time. Also, are you using 64-bit Python, OS, etc.? – EdChum May 05 '16 at 13:02
  • What sort of data does it contain? Maybe a sample line or two would help. And do you need all the data inside it, or just a subset? How much memory do you have on your system? Have you tried a 1 GB subset of this file? Do you have a 64-bit OS? Which OS? – John Zwinck May 05 '16 at 13:02
  • If you use your file as a generator (`with open(file) as f: for line in f`) then you will not have to load it all at once and will be able to do something iteratively. I don't think you will be able to use pandas, though, because it assumes you can fit the file in memory – trainset May 05 '16 at 13:03
  • Windows, 64-bit. I need all the data from this file. I have 793 GB free. And I have 4 columns –  May 05 '16 at 13:09
  • It works up to a file size of 1.5 GB –  May 05 '16 at 13:17
  • If you are not restricted to `pandas`, you can use [`sframe`](https://github.com/dato-code/SFrame), which is disk-based and thus gives you the possibility to hold datasets that are too large to fit in your system's memory. – iulian May 05 '16 at 13:21
  • See http://stackoverflow.com/questions/33642951/python-using-pandas-structures-with-large-csviterate-and-chunksize - try using `chunksize` and `concat` as in the answer there. – John Zwinck May 05 '16 at 13:37
  • @JohnZwinck, I don't think it's a good idea - you'll need more memory for this, compared to reading the whole CSV file into a DF in one shot. – MaxU - stand with Ukraine May 05 '16 at 14:01
  • @user6241246, first of all, if you can somehow reduce the amount of data, you should do it. For example, if you don't need __all__ columns for your analysis, you can read only the _interesting_ columns using the `usecols=['colA','colD']` parameter - this will reduce the amount of memory needed for your DF (see the sketch just below these comments). Besides that, if you can do your analysis chunk-by-chunk, you can use the `chunksize` parameter (as has already been mentioned) and process your data in portions. If nothing else helps, you may consider using `Spark SQL` in a clustered environment – MaxU - stand with Ukraine May 05 '16 at 14:17
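
As a minimal sketch of the `usecols` suggestion above: reading only the columns you actually need can shrink the resulting DataFrame considerably. The column names `'colA'` and `'colD'` are placeholders from the comment, not the real column names in the file:

import pandas as pd

# Read only the columns needed for the analysis; 'colA' and 'colD' are
# placeholder names - substitute the real column names from the file.
data = pd.read_csv("path.csv", usecols=["colA", "colD"])

If the reduced columns still don't fit in memory, this can be combined with the `chunksize` approach shown in the answer below.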

1 Answer


As mentioned by EdChum, I use `chunksize=n` to open big files in chunks, then loop through the chunks to do whatever is needed. Specify the number of rows you want in each 'chunk' of data and open the file as follows:

import pandas as pd

chunk_size = 100000  # number of rows per chunk
data = pd.read_csv("path.csv", chunksize=chunk_size)
for chunk in data:
    # each chunk is a DataFrame with up to chunk_size rows
    print(len(chunk))

Hope this helps :)

EllieFev