
Is it in memory?

If so, then it doesn't matter if I import chunk by chunk or not because eventually, when I concatenate them, they'll all be stored in memory.

Does that mean for a large data set, there is no way to use pandas?

NoName
  • For a large dataset, depending on how large the data is, you should look at `pyspark` or `dask` – anky Feb 02 '20 at 07:09
  • @anky_91 Okay, but can you answer the 2 questions I posted as well? It'll help my understanding. – NoName Feb 02 '20 at 07:13
  • is there any other place to store data when the program is still manipulating it, for whichever program? – deadvoid Feb 02 '20 at 07:43
  • @deadvoid Yeee, partially on-disk, importing the necessary parts automatically when needed. But to the user, it'll feel like it's all in memory. AKA automatic chunking. – NoName Feb 02 '20 at 07:48
  • I see, but unless the storage is NVMe it's not going to feel the same, since memory speed/bandwidth far exceeds any disk. – deadvoid Feb 02 '20 at 07:53

2 Answers


Yes, they will be stored in memory, and that's the reason why you want to chunk them - it allows you to avoid reading the whole data set in at the same time, and instead process it in chunks before writing out the end result.

You can use `chunksize` to tell pandas how many rows should be read for each chunk. If you need a complete set of rows to perform arbitrary lookups, you'll have to back it with some other technology (such as a database).
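
A minimal sketch of what that pattern can look like in practice - the file name, column name and threshold below are invented for illustration, not taken from the answer:

```python
import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames instead of
# loading the entire file into a single DataFrame.
first = True
for chunk in pd.read_csv("large_data.csv", chunksize=100_000):
    # Do the per-chunk work here, e.g. keep only rows matching a condition
    # ("value" is a hypothetical column name).
    result = chunk[chunk["value"] > 0]

    # Append the processed rows to disk so nothing accumulates in memory.
    result.to_csv("filtered.csv", mode="w" if first else "a",
                  header=first, index=False)
    first = False
```

Only the current chunk (and the rows derived from it) has to fit in memory at any one time; the full result lives on disk.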

MatsLindh
  • So you mean I should finish processing the current chunk before importing another? The problem is, Python doesn't use manual garbage collection, so how exactly do I "free" a DataFrame before importing the next chunk? – NoName Feb 02 '20 at 07:30
  • If you're using chunking in pandas, Python should be able to handle it automagically for you as the chunk goes out of scope (and it becomes necessary to free up the space). – MatsLindh Feb 02 '20 at 07:31
  • Not sure how I could do that without losing track of which chunk I'm on. I'm currently using `for chunk in pd.read_fwf(..., chunksize = ...): temp_df = chunk`. Do you mean that when a chunk is no longer being referenced by a variable (`temp_df`) it'll be instantly freed? So I don't have to break out of that for-loop? – NoName Feb 02 '20 at 07:40
  • @NoName I wouldn’t think about these topics until it becomes a problem. Garbage collection is a large and complex issue, but for your purposes, as soon as something is no longer referenced then it will go away. Maybe not instantly, but eventually. – Boris the Spider Feb 02 '20 at 08:19
  • @BoristheSpider If Python doesn't garbage collect after each iteration, I'm not sure how I can set the `chunksize`, since there could still be multiple chunks in memory, making the program crash on the next iteration of the import. My dataset is too large to import at once, that's why I'm asking. – NoName Feb 02 '20 at 08:30
  • If you're still keeping a live reference to your data outside of the loop (for example appending everything to a list that you keep in memory), the memory can't be freed. If you use `chunksize` you tell pandas "give me this many rows each time, and I'll process them". When you get the next chunk, the previous chunk should no longer be kept around, and Python can free the memory _when needed_. It won't necessarily happen for the next iteration, but when more memory is actually needed. Think about it as reading a file - you can process what you've read without reading everything. – MatsLindh Feb 02 '20 at 09:40
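
As a rough sketch of the loop discussed in this thread (using `pd.read_fwf` as in the comment above; the file name and the trivial row-count "work" are made up), nothing outside the loop holds a reference to the chunk, so Python is free to reclaim each chunk's memory whenever it needs the space:

```python
import pandas as pd

# Anti-pattern: collecting every chunk keeps them all referenced,
# so none of their memory can ever be reclaimed.
# chunks = [c for c in pd.read_fwf("data.txt", chunksize=50_000)]

# Pattern: process each chunk and let the reference drop at the end
# of the iteration.
rows = 0
for chunk in pd.read_fwf("data.txt", chunksize=50_000):
    rows += len(chunk)  # per-chunk work goes here
    # nothing else points at `chunk` after this line, so its memory
    # can be freed once more memory is actually needed

print(f"rows processed: {rows}")
```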

Yes, it is in memory, and yes, when the dataset gets too large you have to use other tools. Of course you can load the data in chunks, process one chunk at a time and write out the results (and so free memory for the next chunk). That works fine for some types of processing, like filtering and annotating, while if you need sorting or grouping you need some other tool; personally I like BigQuery from Google Cloud.
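
A rough sketch of that chunk-at-a-time filter-and-annotate flow (paths and column names are invented, not from the answer); a global sort or groupby, by contrast, needs to see every row at once, which is where tools like BigQuery, dask or pyspark come in:

```python
import pandas as pd

# Row-wise operations (filtering, annotating) only ever need the current
# chunk, so the full dataset never has to fit in memory at once.
with open("annotated.csv", "w", newline="") as out:
    for i, chunk in enumerate(pd.read_csv("big_input.csv", chunksize=200_000)):
        # Keep only the rows we want and tag them with the chunk they came from
        # ("status" and "source_chunk" are hypothetical columns).
        filtered = chunk[chunk["status"] == "ok"].assign(source_chunk=i)
        filtered.to_csv(out, header=(i == 0), index=False)
```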

Marco