
I have a large (multi-terabyte) dataset of text (it's in JSON, but I could convert it to a dict or a dataframe). It has multiple keys, such as "group" and "user".

Right now I'm filtering the data by reading through the entire text for these keys. It would be far more efficient to have a structure where I can filter on a key and read only the matching records.

Doing the above would be trivial if it fit in memory, and I could use standard dict/pandas methods and hash tables. But it doesn't fit in memory.
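For reference, the in-memory version I have in mind looks roughly like this (a sketch only; the file and column values are illustrative):

    import pandas as pd

    # Only works because everything fits in RAM.
    df = pd.read_json("data.json", lines=True)

    # Filter on a key directly...
    admins = df[df["group"] == "admins"]

    # ...or build a hash-table-style lookup.
    group_by_user = dict(zip(df["user"], df["group"]))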

There must be an off-the-shelf system for this. Can anyone recommend one?

There are existing discussions about this, but some of the better ones are old. I'm looking for the simplest off-the-shelf solution.

  • Found this thread which seems good, but it's 5 years old and I'm not sure which solution best applies to my case. https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas?rq=1 – Courtney Kristensen Jun 22 '18 at 16:55
  • Maybe creating an HDF5 file (h5py or pytables) would help? (See the sketch after these comments.) – max9111 Jul 06 '18 at 12:43
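To illustrate the HDF5 suggestion in the last comment, here is a minimal sketch using pandas' HDFStore (backed by pytables). It assumes the data is line-delimited JSON with "group" and "user" fields; the file names, store key, chunk size, and string sizes are placeholders, not from the post:

    import pandas as pd

    store = pd.HDFStore("records.h5", complevel=5, complib="blosc")

    # Build the store chunk by chunk so the full dataset never has to fit in memory.
    for chunk in pd.read_json("data.json", lines=True, chunksize=100_000):
        store.append(
            "records",
            chunk,
            data_columns=["group", "user"],          # make these keys queryable on disk
            min_itemsize={"group": 64, "user": 64},  # reserve room for longer strings in later chunks
        )

    # Later, read back only the rows matching a key, without scanning everything.
    subset = store.select("records", where='group == "admins"')
    store.close()

The where= query runs against the on-disk table, which is the "filter and read only the key" behaviour the question asks for.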

1 Answer


I suggest splitting your large file into multiple smaller files with readlines(CHUNK) and then processing them one by one. I worked with large JSON files: at first each file took 45 seconds to process and my program ran for 2 days, but after I split the data it finished in only 4 hours.
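For concreteness, here is a minimal sketch of that splitting approach, assuming the input is one JSON record per line (file names and the size hint are placeholders):

    import json

    SIZE_HINT = 64 * 1024 * 1024  # readlines() hint in bytes, roughly 64 MB per slice

    with open("big.json") as src:
        part = 0
        while True:
            lines = src.readlines(SIZE_HINT)   # reads about SIZE_HINT bytes of whole lines
            if not lines:
                break
            with open(f"part_{part:05d}.json", "w") as out:
                out.writelines(lines)
            part += 1

    # Each small file can then be parsed and filtered independently (or in parallel):
    with open("part_00000.json") as f:
        records = [json.loads(line) for line in f]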
