JSON isn't a particularly efficient way to store data, either in bytes of overhead or in parsing cost. It has to be parsed sequentially according to its syntax, for example, rather than letting you seek straight to a specific segment. Say you have 20 years of timestep data, roughly 1 TB compressed, and you want to store it efficiently and load / store it as fast as possible for maximum simulation speed.
At first I tried relational databases, but those are actually not that fast for this: they're designed to be accessed over a network rather than read locally, and the network stack adds its own overhead.
I was able to speed this up by writing a custom binary format with defined block sizes and header indexes, somewhat like a file system, but it was time consuming and highly specialized to a single kind of data, for example fixed-length data nodes. Editing the data wasn't a feature; it was a one-time export that took days to run. I'm sure some library could do it better.
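To make the "fixed-length data nodes" part concrete, here's roughly the shape of what I hand-rolled (simplified sketch; the field names and sizes are made up for illustration):

```python
import struct

# Illustrative record layout: 32-bit timestep id, 16-bit sensor id, 32-bit float value.
RECORD = struct.Struct("<IHf")

def write_records(path, records):
    # Fixed-length nodes: record i always starts at offset i * RECORD.size,
    # so the header index only has to track block boundaries, not every record.
    with open(path, "wb") as f:
        for rec in records:
            f.write(RECORD.pack(*rec))

def read_record(path, i):
    # Random access by seeking, without parsing everything before it (unlike JSON).
    with open(path, "rb") as f:
        f.seek(i * RECORD.size)
        return RECORD.unpack(f.read(RECORD.size))
```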
I've learned about Pandas, but it seems to load to / from CSV and JSON most commonly, and both of those are plain text, so storing an int takes the space of several characters rather than letting me decide that, say, a 32-bit unsigned int is enough.
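This is the kind of size difference I mean (NumPy used only to illustrate the typed-vs-text comparison, not as a claim about the right tool):

```python
import numpy as np

# I choose exactly 16 bits per value instead of letting the text format decide.
values = np.arange(50_000, dtype=np.uint16)

binary_size = values.nbytes                        # 2 bytes per value = 100,000 bytes
text_size = len(",".join(str(v) for v in values))  # CSV-style: up to 5 digits plus a comma each

print(binary_size, text_size)  # the text form is roughly 3x larger, before any parsing cost
```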
What's the right tool? Can Pandas do this, or is there something better?
- I need to be able to specify the data type of each property being stored, so if I only need a 16-bit int, that's all the space that gets used.
- I need to be able to stream reads / writes over big (1-10 TB) data as fast as the hardware fundamentally allows (a rough sketch of the access pattern I mean is below).
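To be concrete about that second point, something in the spirit of this is what I'm after (numpy.memmap and the file name are purely illustrative assumptions, not a suggestion that it's the answer):

```python
import numpy as np

# Hypothetical layout: one typed record per timestep, random access by offset.
dtype = np.dtype([("t", np.uint32), ("sensor", np.uint16), ("value", np.float32)])

# Memory-map the file so the OS pages data in on demand instead of loading 1-10 TB up front.
data = np.memmap("timesteps.bin", dtype=dtype, mode="r")

# Slice a window of timesteps without touching the rest of the file.
window = data[1_000_000:1_001_000]
print(window["value"].mean())
```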