
JSON isn't a particularly efficient structure for storing data, in terms of both byte overhead and parsing: it has to be parsed sequentially by syntax, with no way to seek directly to a specific segment. Say you have 20 years of timestep data, ~1 TB compressed, and you want to store it efficiently and load / store it as fast as possible for maximum simulation speed.
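To make the overhead concrete, here's a small sketch comparing a JSON-encoded record against a fixed-width binary encoding (the field names and layout are made up for illustration). With fixed-width records, record N lives at byte offset N × record_size, so no parsing is needed to seek to it:

```python
import json
import struct

# Hypothetical timestep record: a 32-bit unsigned step counter and a 64-bit float.
record = {"step": 123456, "value": 3.14159}

json_bytes = json.dumps(record).encode("utf-8")

# Fixed-width binary: 4-byte unsigned int + 8-byte double = 12 bytes per record,
# and record N sits at offset N * 12 -- addressable without parsing anything.
bin_bytes = struct.pack("<Id", record["step"], record["value"])

print(len(json_bytes), len(bin_bytes))
```

Here the binary record is 12 bytes regardless of the values, while the JSON version grows with the number of digits and repeats the key names in every record.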

  • At first I tried relational databases, but those are actually not that fast for this: they're designed to serve queries over a network, not to be read locally, and the client/server protocol layers add overhead.

  • I was able to speed this up by writing a custom binary format with fixed block sizes and header indexes, somewhat like a file system, but this was time-consuming and highly specialized for a single type of data (fixed-length data nodes, for example). Editing the data wasn't a feature; it was a one-time export that took days. I'm sure some library could do it better.

  • I learned about Pandas, but it seems to load from / save to CSV and JSON most commonly, and both of those are plain text, so storing an int takes the space of multiple characters rather than letting me choose, say, a 32-bit unsigned int.
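For what it's worth, Pandas does hold typed columns in memory even if CSV round-trips lose the width; a quick sketch (column name is hypothetical):

```python
import numpy as np
import pandas as pd

# Declare the column as 16-bit: each value occupies exactly 2 bytes in memory.
df = pd.DataFrame({"temp": np.arange(1000, dtype=np.uint16)})

print(df["temp"].dtype)                     # uint16
print(df.memory_usage(deep=True)["temp"])   # 2000 bytes for 1000 values
```

The lossy step is the serialization format, not the DataFrame itself: writing this to CSV and reading it back would re-infer a wider integer type unless the dtype is re-specified on load.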

What's the right tool? Can Pandas do this, or is there something better?

  • I need to be able to specify the data type of each property being stored, so if I only need a 16-bit int, that's the space that gets used.
  • I need to be able to stream reads / writes over big (1-10 TB) data as fast as the hardware fundamentally allows.
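One pattern that covers both requirements above is a memory-mapped file over a fixed NumPy dtype: the OS pages data in on demand, so you can seek to any slice of a file far larger than RAM without reading the rest. A minimal sketch, with a made-up record layout:

```python
import os
import tempfile
import numpy as np

# Hypothetical record layout: 4-byte step counter + 4-byte float = 8 bytes/record.
dtype = np.dtype([("step", np.uint32), ("value", np.float32)])
path = os.path.join(tempfile.mkdtemp(), "timesteps.bin")

# Create a file of one million records and write into it through the mapping.
mm = np.memmap(path, dtype=dtype, mode="w+", shape=(1_000_000,))
mm["step"][:10] = np.arange(10)
mm.flush()

# Reopen read-only and jump straight to an arbitrary record -- record N is
# simply at byte offset N * dtype.itemsize, so no parsing is involved.
ro = np.memmap(path, dtype=dtype, mode="r")
print(int(ro["step"][5]))  # 5
```

This is essentially the custom fixed-block format described above, but with the bookkeeping handled by NumPy; for compressed, columnar, appendable storage, formats like Parquet or HDF5 are the usual next step.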
J.Todd
  • Have you looked into parquet? See [documentation here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-parquet). SO questions: [How to read a Parquet file into Pandas DataFrame?](https://stackoverflow.com/questions/33813815/how-to-read-a-parquet-file-into-pandas-dataframe) and [Python: save pandas data frame to parquet file](https://stackoverflow.com/questions/41066582/python-save-pandas-data-frame-to-parquet-file) – ernest_k Dec 17 '20 at 09:54
  • @ernest_k thanks. no, but I'm definitely listening to a presentation on it now. Looks like Twitter made it, interesting. I wonder what Twitter gains by having their devs / project leaders go to tech tradeshows and show off their data solution. Fun for the devs? Satisfying work environment = developer retention? Or is there a business benefit from open sourcing / advertising their lib? – J.Todd Dec 17 '20 at 10:02
  • Well, a great many open source projects came from companies like Twitter. Many great organizations do that - open source has practical reasons beyond prestige :) - and yes, sometimes that includes sharing an overview of their implementation architecture. – ernest_k Dec 17 '20 at 10:12

0 Answers