
I just discovered Parquet and it met my "big" data processing / (local) storage needs:

  • faster than relational databases, which are designed to work over a network connection (adding overhead) and just aren't as fast as a format designed for local storage
  • compared to JSON or CSV: it stores data efficiently with real types (instead of everything being a string) and lets you read specific chunks of the file far more selectively than JSON or CSV allow

But to my dismay, while Node.js has a fully functional library for it, the only Parquet lib for Python seems to be, quite literally, a half-measure:

parquet-python is a pure-python implementation (currently with only read-support) of the parquet format ... Not all parts of the parquet-format have been implemented yet or tested e.g. nested data

So what gives? Is there something better than Parquet already supported by Python that lowers interest in developing a library to support it? Is there some close alternative?

J.Todd

1 Answer


Actually, you can read and write Parquet with pandas, which is commonly used for data jobs (though not for ETL on big data). For handling Parquet, pandas relies on one of two common packages:

pyarrow is a cross-platform library that provides an in-memory columnar format (Apache Arrow). Since Parquet is also a columnar format, pyarrow supports it, though pyarrow is a broader library that handles a variety of formats.

fastparquet is designed solely around the Parquet format, targeting Python-based big-data workflows.
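
As a minimal sketch (the file name and sample data are illustrative), the round trip through pandas looks like this; the engine argument selects which of the two packages does the work:

```python
import pandas as pd

# Build a small typed DataFrame: the dtypes survive the round trip,
# unlike CSV, where everything comes back as strings/objects.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "price": [9.99, 4.50, 12.00],
    "label": ["a", "b", "c"],
})

# Write with pyarrow (engine="fastparquet" works the same way).
df.to_parquet("example.parquet", engine="pyarrow")

# Read back only the columns you need -- thanks to Parquet's columnar
# layout, the other columns don't have to be read at all.
subset = pd.read_parquet("example.parquet", engine="pyarrow",
                         columns=["id", "price"])
print(subset.dtypes)
```

If you omit the engine argument, pandas picks whichever of the two packages is installed.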

null
  • This post covers usage. https://stackoverflow.com/questions/33813815/how-to-read-a-parquet-file-into-pandas-dataframe – philosofool Jun 08 '21 at 13:32