How can I efficiently read only some of the columns of a parquet file hosted in cloud blob storage (e.g. S3 / Azure Blob Storage)?
Columnar storage is one of the parquet format's key advantages, since reading columns selectively can reduce I/O load. It's also natural to keep data in blob storage when running large-scale workloads in the cloud. However, once a parquet file is stored as a blob, most libraries (dask, fastparquet, pyarrow) can't really take advantage of this, since the underlying fseek is not really possible directly on a blob, meaning that regardless of which columns are selected, the entire file has to be downloaded to a local file system before it can be read.
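For reference, this is the kind of selective read I would like to be efficient end to end (the bucket, file, and column names below are just placeholders):

```python
import dask.dataframe as dd

# Selectively read two columns of a parquet blob on S3. Ideally this would
# translate into range requests for just the relevant column chunks rather
# than a full download. Bucket, key, and column names are placeholders.
df = dd.read_parquet(
    "s3://my-bucket/data.parquet",
    columns=["user_id", "price"],
    storage_options={"anon": False},  # credentials picked up from the environment
)
result = df.compute()
```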
What, then, is the best practice if my use case is such that separate applications require different columns, and the performance cost of downloading entire files for just a few columns is unacceptable? Should I store a separate parquet file for each column, along with a common index, and merge them at the application level using pandas/dask etc. (roughly as sketched below)? Does the Apache Parquet format have any built-in support for splitting a dataset by columns, similar to how Hive-style partitioning splits a dataset into partitions by key values?
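To make the per-column idea concrete, this is roughly the layout and merge I have in mind (all paths and column names are made up):

```python
import pandas as pd

# Hypothetical layout: the dataset is pre-split into one parquet file per
# column (or small column group), each file carrying the common index:
#   s3://my-bucket/split/price.parquet
#   s3://my-bucket/split/description.parquet
# An application that only needs prices reads just its own file...
prices = pd.read_parquet(
    "s3://my-bucket/split/price.parquet",
    storage_options={"anon": False},
)

# ...while an application needing several columns joins the per-column
# files back together on the shared index.
descriptions = pd.read_parquet(
    "s3://my-bucket/split/description.parquet",
    storage_options={"anon": False},
)
merged = prices.join(descriptions)
```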
Any help / concrete example using dask or fastparquet is appreciated.