I know how to read a Parquet file at row-group granularity, for example:

import pyarrow.parquet as pp

# ParquetFile gives access to individual row groups
parquet_file = pp.ParquetFile(file)
row_group_contents = parquet_file.read_row_group(row_group_index, columns=[column])

However, I want to read at page granularity. How can I do that?

1 Answer

You can use the filters parameter like:

filters=[('column', '=', 'value')]

Some useful sources:

official docs

Stack Overflow post

Glauco
Thank you very much, but I wasn't trying to filter the data. As far as I know, the Parquet format now provides an index at page granularity (https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/). I would like to know whether there is a way to locate a page directly, for example to read the 2nd page of the 1st column of the 1st row group. – jiazhen Liu May 31 '22 at 08:57