How can I read a specific page of a Parquet file, using Python or Java?

Question

I know how to read a Parquet file at row-group granularity, for example:

import pyarrow.parquet as pq

# file, row_group_index and column are placeholders for your own values
parquet_file = pq.ParquetFile(file)
row_group_contents = parquet_file.read_row_group(row_group_index, columns=[column])

But I want to read at page granularity. How can I do that?

score 0 · Answer 1 · answered May 31 '22 at 08:45


You can use the filters parameter like:

filters=[('column', '=', 'value')]

some useful sources:


Glauco

Thank you very much, but I wasn't trying to filter the data. As far as I know, Parquet files now provide an index at page granularity (https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/). I would like to know if there is a way to locate a page directly, for example to read the 2nd page of the 1st column of the 1st row group. – jiazhen Liu May 31 '22 at 08:57
