After reading a dataset with filters, dataset.fragments still lists fragments for other values of the filtered columns. Is this the expected behavior?

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import csv

path_ds = 'path/to/ds/'
path_csv = 'path/to/csv/'

read_options = csv.ReadOptions(autogenerate_column_names=True)
parse_options = csv.ParseOptions(delimiter='|')

with csv.open_csv(path_csv, parse_options=parse_options, read_options=read_options) as reader:
    for chunk in reader:
        tbl = pa.Table.from_batches([chunk])

        pq.write_to_dataset(
            tbl,
            root_path=path_ds,
            partition_cols=['f0', 'f2'],
            use_legacy_dataset=False
        )

temp_dataset = pq.ParquetDataset(
    path_ds,
    use_legacy_dataset=False,
    filters=[('f0', '=', '01.09.2022'), ('f2', '=', 'code1')]
)
print(temp_dataset.fragments)

>>> [<pyarrow.dataset.ParquetFileFragment path=path/to/ds/f0=01.09.2022/f2=code1/008f64795a3640f3a5cab0273fc287b1-0.parquet partition=[f0=01.09.2022, f2='code1']>,
>>> ...
>>> <pyarrow.dataset.ParquetFileFragment path=path/to/ds/f0=01.09.2022/f2=code2/5c1225fae02a4226b62f3959f6a57cf0-0.parquet partition=[f0=01.09.2022, f2='code2']>,
>>> ...

1 Answer

According to the documentation:

Predicates are expressed in disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates. The innermost tuples each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective and multiple column predicate. Finally, the most outer list combines these filters as a disjunction (OR).

This means that to filter the data on both f0 and f2, you need to pass filters=[[('f0', '=', '01.09.2022'), ('f2', '=', 'code1')]] (note the extra pair of square brackets).
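
To make the AND/OR nesting concrete, here is a minimal pure-Python sketch of how a DNF filter list can be evaluated against partition values. The helper matches_dnf, the OPS table, and the '02.09.2022' partition value are hypothetical and exist only for illustration; this is not pyarrow's implementation, just the semantics the quoted paragraph describes.

```python
import operator

# Hypothetical operator table for illustration; only comparison operators are covered.
OPS = {
    '=': operator.eq,
    '==': operator.eq,
    '!=': operator.ne,
    '<': operator.lt,
    '<=': operator.le,
    '>': operator.gt,
    '>=': operator.ge,
}


def matches_dnf(row, filters):
    """Return True if `row` (a dict) satisfies `filters` written in DNF.

    `filters` is a list of conjunctions; each conjunction is a list of
    (column, op, value) tuples. Inner lists are ANDed, the outer list is ORed.
    """
    return any(
        all(OPS[op](row[col], val) for col, op, val in conjunction)
        for conjunction in filters
    )


# Partition values as they might appear in the dataset directory names.
# '02.09.2022' is an assumed extra value, not taken from the question.
partitions = [
    {'f0': '01.09.2022', 'f2': 'code1'},
    {'f0': '01.09.2022', 'f2': 'code2'},
    {'f0': '02.09.2022', 'f2': 'code1'},
]

# One inner list: both predicates must hold (AND).
and_filter = [[('f0', '=', '01.09.2022'), ('f2', '=', 'code1')]]
# Two inner lists: either predicate may hold (OR).
or_filter = [[('f0', '=', '01.09.2022')], [('f2', '=', 'code1')]]

and_result = [p for p in partitions if matches_dnf(p, and_filter)]
or_result = [p for p in partitions if matches_dnf(p, or_filter)]

print(and_result)
print(or_result)
```

With the sample partitions above, the AND form keeps only the f0=01.09.2022/f2=code1 entry, while the OR form keeps all three.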

0x26res
Thank you for the answer! According to [this](https://stackoverflow.com/a/64394428/9053066), omitting the extra `[]` should give the same **AND** result. I tried both variants and got the same result, and `dataset.fragments` is still the same. – Bulat Ibragimov Nov 17 '22 at 14:00