Extra feature values in dataset fragments

Question

After reading a dataset with filters, dataset.fragments still contains fragments for other values of the filtered column. Is this the expected behavior?

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import csv

path_ds = 'path/to/ds/'
path_csv = 'path/to/csv/'

read_options = csv.ReadOptions(autogenerate_column_names=True)
parse_options = csv.ParseOptions(delimiter='|')

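# Stream the CSV in chunks and append each chunk to the partitioned dataset
with csv.open_csv(path_csv, parse_options=parse_options, read_options=read_options) as reader:
    for chunk in reader:
        tbl = pa.Table.from_batches([chunk])

        # Hive-style partitioning: path_ds/f0=<value>/f2=<value>/<file>.parquet
        pq.write_to_dataset(
            tbl,
            root_path=path_ds,
            partition_cols=['f0', 'f2'],
            use_legacy_dataset=False
        )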

temp_dataset = pq.ParquetDataset(
    path_ds,
    use_legacy_dataset=False,
    filters=[('f0', '=', '01.09.2022'), ('f2', '=', 'code1')]
)
print(temp_dataset.fragments)

[<pyarrow.dataset.ParquetFileFragment path=path/to/ds/f0=01.09.2022/f2=code1/008f64795a3640f3a5cab0273fc287b1-0.parquet partition=[f0=01.09.2022, f2='code1']>,
...
<pyarrow.dataset.ParquetFileFragment path=path/to/ds/f0=01.09.2022/f2=code2/5c1225fae02a4226b62f3959f6a57cf0-0.parquet partition=[f0=01.09.2022, f2='code2']>,
...

score 1 · Answer 1 · answered Nov 17 '22 at 12:27

According to the documentation:

Predicates are expressed in disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates. The innermost tuples each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective and multiple column predicate. Finally, the most outer list combines these filters as a disjunction (OR).

This means that if you want to filter the data on both f0 and f2, you need to pass: filters=[[('f0', '=', '01.09.2022'), ('f2', '=', 'code1')]] (note the extra [])

Thank you for the answer! According to [this](https://stackoverflow.com/a/64394428/9053066), omitting the extra `[]` should give the same **and** result. I tried both variants and got the same result, and `dataset.fragments` is still the same. — Bulat Ibragimov, Nov 17 '22 at 14:00
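
To make the DNF rules quoted above concrete, here is a minimal plain-Python sketch of how such a filter list can be evaluated against a single row. It is only a model of the semantics, not pyarrow's implementation: `normalize` and `matches` are hypothetical helpers, and treating a flat list of tuples as one AND group is an assumption based on the linked answer. It says nothing about what `dataset.fragments` returns.

```python
# Illustrative model of DNF filter semantics; not pyarrow code.
# `normalize` and `matches` are hypothetical helpers made up for this sketch.
import operator

OPS = {
    '=': operator.eq, '==': operator.eq, '!=': operator.ne,
    '<': operator.lt, '<=': operator.le, '>': operator.gt, '>=': operator.ge,
}

def normalize(filters):
    """Wrap a flat list of tuples into a single AND group (assumed behavior)."""
    if filters and isinstance(filters[0], tuple):
        return [filters]
    return filters

def matches(row, filters):
    """True if at least one inner list (AND group) is fully satisfied by `row`."""
    return any(
        all(OPS[op](row[col], val) for col, op, val in group)
        for group in normalize(filters)
    )

flat = [('f0', '=', '01.09.2022'), ('f2', '=', 'code1')]      # one AND group
nested = [[('f0', '=', '01.09.2022'), ('f2', '=', 'code1')]]  # same, explicit
either = [[('f0', '=', '01.09.2022')], [('f2', '=', 'code1')]]  # f0 OR f2

row_code1 = {'f0': '01.09.2022', 'f2': 'code1'}
row_code2 = {'f0': '01.09.2022', 'f2': 'code2'}

print(matches(row_code1, flat), matches(row_code1, nested))  # True True
print(matches(row_code2, flat), matches(row_code2, nested))  # False False
print(matches(row_code2, either))                            # True
```

Under this model the flat and nested forms select the same rows, which is consistent with both variants giving the same result; only splitting the tuples into separate inner lists turns the AND into an OR.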
