I want to do the equivalent of the pandas operation df[df['certain_date'] > '2023-05-26'] in C++ with Apache Arrow. I have gone through almost all the Apache Arrow related answers on this site. I have been trying some combinations of the is_in compute function (https://arrow.apache.org/docs/cpp/compute.html) but couldn't get it working. Is this even possible to do in C++? Any help would be appreciated.
Abhishek Kumar
1 Answer
It's possible, and there are a few ways you could go about it. One way is with the Datasets API.
Assuming you're starting with an arrow::Table, tbl, and are okay ending up with another arrow::Table holding the filtered rows, result:
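#include <memory>
#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/dataset/api.h>

// 1: Wrap the Table in a Dataset so we can use a Scanner
std::shared_ptr<arrow::dataset::Dataset> dataset =
    std::make_shared<arrow::dataset::InMemoryDataset>(tbl);
// 2: Build ScanOptions for a Scanner to do a basic filter operation
auto options = std::make_shared<arrow::dataset::ScanOptions>();
options->filter = arrow::compute::greater(
    arrow::compute::field_ref("a"),   // Column to compare; change for your use case
    arrow::compute::literal(3));      // Value to compare against; change for your use case
// 3: Build the Scanner (Finish() returns an arrow::Result wrapping the Scanner)
auto builder = arrow::dataset::ScannerBuilder(dataset, options);
auto scanner = builder.Finish();
// 4: Perform the Scan and make a Table with the result.
// ValueUnsafe() skips error checking, so check scanner.ok() first in real code.
// ToTable() also returns an arrow::Result, so result wraps the filtered Table.
auto result = scanner.ValueUnsafe()->ToTable();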
See https://gist.github.com/amoeba/32d93556560c3386c066b40f3d37d987 for a complete source listing.
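For the date comparison in the question, the literal has to match the column's type. The sketch below assumes certain_date is stored as an Arrow date32 column (days since 1970-01-01), which may not be true for your data. It uses only the standard library to turn a calendar date into that day count. The helper name days_from_civil is not part of Arrow; it is an adaptation of Howard Hinnant's well-known public-domain date algorithm.

```cpp
#include <cstdint>

// Hypothetical helper (not an Arrow API): number of days between
// 1970-01-01 and the given proleptic Gregorian calendar date.
// Negative results are dates before the Unix epoch.
std::int32_t days_from_civil(int y, unsigned m, unsigned d) {
    // Treat January and February as months 11 and 12 of the previous year,
    // so the leap day falls at the end of the shifted year.
    y -= (m <= 2) ? 1 : 0;
    const int era = (y >= 0 ? y : y - 399) / 400;                // 400-year cycle index
    const unsigned yoe = static_cast<unsigned>(y - era * 400);   // year of era, [0, 399]
    const unsigned doy =
        (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;         // day of shifted year, [0, 365]
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;  // day of era, [0, 146096]
    return static_cast<std::int32_t>(era * 146097 + static_cast<int>(doe) - 719468);
}

// Assumption: with a date32 column, the day count could then be wrapped in a
// date32 scalar and passed to arrow::compute::literal in place of literal(3),
// with field_ref("certain_date") in place of field_ref("a").
```

If the column is a timestamp or a string instead, the literal would need to be built differently (or the column cast first), so check the schema of tbl before relying on this.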

amoeba