I want to do the equivalent of the pandas operation df[df['certain_date'] > '2023-05-26'] in C++. I have gone through almost all the Apache Arrow answers on this site, and I have tried several combinations of the is_in compute function (https://arrow.apache.org/docs/cpp/compute.html), but I couldn't get it working. Is this even possible in C++? Any help would be appreciated.

Abhishek Kumar

1 Answer

It's possible and there are a few ways you could go about it. One way is with the Datasets API:

This assumes you're starting with an arrow::Table named tbl and are okay with getting the filtered rows back as another arrow::Table, named result here:

#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/dataset/api.h>

// 1: Wrap the Table in a Dataset so we can use a Scanner
std::shared_ptr<arrow::dataset::Dataset> dataset =
    std::make_shared<arrow::dataset::InMemoryDataset>(tbl);

// 2: Build ScanOptions for a Scanner to do a basic filter operation
auto options = std::make_shared<arrow::dataset::ScanOptions>();

// Keep rows where column "a" is greater than 3; change for your use case
options->filter = arrow::compute::greater(
    arrow::compute::field_ref("a"),
    arrow::compute::literal(3));

// 3: Build the Scanner; Finish() returns an arrow::Result
auto builder = arrow::dataset::ScannerBuilder(dataset, options);
auto scanner = builder.Finish();

// 4: Perform the Scan and make a Table with the result.
// ValueUnsafe() skips the error check, so test scanner.ok() first in real code.
// ToTable() also returns an arrow::Result, here wrapping the filtered Table.
auto result = scanner.ValueUnsafe()->ToTable();

See https://gist.github.com/amoeba/32d93556560c3386c066b40f3d37d987 for a complete source listing.
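
The filter above compares an integer column, while the question asks for a date cutoff, so the literal has to match the column's type. Here is a minimal sketch, assuming certain_date is stored as an Arrow date32 column (a 32-bit count of days since 1970-01-01). The helper names days_from_civil and days_from_iso_date are made up for illustration and are not part of Arrow; they only turn a YYYY-MM-DD string into that day count using standard calendar arithmetic.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Days between 1970-01-01 and the given proleptic Gregorian date.
// Hypothetical helper (not an Arrow function) using the widely known
// "days from civil" arithmetic.
inline std::int32_t days_from_civil(int y, unsigned m, unsigned d) {
    assert(m >= 1 && m <= 12 && d >= 1 && d <= 31);
    // Count the year as starting on 1 March, so Jan/Feb belong to the
    // previous year and the leap day falls at the end.
    y -= (m <= 2) ? 1 : 0;
    const int era = (y >= 0 ? y : y - 399) / 400;               // 400-year cycle
    const unsigned yoe = static_cast<unsigned>(y - era * 400);  // year of era
    const unsigned doy =
        (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;        // day of year
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    return static_cast<std::int32_t>(
        era * 146097 + static_cast<int>(doe) - 719468);
}

// Parse a strict "YYYY-MM-DD" string into days since the Unix epoch.
// Hypothetical helper; throws std::invalid_argument on a malformed layout.
inline std::int32_t days_from_iso_date(const std::string& s) {
    if (s.size() != 10 || s[4] != '-' || s[7] != '-') {
        throw std::invalid_argument("expected YYYY-MM-DD");
    }
    const int y = std::stoi(s.substr(0, 4));
    const unsigned m = static_cast<unsigned>(std::stoi(s.substr(5, 2)));
    const unsigned d = static_cast<unsigned>(std::stoi(s.substr(8, 2)));
    return days_from_civil(y, m, d);
}
```

For the date in the question, days_from_iso_date("2023-05-26") gives 19503. Under the date32 assumption, that value could presumably be wrapped in an arrow::Date32Scalar and passed to arrow::compute::literal in place of literal(3), with field_ref("certain_date") as the other operand. If the column is actually a timestamp or a string, the literal would need to be built differently.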

amoeba