
Apache Arrow appears to support queries similar to SQL's "group by" (see the documentation), but I could not find any examples of how to use this from C++. Given an arrow::Table, I want to get the count of each distinct value in a given column (I know I could iterate over the column manually). If group-by is the wrong tool for this, let me know, but an example of "group by" in C++ Arrow would still be useful: there are examples for Python, and I could not find any for C++.

user3117152

1 Answer


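For the most flexibility, build and execute a plan. Here sample_table is assumed to be a std::shared_ptr<arrow::Table> with a "keys" column to group by and a "values" column to aggregate: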

arrow::compute::Aggregate aggregate;
aggregate.function = "hash_sum";                             // The function to apply
aggregate.name = "SUM OF VALUES";                            // The default name of the output column
aggregate.options = nullptr;                                 // Custom options (e.g. how to handle null)
aggregate.target = std::vector<arrow::FieldRef>({"values"}); // Which field to aggregate. Some aggregate functions (e.g. covariance)
                                                             // may require targeting multiple fields
arrow::compute::Declaration plan = arrow::compute::Declaration::Sequence({
  {"table_source", arrow::compute::TableSourceNodeOptions(std::move(sample_table))},
  {"aggregate", arrow::compute::AggregateNodeOptions(/*aggregates=*/{aggregate}, /*keys=*/{"keys"})}
});

ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> grouped, arrow::compute::DeclarationToTable(std::move(plan)));

However, if all you want to do is apply a group-by operation, there is also a convenience function:

// aggregate is defined the same as above
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> grouped,
                      arrow::compute::TableGroupBy(std::move(sample_table), {std::move(aggregate)}, {"keys"}));
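
To make it clearer what either call computes, here is a minimal plain C++ sketch of the same grouping logic with no Arrow calls at all. It assumes string keys, 64-bit integer values and no nulls. `GroupResult` and `GroupByKey` are hypothetical names made up for illustration, not Arrow APIs. The `sum` field mirrors what "hash_sum" produces per key, and `count` mirrors the per-value count asked about in the question (Arrow also provides a grouped "hash_count" function, so swapping the function name in the aggregate above should be the Arrow equivalent).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Per-key result of a grouped aggregation (hypothetical helper, not an Arrow type).
struct GroupResult {
  std::int64_t sum = 0;    // what a grouped sum over "values" would hold
  std::int64_t count = 0;  // how many rows fell into this group
};

// Groups values[i] by keys[i] and accumulates a sum and a row count per key.
// Assumes both columns have the same length and contain no nulls.
std::map<std::string, GroupResult> GroupByKey(
    const std::vector<std::string>& keys,
    const std::vector<std::int64_t>& values) {
  assert(keys.size() == values.size());  // one key per value, like two table columns
  std::map<std::string, GroupResult> groups;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    GroupResult& group = groups[keys[i]];  // creates a zeroed entry on first sight
    group.sum += values[i];
    group.count += 1;
  }
  return groups;
}
```

Calling `GroupByKey` on the key and value columns of a small table gives one entry per distinct key, which is a handy reference when checking the output of the Arrow plan on the same data.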

Complete working example (tested on a fairly recent version of main): https://gist.github.com/westonpace/be500030cc268a626af60abb9299b9ae

Pace