I'm trying to serialize an Apache Arrow table to a string. My larger example has problems, and I can't get this small one to work either: it segfaults inside Arrow in the `WriteTable` call. The larger example doesn't appear to serialize correctly.

#include <arrow/api.h>
#include <arrow/io/memory.h>
#include <arrow/ipc/api.h>

#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

std::shared_ptr<arrow::Table> makeSimpleFakeArrowTable() {
    std::vector<std::shared_ptr<arrow::Field>> arrowFields;
    arrowFields.emplace_back(std::make_shared<arrow::Field>("Field1", arrow::int64()));
    arrowFields.emplace_back(std::make_shared<arrow::Field>("Field2", arrow::float64()));

    auto schema = std::make_shared<arrow::Schema>(arrowFields);

    std::vector<std::shared_ptr<arrow::Array>> columns(schema->num_fields());

    arrow::Int64Builder longBuilder;
    longBuilder.Append(20);
    longBuilder.Finish(&(columns.at(0)));
    arrow::DoubleBuilder doubleBuilder;
    doubleBuilder.Append(10.0);
    longBuilder.Finish(&(columns.at(1)));

    return arrow::Table::Make(schema, columns);
}

std::shared_ptr<arrow::RecordBatch>
getArrowBatchFromBytes(const std::string& bytes) {
    arrow::io::BufferReader arrowBufferReader{bytes};
    auto streamReader =
        arrow::ipc::RecordBatchStreamReader::Open(&arrowBufferReader).ValueOrDie();

    auto batch = streamReader->Next().ValueOrDie();

    return batch;
}


std::string arrowTableToByteString(const std::shared_ptr<arrow::Table>& table) {
    auto stream = arrow::io::BufferOutputStream::Create().ValueOrDie();
    auto batchWriter = arrow::ipc::MakeStreamWriter(stream, table->schema()).ValueOrDie();

    auto status = batchWriter->WriteTable(*table);
    if (not status.ok()) {
        throw std::runtime_error(
            "Couldn't write Arrow Table to byte string. Arrow status was: '" +
            status.ToString() + "'.");
    }

    std::shared_ptr<arrow::Buffer> buffer = stream->Finish().ValueOrDie();
    return buffer->ToHexString();
}

int main(int argc, char** argv) {
    auto simpleFakeArrowTable = makeSimpleFakeArrowTable();
    std::string tableAsByteString = arrowTableToByteString(simpleFakeArrowTable);

    auto batch = getArrowBatchFromBytes(tableAsByteString);
    assert(batch != nullptr);
}
user2183336
  • `longBuilder.Finish(&(columns.at(0)));` -- It is a code smell to take the address of items in a vector. Does that vector ever get resized? – PaulMcKenzie Oct 19 '21 at 15:22
  • @PaulMcKenzie thanks for asking. It does not get resized in that code. I use the vector ctor that takes a size to initialize it. I understand your point about the smell because of resizing, but I would say the approach is probably still preferable to making a dynamic C-style array on the heap. – user2183336 Oct 19 '21 at 15:50

1 Answer

Two things jump to mind. First, I think this is a typo:

    longBuilder.Finish(&(columns.at(0)));
    arrow::DoubleBuilder doubleBuilder;
    doubleBuilder.Append(10.0);
    longBuilder.Finish(&(columns.at(1))); // Shouldn't this be doubleBuilder?

Whenever you build an Arrow table yourself, it is a good idea to call `arrow::Table::ValidateFull`. It helps catch mistakes like this one (here, the returned status would have reported that the input arrays did not match the schema).

Second, once that is fixed, we still get an error because you return `buffer->ToHexString()`, which turns your array of bytes into a hex string with two characters per byte (e.g. the bytes [10, 20, 30] become the bytes [48, 65, 49, 52, 49, 69], more commonly written as 0A141E).
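To make the effect of hex encoding concrete, here is a small standalone sketch. The helper `toHex` is hypothetical and is not Arrow's implementation; it assumes the usual convention of two uppercase hex characters per input byte, so the encoded text is twice as long as the data and is no longer a valid Arrow stream.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: encodes each byte as two uppercase hex characters.
std::string toHex(const std::vector<std::uint8_t>& bytes) {
    static const char kDigits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (std::uint8_t b : bytes) {
        out.push_back(kDigits[b >> 4]);    // high nibble
        out.push_back(kDigits[b & 0x0F]);  // low nibble
    }
    return out;
}
```

For example, `toHex({10, 20, 30})` yields the six-character string `0A141E`, whose bytes are the ASCII codes of those characters rather than the original three values.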

You then turn around and try to read these hex characters back as a table via `arrow::io::BufferReader arrowBufferReader{bytes};`. If I change `ToHexString` to `ToString`, your example runs and returns 0.

Pace
  • One last minor note. Both the `ArrayBuilder::Append` method and the `ArrayBuilder::Finish` method return a status that you should be checking. – Pace Oct 19 '21 at 22:16
  • yes I think this is spot on. very impressive, thanks! – user2183336 Oct 21 '21 at 13:02