
According to the documentation:

class arrow::StringType : public arrow::BinaryType
#include <arrow/type.h>
Concrete type class for variable-size string data, utf8-encoded.
class arrow::LargeStringType : public arrow::LargeBinaryType
#include <arrow/type.h>
Concrete type class for large variable-size string data, utf8-encoded.

How large is considered to be "large"?

What are the differences between the two data types? Why do we need 2 instead of 1?

1 Answer


String uses signed 32-bit integers for its offsets, so a single string cannot be longer than 2 GiB, and a single array cannot hold more than 2 GiB of string data in total. LargeString uses signed 64-bit integers for its offsets, so it can hold much longer strings and much larger arrays.
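
To make the offset limit concrete, here is a minimal sketch of the layout using only the standard library. It is a toy model, not Arrow's implementation: `ToyStringArray`, `MaxBytes`, `Append`, and `Get` are hypothetical names invented for illustration. It assumes the usual variable-size layout of one contiguous data buffer plus an offsets buffer with one more entry than there are strings, and it only varies the offset type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>
#include <vector>

// Toy model (hypothetical, not Arrow code): all string bytes live in one
// contiguous buffer, and offsets_[i]..offsets_[i + 1] delimits string i.
// OffsetT = std::int32_t mimics String, std::int64_t mimics LargeString.
template <typename OffsetT>
class ToyStringArray {
 public:
  // Largest end offset that OffsetT can represent, i.e. the cap on the
  // total number of data bytes in one array.
  static constexpr std::int64_t MaxBytes() {
    return std::numeric_limits<OffsetT>::max();
  }

  void Append(const std::string& value) {
    const std::int64_t new_end = static_cast<std::int64_t>(data_.size()) +
                                 static_cast<std::int64_t>(value.size());
    if (new_end > MaxBytes()) {
      throw std::overflow_error("offset type cannot address this many bytes");
    }
    data_ += value;
    offsets_.push_back(static_cast<OffsetT>(new_end));
  }

  std::string Get(std::size_t i) const {
    assert(i < length());
    const std::size_t begin = static_cast<std::size_t>(offsets_[i]);
    const std::size_t end = static_cast<std::size_t>(offsets_[i + 1]);
    return data_.substr(begin, end - begin);
  }

  std::size_t length() const { return offsets_.size() - 1; }

 private:
  std::string data_;
  std::vector<OffsetT> offsets_{0};  // always holds length() + 1 entries
};
```

With `std::int32_t` offsets, `MaxBytes()` is 2,147,483,647 (one byte short of 2 GiB), and that cap applies to the sum of all strings in the array, not only to each string. Switching to `std::int64_t` lifts the cap at the cost of 8 bytes per offset instead of 4.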

li.davidm
I just discovered that this information is actually in the documentation, but not at the place I'd expect it to be. https://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow4Type4type12LARGE_STRINGE – NekoApocalypse Jul 25 '22 at 09:12