5

I am aware that "Many Arrow objects are immutable: once constructed, their logical properties cannot change anymore" (docs). In this blog post by one of the Arrow creators it's said

Table columns in Arrow C++ can be chunked, so that appending to a table is a zero copy operation, requiring no non-trivial computation or memory allocation.

However, I am unable to find in the documentation how to append a row to a table. pyarrow.concat_tables(tables, promote=False) does something similar, but it is my understanding that it produces a new Table object, rather than, say, adding chunks to the existing one.

I am unsure if this is operation is at all possible/makes sense (in which case I'd like to know how) or if it doesn't (in which case, pyarrow.concat_tables is exactly what I need).

Similar questions:

astrojuanlu
  • 6,744
  • 8
  • 45
  • 105
  • 1
    This should probably be explained more clearly somewhere but effectively Table is a container of pointers to actual data. So you can concatenate two tables, and you'll just be copying the pointers, not the data itself. Hence ["If promote==False, a zero-copy concatenation will be performed."](https://arrow.apache.org/docs/python/generated/pyarrow.concat_tables.html?highlight=concat_tables#pyarrow.concat_tables) – li.davidm Mar 10 '22 at 18:19

1 Answers1

9

Basically, a Table in PyArrow/Arrow C++ isn't really the data itself, but rather a container consisting of pointers to data. How it works is:

  • A Buffer represents an actual, singular allocation. In other words, Buffers are contiguous, full stop. They may be mutable or immutable.
  • An Array contains 0+ Buffers and imposes some sort of semantics into them. (For instance, an array of integers, or an array of strings.) Arrays are "contiguous" in the sense that each buffer is contiguous, and conceptually the "column" is not "split" across multiple buffers. (This gets really fuzzy with nested arrays: a struct array does split its data across multiple buffers, in some sense! I need to come up with a better wording of this, and will contribute this to upstream docs. But I hope what I mean here is reasonably clear.)
  • A ChunkedArray contains 0+ Arrays. A ChunkedArray is not logically contiguous. It's kinda like a linked list of chunks of data. Two ChunkedArrays can be concatenated "zero copy", i.e. the underlying buffers will not get copied.
  • A Table contains 0+ ChunkedArrays. A Table is a 2D data structure (both columns and rows).
  • A RecordBatch contains 0+ Arrays. A RecordBatch is also a 2D data structure.

Hence, you can concantenate two Tables "zero copy" with pyarrow.concat_tables, by just copying pointers. But you cannot concatenate two RecordBatches "zero copy", because you have to concatenate the Arrays, and then you have to copy data out of buffers.

li.davidm
  • 11,736
  • 4
  • 29
  • 31
  • 1
    Superb explanation! Thanks for spelling out the relationship between `Table -> ChunkedArray -> Array` and `RecordBatch -> Array`. The [Python docs](https://arrow.apache.org/docs/python/data.html) mention all the arrays separately but then `ChunkedArrays` are introduced suddenly when talking about tables, perhaps a better categorization could be done there. – astrojuanlu Mar 11 '22 at 14:33
  • You wrote that _"conceptually the "column" is not "split" across multiple buffers"_ in some sense, but not in another. Would it be correct to say that there is a bijection between elements of the Array and elements of _each_ of the 0+ Buffers? PyArrow documentation is unclear about the relation between an Array and Buffers. It may seem that "Array can hold multiple Buffers" contradicts "Array is contiguous". I guess, there is no contradiction because Buffers represent _different aspects_ of the Array, e.g. one Buffer stores int values, another Buffer stores NaN bitmap. Is this correct? – paperskilltrees Jul 26 '23 at 22:06