
I have a large dictionary that I want to iterate through to build a pyarrow table. The values of the dictionary are tuples of varying types and need to be unpacked and stored in separate columns in the final pyarrow table. I do know the schema ahead of time. The keys also need to be stored as a column. I have a method below to construct the table row by row - is there another method that is faster? For context, I want to parse a large dictionary into a pyarrow table to write out to a parquet file. RAM usage is less of a concern than CPU time. I'd prefer not to drop down to the arrow C++ API.

import pyarrow as pa
import random
import string 
import time

large_dict = dict()

for i in range(int(1e6)):
    large_dict[i] = (random.randint(0, 5), random.choice(string.ascii_letters))


schema = pa.schema({
        "key"  : pa.uint32(),
        "col1" : pa.uint8(),
        "col2" : pa.string()
   })

start = time.time()

tables = []
for key, item in large_dict.items():
    val1, val2 = item
    tables.append(
        pa.Table.from_pydict({
            "key"  : [key],
            "col1" : [val1],
            "col2" : [val2]
        }, schema=schema)
    )

table = pa.concat_tables(tables)
end = time.time()
print(end - start) # 22.6 seconds on my machine

Josh W.

2 Answers


Since the schema is known ahead of time, you can build one list per column and then assemble a dictionary of column-name to column-values pairs to pass to `pa.Table.from_pydict`.

%%timeit -r 10
import pyarrow as pa
import random
import string 
import time

large_dict = dict()

for i in range(int(1e6)):
    large_dict[i] = (random.randint(0, 5), random.choice(string.ascii_letters))


schema = pa.schema({
        "key"  : pa.uint32(),
        "col1" : pa.uint8(),
        "col2" : pa.string()
  })

keys = []
val1 = []
val2 = []
for k, (v1, v2) in large_dict.items():
  keys.append(k)
  val1.append(v1)
  val2.append(v2)

table = pa.Table.from_pydict(
    dict(
        zip(schema.names, (keys, val1, val2))
    ),
    schema=schema
)

2.92 s ± 236 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
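
Since the stated end goal is a parquet file, the resulting table can then be written out with `pyarrow.parquet`; a minimal example follows (the output path is just a placeholder):

import pyarrow.parquet as pq

# Write the assembled table to a parquet file; the file name is arbitrary.
pq.write_table(table, "large_dict.parquet")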

Oluwafemi Sule
  • Thanks, this is faster. Would you expect any benefit from using pyarrow arrays instead of lists? I know the number of elements ahead of time so could pre-allocate. – Josh W. Sep 15 '19 at 01:29
  • That should prove to be similar performance. `pyarrow` is implicitly converting the native python list to an array https://github.com/apache/arrow/blob/c4671b32dfa45d2960d802fbf7099e9eea4a623d/python/pyarrow/table.pxi#L1135 – Oluwafemi Sule Sep 15 '19 at 11:18
  • pyarrow arrays are immutable, so you'll have a hard time appending to them. But you could use `numpy` `ndarray` and that should be faster than python lists. – 0x26res Sep 16 '19 at 09:13
  • If the schema is _not_ known ahead of time, just use `pa.Table.from_pydict()` without a `pa.schema` and it will infer the data types. – Attila the Fun Apr 05 '23 at 12:14
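
Following up on the pre-allocation comments above, a rough sketch of that variant could look like the following. It reuses `large_dict` and `schema` from the answer and uses numpy only for the numeric columns (a fixed-width unicode dtype buys little for single-character strings, so a plain list is kept there); this is an illustration, not the commenters' code.

import numpy as np
import pyarrow as pa

n = len(large_dict)

# Pre-allocate: numpy arrays for the numeric columns, a plain list for the strings.
keys = np.empty(n, dtype=np.uint32)
col1 = np.empty(n, dtype=np.uint8)
col2 = [None] * n

for i, (k, (v1, v2)) in enumerate(large_dict.items()):
    keys[i] = k
    col1[i] = v1
    col2[i] = v2

table = pa.Table.from_arrays(
    [pa.array(keys), pa.array(col1), pa.array(col2, type=pa.string())],
    schema=schema
)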

I have been playing with pyarrow as well. In your code, the data-preparation stage (the random generation, etc.) seems to be the most time-consuming part by itself, so it may be worth first converting the data into a dict of arrays and then feeding those to an Arrow Table.

Here is an example based on your data, %%timeit-ing only the table-population stage, but building the table with RecordBatch.from_arrays() from a list of three arrays.

# get_data(), l0, l1_0 and l2 are defined in the linked notebook;
# get_data() is expected to return the three column arrays for batch i.
I = iter(
    pa.RecordBatch.from_arrays(get_data(l0, l1_0, l2, i), schema=schema)
    for i in range(1000)
)

T1 = pa.Table.from_batches(I, schema=schema)

With a static data set of 1000 rows batched 1000 times, the table is populated in an impressive 15 ms :) (maybe due to caching). With the 1000 rows modified per batch (e.g. col1 multiplied by an integer), it takes 33.3 ms, which also looks nice.
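
Since get_data() and its inputs only exist in the linked notebook, here is a self-contained sketch of the same batching idea applied to the question's data; the make_batch helper and the batch size of 1000 are stand-ins rather than part of the original answer.

import random
import string

import pyarrow as pa

schema = pa.schema({
    "key"  : pa.uint32(),
    "col1" : pa.uint8(),
    "col2" : pa.string()
})

large_dict = {i: (random.randint(0, 5), random.choice(string.ascii_letters))
              for i in range(int(1e6))}

def make_batch(keys, rows):
    # Build one RecordBatch from parallel lists of keys and (val1, val2) tuples.
    val1, val2 = zip(*rows)
    return pa.RecordBatch.from_arrays(
        [pa.array(keys, type=pa.uint32()),
         pa.array(val1, type=pa.uint8()),
         pa.array(val2, type=pa.string())],
        schema=schema
    )

batch_size = 1000
all_keys = list(large_dict)
batches = (
    make_batch(all_keys[i:i + batch_size],
               [large_dict[k] for k in all_keys[i:i + batch_size]])
    for i in range(0, len(all_keys), batch_size)
)

table = pa.Table.from_batches(batches, schema=schema)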

My sample notebook is here

PS. I also wondered whether numba's JIT would help, but it only seems to make the timing worse here.

Dima Fomin