Pandas 2.0 introduces the option to use PyArrow as the backend rather than NumPy. As of version 2.0, using it seems to require either calling one of the pd.read_xxx() methods with dtype_backend='pyarrow', or else constructing a NumPy-backed DataFrame and then calling .convert_dtypes(dtype_backend='pyarrow') on it.

Is there a more direct way to construct a PyArrow-backed DataFrame?

Attila the Fun

1 Answer


If your data are known to be all of a specific type (say, int64[pyarrow]), this is straightforward:

import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': [1, 2, 3, 4]}
df = pd.DataFrame(
    data,
    dtype='int64[pyarrow]',
    # ...
)

If the columns are known to share a single type, but that type is not known in advance, then I don't know of a way to use the constructor. I tried dtype=pd.ArrowDtype, which does not work, and dtype=pd.ArrowDtype(), which requires an argument that I think would have to be a specific dtype.


One option for possibly-mixed and unknown data types is to make a pa.Table (using one of its methods) and then send it to pandas with the types_mapper kwarg. For example, using a dict:

import pandas as pd
import pyarrow as pa

data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}

pa_table = pa.Table.from_pydict(data)
df = pa_table.to_pandas(types_mapper=pd.ArrowDtype)

The last line is exactly what pd.read_parquet with dtype_backend='pyarrow' does under the hood, after reading parquet into a pa.Table. I thought it was worth highlighting the approach since it wouldn't have occurred to me otherwise.

The method pa.Table.from_pydict() will infer the data types. If the data are of mixed but known types and speed is very important, see https://stackoverflow.com/a/57939649 for how to define a schema up front and pass it via the schema argument of pa.Table.from_pydict(), skipping inference.


The above method loses most of the flexibility of the DataFrame constructor (specifying an index, accepting various container types as input, etc.). You might be able to code around this and encapsulate it in a function.

Another workaround, as mentioned in the question, is to just construct a NumPy-backed DataFrame and call .convert_dtypes on it:

import pandas as pd

data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df = pd.DataFrame(
    data,
    index=[4, 5, 6, 7],
    # ...
).convert_dtypes(dtype_backend='pyarrow')
Attila the Fun