If your data are known to be all of a specific type (say, int64[pyarrow]), this is straightforward:
import pandas as pd

data = {'col_1': [3, 2, 1, 0], 'col_2': [1, 2, 3, 4]}
df = pd.DataFrame(
    data,
    dtype='int64[pyarrow]',
    # ...
)
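A quick check (assuming a pandas version with ArrowDtype support and pyarrow installed) confirms both columns get the Arrow-backed dtype:

print(df.dtypes)
# col_1    int64[pyarrow]
# col_2    int64[pyarrow]
# dtype: object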
If your data are known to be all of the same type but the type is not known, then I don't know of a way to use the constructor. I tried dtype=pd.ArrowDtype, which does not work, and dtype=pd.ArrowDtype(), which needs an argument that I think would have to be a specific dtype.
One option for possibly-mixed and unknown data types is to make a pa.Table (using one of its methods) and then send it to pandas with the types_mapper kwarg. For example, using a dict:
import pandas as pd
import pyarrow as pa

data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pa_table = pa.Table.from_pydict(data)
df = pa_table.to_pandas(types_mapper=pd.ArrowDtype)
The last line is exactly what pd.read_parquet with dtype_backend='pyarrow' does under the hood, after reading parquet into a pa.Table. I thought it was worth highlighting the approach since it wouldn't have occurred to me otherwise.
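For comparison, the parquet-reading equivalent is a one-liner (the file path here is just a placeholder):

import pandas as pd

# Assumption: 'data.parquet' is a placeholder for an existing parquet file.
df = pd.read_parquet('data.parquet', dtype_backend='pyarrow')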
The method pa.Table.from_pydict() will infer the data types. If the data are of mixed type, but known, and speed is very important, see https://stackoverflow.com/a/57939649 for how to make a predefined schema to pass to the pa.Table constructor.
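As a minimal sketch of that idea (the schema below just matches this answer's toy data; pick types to fit yours):

import pandas as pd
import pyarrow as pa

data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
# Predefined schema, so from_pydict() skips type inference.
schema = pa.schema([('col_1', pa.int64()), ('col_2', pa.string())])
pa_table = pa.Table.from_pydict(data, schema=schema)
df = pa_table.to_pandas(types_mapper=pd.ArrowDtype)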
The above method loses most of the flexibility of the DataFrame constructor (specifying an index, accepting various container types as input, etc.). You might be able to code around this and encapsulate it in a function, as in the sketch below.
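As a rough sketch of such a wrapper (the name arrow_frame and its signature are hypothetical, not an existing API):

import pandas as pd
import pyarrow as pa

def arrow_frame(data, index=None):
    # Hypothetical helper: infer Arrow dtypes, then restore a bit of the
    # DataFrame constructor's flexibility (here, just the index).
    df = pa.Table.from_pydict(data).to_pandas(types_mapper=pd.ArrowDtype)
    if index is not None:
        df.index = index
    return df

df = arrow_frame({'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']},
                 index=[4, 5, 6, 7])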
Another workaround, as mentioned in the question, is to just construct a NumPy-backed DataFrame and call .convert_dtypes on it:
import pandas as pd

data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df = pd.DataFrame(
    data,
    index=[4, 5, 6, 7],
    # ...
).convert_dtypes(dtype_backend='pyarrow')