Using Python, Parquet, and Spark, I'm running into ArrowNotImplementedError: Support for codec 'snappy' not built after upgrading to pyarrow=3.0.0. My previous version, pyarrow=0.17, did not produce this error. The error does not appear in pyarrow=1.0.1 but does appear in pyarrow=2.0.0. The goal is to write a pandas DataFrame as a partitioned Parquet dataset (on Windows) using Snappy compression, and later to process that dataset with Spark.
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Small example frame; 'x' will serve as the partition column.
df = pd.DataFrame({
    'x': [0, 0, 0, 1, 1, 1],
    'a': np.random.random(6),
    'b': np.random.random(6)})

table = pa.Table.from_pandas(df, preserve_index=False)

# write_to_dataset compresses with Snappy by default; this call raises
# the ArrowNotImplementedError under pyarrow=3.0.0.
pq.write_to_dataset(table, root_path=r'c:/data',
                    partition_cols=['x'], flavor='spark')
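
For completeness, here is a quick check I would expect to confirm the diagnosis, continuing from the snippet above (a sketch on my part, assuming pyarrow's Codec API behaves as documented; the gzip fallback is only an illustration to isolate the problem to the Snappy codec, not part of my actual pipeline):

import pyarrow as pa
import pyarrow.parquet as pq

# Report whether this pyarrow build was compiled with Snappy support;
# on the failing install I would expect this to print False.
print(pa.Codec.is_available('snappy'))

# Writing with an explicitly different codec should succeed if only the
# Snappy codec is missing from the build (reuses 'table' from above).
pq.write_to_dataset(table, root_path=r'c:/data',
                    partition_cols=['x'], flavor='spark',
                    compression='gzip')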