15

Using Python, Parquet, and Spark, I'm running into `ArrowNotImplementedError: Support for codec 'snappy' not built` after upgrading to `pyarrow=3.0.0`. My previous version without this error was `pyarrow=0.17`. The error does not appear with `pyarrow=1.0.1` but does appear with `pyarrow=2.0.0`. The idea is to write a pandas DataFrame as a Parquet dataset (on Windows) using Snappy compression, and later to process the Parquet dataset with Spark.

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    'x': [0, 0, 0, 1, 1, 1], 
    'a': np.random.random(6), 
    'b': np.random.random(6)})
table = pa.Table.from_pandas(df, preserve_index=False)
# the write below raises ArrowNotImplementedError: Support for codec 'snappy' not built
pq.write_to_dataset(table, root_path=r'c:/data', partition_cols=['x'], flavor='spark')

(screenshot of the ArrowNotImplementedError traceback)
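
For context, the downstream step would look roughly like this (just a sketch, assuming a local pyspark session; the failure happens in the write above, not here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()
# read the partitioned dataset back; Spark discovers the x=0 / x=1 directories
df_spark = spark.read.parquet(r'c:/data')
df_spark.show()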

Russell Burdt
  • How did you install `pyarrow`? – Uwe L. Korn Feb 03 '21 at 11:01
  • `pyarrow` was installed via `conda install pyarrow` – Russell Burdt Feb 03 '21 at 17:32
  • I've been unable to reproduce using Windows python 3.8/3.9 and pypi and conda-forge builds. As Uwe mentioned elsewhere, snappy should be built into the pyarrow dist on conda. Can you add the output of `conda list --export` and `print(pa.cpp_build_info)` and `pa.show_versions()`? – Pace Feb 03 '21 at 19:05
  • $ python Python 3.9.1 (default, Dec 11 2020, 09:29:25) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32 Type "help", "copyright", "credits" or "license" for more information. >>> import pyarrow as pa >>> print(pa.cpp_build_info) BuildInfo(version='3.0.0', version_info=VersionInfo(major=3, minor=0, patch=0), so_version='300', full_so_version='300.0.0', compiler_id='MSVC', compiler_version='19.16.27043.0', compiler_flags=' -D_WIN32_WINNT=0x600 /D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING ', git_id='', git_description='', package_kind='') – Russell Burdt Feb 03 '21 at 20:00
  • >>> pa.show_versions() pyarrow version info -------------------- Package kind: not indicated Arrow C++ library version: 3.0.0 Arrow C++ compiler: MSVC 19.16.27043.0 Arrow C++ compiler flags: -D_WIN32_WINNT=0x600 /D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING Arrow C++ git revision: Arrow C++ git description: >>> – Russell Burdt Feb 03 '21 at 20:00
  • 3
    Your `pyarrow` is not from conda-forge. It shows up in `conda list` as `pyarrow=3.0.0=pypi_0` which I thought meant it came from pypi. However, your `cpp_build_info` does not match what comes from the PYPI distribution either (both conda-forge and pypi use MSVC version 19.16.27045.0). Uninstall pyarrow and reinstall, ensuring you are installing from conda-forge... `conda install -c conda-forge pyarrow` – Pace Feb 03 '21 at 20:26
  • OK that works, using `conda install -c conda-forge pyarrow` instead of `conda install pyarrow`. If you provide this as an answer I can accept. But why is it like this? Because in both cases it shows up as `pyarrow=3.0.0` so this would not be the expected behavior. – Russell Burdt Feb 03 '21 at 20:36
  • Added an answer with my best conjecture. Unfortunately, I'm just not sure. Since `pypi_0` just means "not conda" then it really could have come from anywhere. – Pace Feb 03 '21 at 20:57
  • 2
    Added this to the upstream Anaconda issue https://github.com/AnacondaRecipes/pyarrow-feedstock/issues/2 – Uwe L. Korn Feb 08 '21 at 14:23

5 Answers

13

Something is wrong with the `conda install pyarrow` method. I removed it with `conda remove pyarrow` and after that installed it with `pip install pyarrow`. This ended up working.
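
A quick way to confirm the reinstalled build actually includes Snappy support (a minimal sketch; `pa.Codec.is_available` exists in recent pyarrow releases, adjust if your version lacks it):

import pyarrow as pa

# build details of the installed package (compiler, flags, package kind)
print(pa.cpp_build_info)

# True only if the Snappy codec was compiled into this build
print(pa.Codec.is_available('snappy'))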

Michel K
  • This worked for me as well. Quick and easy fix. Setup: Windows 10 x64 with Python 3.8. Everything worked fine locally, but for some reason I got this issue when connecting to a remote Windows 10 x64 box via SSH to a Windows prompt, even when the remote path was 100% identical. – Contango Apr 29 '21 at 07:54
  • Thanks. I see, as of this date, conda installs 3.0 but pip install 4.0. – BSalita May 18 '21 at 16:39
  • This worked for me too (with Windows 10), but in my case I was getting the same error while exactly following the instructions for [Spyder's scientific-computing demo](https://docs.spyder-ide.org/current/workshops/scientific-computing.html). Thank you for posting this. – r.e.s. Jan 25 '22 at 02:22
10

The pyarrow package you had installed did not come from conda-forge and it does not appear to match the package on PYPI. I did a bit more research and pypi_0 just means the package was installed via pip. It does not mean it actually came from PYPI.

I'm not really sure how this happened. You could maybe check your conda log (envs/YOUR-ENV/conda-meta/history) but, given that this was installed external from conda, I'm not sure there will be any meaningful information in there. Perhaps you tried to install Arrow after the version was bumped to 3 and before the wheels were uploaded and so your system fell back to building from source?
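
If you want to look at that history without hunting for the path, something like this could work (a rough sketch; it just prints the tail of the active environment's conda-meta/history file, assuming you run it from inside that environment):

import sys
from pathlib import Path

# conda environments keep an install/remove log at <env>/conda-meta/history
history = Path(sys.prefix) / 'conda-meta' / 'history'
if history.exists():
    print(history.read_text()[-2000:])  # last few transactions, newest at the end
else:
    print('no conda-meta/history found; probably not a conda environment')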

Pace
  • Don't think the reason is unfortunate timing because it was replicated on different days. Will add that installing from `conda-forge` may be preferable to installing from the default `conda` channel, generally. That issue has been [discussed widely on SO](https://stackoverflow.com/questions/39857289/should-conda-or-conda-forge-be-used-for-python-environments). – Russell Burdt Feb 03 '21 at 21:36
  • Ok I managed to get it to work by doing a `pip install pyarrow` from the Conda prompt. Conda install and conda-forge install did not work. – Reddspark Dec 09 '21 at 19:00
1

I had the exact same issue. I did a fresh install of Anaconda (Python 3.8), then ran `conda install -c conda-forge pyarrow` from this link: https://anaconda.org/conda-forge/pyarrow. The install chokes: it fails with a frozen/flexible solve and conda keeps trying different variants until it finally installs. You can then import pyarrow. But then, when you try to open a Parquet file, you get the 'snappy' codec error that is the subject of this thread.

I then did `conda remove pyarrow` so I was back to a clean install. Then `pip install pyarrow`, and I could successfully load the Parquet file.

clg4
0

I managed to get it to work by doing a `pip install pyarrow` from the Conda prompt.

Reddspark
-1

I'm not 100% sure, but it could be because since version 1.0.0 the default Arrow build has been slimmed down and Snappy became an optional component.

I think you would have to rebuild Arrow with `-DARROW_WITH_SNAPPY=ON`, but this can be quite difficult and tedious to get working.

Another option would be to disable snappy:

# write the dataset uncompressed instead of with the default Snappy codec
pq.write_to_dataset(table, root_path=r'c:/data', partition_cols=['x'], flavor='spark', compression="NONE")

0x26res
  • `pyarrow` was slimmed down a bit but the default builds of the Python packages should still include most of the features, especially the Snappy compression as this is the default / best choice for Parquet files. – Uwe L. Korn Feb 03 '21 at 11:02
  • even on Windows? – 0x26res Feb 03 '21 at 11:25
  • The error does appear on Windows with `pyarrow=3.0.0` and with `pyarrow=2.0.0`. The error does not appear on Windows with `pyarrow=1.0.1` and with `pyarrow=0.17`. Reading the [release notes](https://arrow.apache.org/blog/2020/10/22/2.0.0-release/) for `pyarrow=2.0.0` I do not see anything referencing Snappy compression, so this may be a bug. – Russell Burdt Feb 03 '21 at 18:46
  • 1
    @0x26res Snappy compression provides useful benefits and I do not view disabling as a solution in this case. – Russell Burdt Feb 03 '21 at 18:50