
I have installed the following modules on my EC2 server, which already has Python (3.6) and Anaconda installed:

  • snappy
  • pyarrow
  • s3fs
  • fastparquet

Everything except fastparquet imports fine. When I try to import fastparquet, it throws the following error:

    [username@ip8 ~]$ conda -V
    conda 4.2.13
    [username@ip-~]$ python
    Python 3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 12:22:00)
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import fastparquet
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/username/anaconda3/lib/python3.6/site-packages/fastparquet/__init__.py", line 15, in <module>
        from .core import read_thrift
      File "/home/username/anaconda3/lib/python3.6/site-packages/fastparquet/core.py", line 11, in <module>
        from .compression import decompress_data
      File "/home/username/anaconda3/lib/python3.6/site-packages/fastparquet/compression.py", line 43, in <module>
        compressions['SNAPPY'] = snappy.compress
    AttributeError: module 'snappy' has no attribute 'compress'

How do I go about fixing this?

stormfield

1 Answer


Unfortunately, there are multiple things in Python-land called "snappy". I believe you may have the wrong one; if so, one of the following conda commands should solve this for you:

conda install python-snappy

or

conda install python-snappy -c conda-forge

where the latter is slightly more recent (it releases the GIL, which can be important in threaded applications).
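To check which "snappy" is actually on your import path, a quick diagnostic along these lines can help. This is only a sketch; the helper names are mine, not part of any library:

```python
import importlib

def looks_like_python_snappy(mod):
    # python-snappy (the one fastparquet needs) exposes compress/decompress;
    # the unrelated SnapPy topology package does not.
    return hasattr(mod, "compress") and hasattr(mod, "decompress")

def diagnose_snappy():
    """Report which snappy module is importable and whether it is the right one."""
    try:
        snappy = importlib.import_module("snappy")
    except ImportError:
        return "snappy is not installed at all"
    if looks_like_python_snappy(snappy):
        return "OK: importing python-snappy from %s" % snappy.__file__
    # Wrong package shadowing the name -- remove it with the tool that
    # installed it (pip uninstall snappy, or conda remove).
    return "wrong snappy imported from %s" % snappy.__file__
```

Printing `snappy.__file__` (as in `diagnose_snappy` above) also tells you which installation is shadowing the name if both a pip- and a conda-installed package are present.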

mdurant
  • As per your recommendation I installed it via `conda install -c conda-forge python-snappy=0.5.1`, but I still get the same error when I try to import fastparquet @mdurant – stormfield Jun 02 '17 at 08:41
  • Would I need to uninstall the existing 'snappy' package and re-install fastparquet again? From the [source code](https://github.com/dask/fastparquet/blob/master/fastparquet/compression.py) of fastparquet it seems to be importing snappy itself directly. – stormfield Jun 02 '17 at 08:54
  • I tried removing snappy using `conda remove snappy`. It removed `python-snappy: 0.5.1-py36_0 conda-forge` and `snappy: 1.1.4-1 conda-forge`. After which I tried installing python-snappy with `conda install -c conda-forge python-snappy=0.5.1`, which installed the same two packages. But I am still getting the same error when I import fastparquet @mdurant – stormfield Jun 02 '17 at 09:06
  • Can you do `import snappy; print(snappy.__file__)`? This will show you where you are importing from, which I am assuming is some other "snappy" that you can probably remove. – mdurant Jun 02 '17 at 12:45
  • It says `>>> print(snappy.__file__) /home/my_username/anaconda3/lib/python3.6/site-packages/snappy/__init__.py` @mdurant – stormfield Jun 02 '17 at 13:55
  • OK, found out the issue. Your guidance led me onto the right track. Someone had installed [SnapPy](https://www.math.uic.edu/t3m/SnapPy/installing.html) using pip instead of conda, and that was creating the whole confusion. I removed it using `pip uninstall`. Now I am able to import fastparquet without error. Thank you :) – stormfield Jun 02 '17 at 14:38
  • That's absolutely fine. I have one question though: does fastparquet currently support partition discovery while trying to read from S3? E.g. like in Spark - http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery – stormfield Jul 14 '17 at 09:40
  • @stormfield: yes it does, both reading and writing - they will correspond to pandas categoricals. You will have much better performance for reading such datasets if you have a `_metadata` file in the top directory, rather than having to scan all the files first. http://fastparquet.readthedocs.io/en/latest/details.html#partitions-and-row-groups ; http://fastparquet.readthedocs.io/en/latest/filesystems.html – mdurant Jul 14 '17 at 14:01
  • A `_metadata` file is not an option for me at the moment since the data is given by another system. But based on [this link](https://github.com/dask/fastparquet/pull/95) I tried to read the dataframe from S3 using s3fs. My data is in the following format: `root_dir_in_s3/myset.parquet/unique_id=500/date=2017-03-16/part.0.parquet root_dir_in_s3/myset.parquet/unique_id=600/date=2017-02-16/part.0.parquet` – stormfield Jul 17 '17 at 09:47
  • It threw token errors when I tried the following code: `>>> fp_obj = fp.ParquetFile(list_parquet_files,open_with=myopen)` [error link](https://pastebin.com/X3GMGtmA). Then I tried removing the hyphens from the dates, and it worked when I created the pandas dataframe. But I am still not getting the **unique id** column in the final pandas DF. Is there a way we can keep the hyphens in the dates and get the missing column in the final df? @mdurant – stormfield Jul 17 '17 at 09:54
  • Would you mind trying with the latest master version? I believe this is something we have fixed since the last release. – mdurant Jul 17 '17 at 13:02
  • I removed the old version and tried with the latest from conda-forge. It worked like a charm! Thank you so much. You just saved me a tonne of trouble!! – stormfield Jul 17 '17 at 14:17
  • A pleasure to help – mdurant Jul 17 '17 at 16:59
  • I think fastparquet is not able to retrieve **unique id** when we only have one partition for it, i.e. say I only had one value for unique_id in the list of files `root_dir_in_s3/myset.parquet/unique_id=500/date=2017-03-16/part.0.parquet` `root_dir_in_s3/myset.parquet/unique_id=500/date=2017-03-25/part.1.parquet` – stormfield Jul 19 '17 at 15:26
  • You have a point - fastparquet will assume that the base directory is `root_dir_in_s3/myset.parquet/unique_id=500/` and not analyze the parent path. This is, to my mind, reasonable if there is no `_metadata` in `myset.parquet`. – mdurant Jul 19 '17 at 16:47
  • Can you suggest a workaround for the same in the absence of the `_metadata` file? I have seen Spark handle this scenario pretty well without issues; is there a way we can raise a bug report and get this resolved? Small correction: the directory structure is like `root_dir_in_s3/my_table/unique_id=500/date=2017-03-16/part.0.parquet root_dir_in_s3/my_table/unique_id=500/date=2017-03-25/part.1.parquet` – stormfield Jul 19 '17 at 16:54
  • For issues: https://github.com/dask/fastparquet/issues ; and it's a good idea, although I am the main developer who might solve this :) I think it would require an optional "root_path" to `ParquetFile`. – mdurant Jul 19 '17 at 17:37
  • Yes, I kind of figured out you are :) . I have raised an issue [ISSUE#182](https://github.com/dask/fastparquet/issues/182) for the same. – stormfield Jul 19 '17 at 18:02
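Until the single-valued-partition case discussed above is handled upstream, the `key=value` partition values can be recovered from the Hive-style paths by hand and added back to the pandas DataFrame as columns. A minimal sketch, assuming you know the dataset root yourself (the helper name and paths are illustrative, not part of fastparquet):

```python
import re

def partition_values(path, root):
    """Extract key=value partition segments from a Hive-style path.

    The caller supplies the dataset root explicitly, which sidesteps
    fastparquet's inference of the base directory when a partition
    column has only one value.
    """
    rel = path[len(root):].lstrip("/")
    parts = {}
    # All segments except the final file name may be key=value partitions.
    for segment in rel.split("/")[:-1]:
        m = re.match(r"([^=/]+)=(.*)", segment)
        if m:
            parts[m.group(1)] = m.group(2)
    return parts
```

For example, `partition_values("root_dir_in_s3/my_table/unique_id=500/date=2017-03-16/part.0.parquet", "root_dir_in_s3/my_table")` yields both `unique_id` and `date`, even when every file shares the same `unique_id`.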