
I am trying to open Parquet files in Azure Jupyter Notebooks. I have tried both Python kernels (2 and 3). After installing pyarrow, I can import the module only with the Python 2 kernel; the import fails with Python 3.

Here is what I've done so far (for clarity, I am not mentioning all my various attempts, such as using conda instead of pip, as it also failed):

    !pip install --upgrade pip
    !pip install -I Cython==0.28.5
    !pip install pyarrow

    import pandas
    import pyarrow
    import pyarrow.parquet

    # so far, so good

    filePath_parquet = "foo.parquet"
    table_parquet_raw = pandas.read_parquet(filePath_parquet, engine='pyarrow')

This works when I run it locally (Spyder, Python 3.7.0), but it fails in an Azure Notebook:

    AttributeErrorTraceback (most recent call last)
    <ipython-input-54-2739da3f2d20> in <module>()
          6 
          7 #table_parquet_raw = pd.read_parquet(filePath_parquet, engine='pyarrow')
    ----> 8 table_parquet_raw = pandas.read_parquet(filePath_parquet, engine='pyarrow')

    AttributeError: 'module' object has no attribute 'read_parquet'

Any ideas?

Thank you in advance!

EDIT:

Thank you very much for your reply, Peter Pan! I ran these statements, and here is what I got:

1.

    print(pandas.__dict__)

=> read_parquet does not appear

2.

    print(pandas.__file__)

=> I get:

    /home/nbuser/anaconda3_23/lib/python3.4/site-packages/pandas/__init__.py

3.

    import sys; print(sys.path)

=> I get:

    ['', '/home/nbuser/anaconda3_23/lib/python34.zip',
    '/home/nbuser/anaconda3_23/lib/python3.4',
    '/home/nbuser/anaconda3_23/lib/python3.4/plat-linux',
    '/home/nbuser/anaconda3_23/lib/python3.4/lib-dynload',
    '/home/nbuser/.local/lib/python3.4/site-packages',
    '/home/nbuser/anaconda3_23/lib/python3.4/site-packages',
    '/home/nbuser/anaconda3_23/lib/python3.4/site-packages/Sphinx-1.3.1-py3.4.egg',
    '/home/nbuser/anaconda3_23/lib/python3.4/site-packages/setuptools-27.2.0-py3.4.egg',
    '/home/nbuser/anaconda3_23/lib/python3.4/site-packages/IPython/extensions',
    '/home/nbuser/.ipython']
    

Do you have any idea what is going wrong?

EDIT 2:

Dear @PeterPan, I ran both !conda update conda and !conda update pandas. When I check the pandas version (pandas.__version__), it is still 0.19.2.

I also tried !conda update pandas -y -f, which returns:

    Fetching package metadata ...........
    Solving package specifications: .

    Package plan for installation in environment /home/nbuser/anaconda3_23:

    The following NEW packages will be INSTALLED:

    pandas: 0.19.2-np111py34_1

When typing: !pip install --upgrade pandas

I get:

    Requirement already up-to-date: pandas in /home/nbuser/anaconda3_23/lib/python3.4/site-packages
    Requirement already up-to-date: pytz>=2011k in /home/nbuser/anaconda3_23/lib/python3.4/site-packages (from pandas)
    Requirement already up-to-date: numpy>=1.9.0 in /home/nbuser/anaconda3_23/lib/python3.4/site-packages (from pandas)
    Requirement already up-to-date: python-dateutil>=2 in /home/nbuser/anaconda3_23/lib/python3.4/site-packages (from pandas)
    Requirement already up-to-date: six>=1.5 in /home/nbuser/anaconda3_23/lib/python3.4/site-packages (from python-dateutil>=2->pandas)

Finally, when typing:

!pip install --upgrade pandas==0.24.0

I get:

    Collecting pandas==0.24.0
    Could not find a version that satisfies the requirement pandas==0.24.0 (from versions: 0.1, 0.2b0, 0.2b1, 0.2, 0.3.0b0, 0.3.0b2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0rc1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0rc1, 0.8.0rc2, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0rc1, 0.19.0, 0.19.1, 0.19.2, 0.20.0rc1, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0rc1, 0.21.0, 0.21.1, 0.22.0)
    No matching distribution found for pandas==0.24.0

My guess, therefore, is that the problem comes from the way packages are managed in Azure. Updating a package (here pandas) should bring it to the latest available version, shouldn't it?

Peter Pan
Menas

1 Answer

I tried to reproduce your issue in my own Azure Jupyter Notebook but could not. Everything worked for me, even without your two steps !pip install --upgrade pip and !pip install -I Cython==0.28.5, which I don't think matter here.

Please run the following checks to verify that the pandas package you are importing is the right one.

  1. Run print(pandas.__dict__) and check whether read_parquet appears in the output.
  2. Run print(pandas.__file__) to check whether you imported a different pandas package than expected.
  3. Run import sys; print(sys.path) and check, in order, whether any of these paths contains a file or directory with the same name.
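The three checks can also be bundled into one small helper. This is only a sketch: describe_module is a hypothetical name, not part of pandas or the notebook service, and it relies on nothing beyond the standard library.

```python
import importlib
import sys


def describe_module(name, attribute):
    """Import `name` and report where it was loaded from and whether it has `attribute`."""
    module = importlib.import_module(name)
    return {
        "file": getattr(module, "__file__", None),        # which copy was imported
        "version": getattr(module, "__version__", None),  # None if the module has no version
        "has_attribute": hasattr(module, attribute),       # e.g. is read_parquet present?
    }


if __name__ == "__main__":
    try:
        print(describe_module("pandas", "read_parquet"))
    except ImportError:
        print("pandas is not installed in this kernel")
    # The import system searches these directories in this order.
    for path in sys.path:
        print(path)
```

If has_attribute is False while file points at the expected site-packages directory, the installed version is probably too old rather than shadowed.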

If there is another file or directory named pandas, rename it, restart your ipynb, and run it again. This is a common issue; see these SO threads: AttributeError: 'module' object has no attribute 'reader' and Importing installed package from script raises "AttributeError: module has no attribute" or "ImportError: cannot import name".
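To look for such a same-named file or directory without reading sys.path by eye, something like the following could be used. find_shadowing_candidates is a hypothetical helper written for this answer, and it only checks for a directory called name or a file called name.py.

```python
import os
import sys


def find_shadowing_candidates(name, search_paths):
    """Return every path under `search_paths` that Python could import as `name`."""
    matches = []
    for base in search_paths:
        base = base or os.getcwd()  # '' on sys.path means the current directory
        for candidate in (os.path.join(base, name), os.path.join(base, name + ".py")):
            if os.path.exists(candidate):
                matches.append(candidate)
    return matches


if __name__ == "__main__":
    # Results follow sys.path order, so the first hit is the one most likely imported.
    for hit in find_shadowing_candidates("pandas", sys.path):
        print(hit)
```

A hit that is listed before the site-packages copy of pandas is the one to rename.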

Otherwise, please update your post with more details and let me know.


The latest pandas version should be 0.23.4, not 0.24.0.

To find the earliest pandas version that supports read_parquet, I searched for the function name in the documentation of each version from 0.19.2 to 0.23.3. I found read_parquet documented from version 0.21.1 onward (the function itself was introduced in the 0.21 series).

[Screenshot: the new features listed in the What's New page for version 0.21.1]

According to your EDIT 2, you seem to be using Python 3.4 in your Azure Jupyter Notebook. Not all pandas versions support Python 3.4.

Versions 0.21.1 and 0.22.0 officially support Python 2.7, 3.5, and 3.6.

[Screenshot: supported Python versions for pandas 0.21.1 and 0.22.0]

The PyPI page for pandas states the same Python version requirement.

[Screenshot: Python version requirement on the pandas PyPI page]

So you can try installing pandas 0.21.1 or 0.22.0 in the current Python 3.4 notebook. If that fails, create a new notebook with Python 2.7 or >=3.5 and install pandas >= 0.21.1 there to use read_parquet.
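As a rough sketch, the rule above can be written as a check to run before calling read_parquet. The thresholds (pandas >= 0.21.1, Python 2.7 or >= 3.5) are simply the ones stated in this answer, and version_tuple and can_use_read_parquet are hypothetical helper names.

```python
import sys


def version_tuple(text):
    """Turn '0.19.2' into (0, 19, 2), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def can_use_read_parquet(pandas_version, python_version):
    """Apply the thresholds from this answer; they are assumptions, not an official matrix."""
    pandas_ok = version_tuple(pandas_version) >= (0, 21, 1)
    py = tuple(python_version[:2])
    python_ok = py == (2, 7) or py >= (3, 5)
    return pandas_ok and python_ok


if __name__ == "__main__":
    # In the notebook, pass pandas.__version__ instead of the literal string.
    print(can_use_read_parquet("0.19.2", sys.version_info))
```

With the values from the question (pandas 0.19.2 on Python 3.4) the check returns False; with pandas 0.22.0 on Python 2.7 it returns True.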

Peter Pan
  • Thank you very much for your answer @PeterPan! I have edited my question. Could you please take a look? Best regards – Menas Jan 07 '19 at 09:51
  • @Menas Try printing your pandas version via `pandas.__version__`. On my local machine, I installed `pandas` via `conda install pandas` in my miniconda environment, and the version is `0.23.4`, which has the function `read_parquet`. You can try to update your conda or pandas via `!conda update ` in IPython. – Peter Pan Jan 07 '19 at 11:16
  • Dear @PeterPan, when you have time, could you have a look at my latest update please? (Or anyone else who knows how to solve this issue :-) – Menas Jan 09 '19 at 15:35
  • @Menas Please see my updated answer. After researching your information, I suggest you try installing pandas `0.21.1` or `0.22.0` in the current Python `3.4` notebook. If that fails, create a new notebook with Python `2.7` or `>=3.5` and install pandas `>= 0.21.1` to use the function read_parquet. – Peter Pan Jan 11 '19 at 07:02
  • Dear @PeterPan, thank you very much, the problem is now solved! I switched the kernel to Python 2.7 and updated pandas to 0.22.0. After installing PyArrow, everything works like a charm! – Menas Jan 14 '19 at 14:33