
I am running a Notebook instance from the AI Platform on an E2 high-memory VM with 4 vCPUs and 32 GB of RAM.

I need to read a partitioned Parquet file of about 1.8 GB from Google Cloud Storage using pandas.

It needs to be completely loaded into RAM, so I can't rely on Dask's lazy computation for it. Nonetheless, I tried loading it through Dask and ran into the same problem.
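For reference, the Dask attempt was along these lines (a rough sketch, not the exact code; it assumes dask[dataframe] and gcsfs are installed and uses the same placeholder path as below):

import dask.dataframe as dd

# Rough sketch of the Dask route: reading is lazy, but .compute()
# materializes the whole DataFrame in RAM, which is where the same problem shows up.
ddf = dd.read_parquet("gs://path_to_file/file.parquet", engine="pyarrow")
df = ddf.compute()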

When I download the file locally in the VM, I can read it with pd.read_parquet. The RAM consumption goes up to about 13 GB and then down to 6 GB once the file is loaded. It works.

df = pd.read_parquet("../data/file.parquet",
                    engine="pyarrow")
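One way to do that local download from Python is with gcsfs (a rough sketch, assuming the same bucket path and local path as above; gcsfs is also what pandas uses under the hood for gs:// paths):

import gcsfs

# Sketch: copy the partitioned Parquet data from GCS to local disk first.
fs = gcsfs.GCSFileSystem()
fs.get("gs://path_to_file/file.parquet", "../data/file.parquet", recursive=True)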

When I try to load it directly from Google Cloud Storage, the RAM goes up to about 13 GB and then the kernel dies. No logs, warnings or errors are raised.

df = pd.read_parquet("gs://path_to_file/file.parquet",
                    engine="pyarrow")

Some info on the package versions:

Python 3.7.8
pandas==1.1.1
pyarrow==1.0.1

What could be causing it?

bpbutti
  • Just to have better context, are you using BigQuery too? There is an extensive and detailed document on loading Parquet files from Cloud Storage: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#python (see the sketch after these comments) – Harif Velarde Sep 08 '20 at 20:32
  • Is dask anyhow involved here? – rpanai Sep 08 '20 at 20:56
  • @HarifVelarde No, we are not using it. I know we could, but I think right now it would add an unnecessary step. And I am really intrigued as to why it is not working on GCP. The same procedure over AWS and S3 works fine. – bpbutti Sep 08 '20 at 21:16
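For reference, a minimal sketch of the BigQuery route from the document linked in the first comment might look like this (the table ID and URI are placeholders, and it assumes the google-cloud-bigquery client is installed):

from google.cloud import bigquery

# Sketch of the Cloud Storage -> BigQuery load described in the linked guide.
client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

load_job = client.load_table_from_uri(
    "gs://path_to_file/file.parquet", table_id, job_config=job_config
)
load_job.result()  # wait for the load job to finish

print("Loaded {} rows.".format(client.get_table(table_id).num_rows))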

2 Answers


I found a thread that explains how to execute this task in a different way.

For your scenario, using the gcsfs library is a good option, for example:

import pyarrow.parquet as pq
import gcsfs

# Placeholders: substitute your own project name and bucket path.
fs = gcsfs.GCSFileSystem(project="my-project-name")

f = fs.open('my_bucket/path.parquet')
myschema = pq.ParquetFile(f).schema

print(myschema)
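The snippet above only inspects the schema. To actually load the data into pandas through the same filesystem object, something along these lines should work (a sketch; the project name and bucket path are again placeholders):

import pyarrow.parquet as pq
import gcsfs

# Sketch: read the whole Parquet dataset into pandas through gcsfs.
fs = gcsfs.GCSFileSystem(project="my-project-name")
table = pq.ParquetDataset("my_bucket/path.parquet", filesystem=fs).read()
df = table.to_pandas()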

If you want to know more about this service, take a look at this document.

Harif Velarde

The problem was caused by a deprecated image version on the VM.

According to GCP support, you can check whether the image is deprecated as follows:

  1. Go to GCE and click on “VM instances”.
  2. Click on the “VM instance” in question.
  3. Look for the section “Boot disk” and click on the Image link.
  4. If the image has been Deprecated, there will be a field showing it.

(Screenshot: the boot disk image details, showing the “Deprecated” field.)

The solution is to create a new Notebook instance and export/import whatever you want to keep. That way the new VM will have an updated image, which hopefully includes a fix for the problem.

bpbutti