
I'm trying to deploy a training script on Google Cloud ML. Of course, I've uploaded my datasets (CSV files) to a bucket on GCS.

I used to import my data with read_csv from pandas, but it doesn't seem to work with a GCS path.

How should I proceed (I would like to keep using pandas)?

import pandas as pd
data = pd.read_csv("gs://bucket/folder/file.csv")

Output:

ERROR 2018-02-01 18:43:34 +0100 master-replica-0 IOError: File gs://bucket/folder/file.csv does not exist
– Tlokuus

3 Answers


You will need to use file_io from tensorflow.python.lib.io to do that, as demonstrated below:

from tensorflow.python.lib.io import file_io
from io import StringIO  # pandas.compat.StringIO was removed in newer pandas versions
from pandas import read_csv

# Read a CSV file from Google Cloud Storage into a pandas DataFrame
def read_data(gcs_path):
    file_stream = file_io.FileIO(gcs_path, mode='r')
    csv_data = read_csv(StringIO(file_stream.read()))
    return csv_data

Now call the above function:

gcs_path = 'gs://bucket/folder/file.csv'  # change the path to match your bucket, folder and file
df = read_data(gcs_path)
# print(df.head())  # displays the first 5 rows (including the header row) by default
– Hafizur Rahman

Pandas does not have native GCS support. There are two alternatives:

1. Copy the file to the VM using the gsutil CLI.
2. Use the TensorFlow file_io library to open the file, and pass the file object to pd.read_csv().

Please refer to the detailed answer here.
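A minimal sketch of the first option, assuming the gsutil CLI is installed and authenticated on the machine, and using a placeholder local path (/tmp/file.csv):

import subprocess
import pandas as pd

# Copy the CSV from GCS to the local filesystem (placeholder paths)
subprocess.check_call(['gsutil', 'cp', 'gs://bucket/folder/file.csv', '/tmp/file.csv'])

# The file is now local, so pandas can read it directly
df = pd.read_csv('/tmp/file.csv')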

– Guoqing Xu

You could also use Dask to extract the data and then load it into, say, a Jupyter Notebook running on GCP.

Make sure you have Dask installed.

conda install dask                # conda
pip install "dask[complete]"      # pip

import dask.dataframe as dd  # import Dask

dataframe = dd.read_csv('gs://bucket/datafile.csv')  # read a single CSV file
dataframe2 = dd.read_csv('gs://bucket/path/*.csv')   # read multiple CSV files matching a glob pattern

This is all you need to load the data.

You can now filter and manipulate the data with pandas syntax.

dataframe['z'] = dataframe.x + dataframe.y  # lazy column computation in Dask

dataframe_pd = dataframe.compute()          # materialize the result as a pandas DataFrame
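As a brief follow-up (reusing the same hypothetical x, y and z columns from above): .compute() materializes the lazy Dask graph into an ordinary in-memory pandas DataFrame, so the full pandas API is available from that point on.

print(type(dataframe_pd))        # <class 'pandas.core.frame.DataFrame'>
print(dataframe_pd['z'].head())  # regular pandas operations from here on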