52

Hello and thanks for your time and consideration. I am developing a Jupyter Notebook in Google Cloud Platform / Datalab. I have created a Pandas DataFrame and would like to write this DataFrame to Google Cloud Storage (GCS) and/or BigQuery. I have a bucket in GCS and have, via the following code, created the following objects:

import gcp
import gcp.storage as storage
project = gcp.Context.default().project_id    
bucket_name = 'steve-temp'           
bucket_path  = bucket_name   
bucket = storage.Bucket(bucket_path)
bucket.exists()  

I have tried various approaches based on the Google Datalab documentation but continue to fail. Thanks

EcoWarrior
  • I've developed a python package specifically designed for transferring data from one location (e.g. a pandas.DataFrame) to another one (e.g. BigQuery or Storage): [google-pandas-load.readthedocs.io/en/latest/](https://google-pandas-load.readthedocs.io/en/latest/). Moreover it has 100% test coverage. – augustin-barillec Dec 05 '19 at 09:38

10 Answers

54

Uploading to Google Cloud Storage without writing a temporary file, using only the standard GCS module:

from google.cloud import storage
import os
import pandas as pd

# Only need this if you're running this code locally.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'/your_GCP_creds/credentials.json'

df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

client = storage.Client()
bucket = client.get_bucket('my-bucket-name')
    
bucket.blob('upload_test/test.csv').upload_from_string(df.to_csv(), 'text/csv')
Theo
  • Really appreciate this one for using no other modules and an existing bucket. – adamc Jan 15 '20 at 19:14
  • If you only want to push the file to a bucket on GCS then this is a more suitable solution. This can also be used in case you want to push out json format: bucket.blob('upload_test/test.json').upload_from_string(df.to_json(), 'text/json') – Amjad Desai Feb 22 '21 at 20:56
  • Use `df.to_csv(index=False)` if you don't want the index as a column in your file – Nermin Jun 18 '21 at 06:44
  • Hi @Theo, could you please elaborate a bit more on this: "without writing a temporary"? Thanks! – RafalK Mar 29 '23 at 09:46
  • @RafalK a few years ago when I wrote this there were a few examples of how to do this, but they always included the intermediate step of writing the data frame to disk temporarily and then uploading the file or using another package to simplify the process. Writing to disk is unnecessary and I didn't want to add another package, so this solution avoids both constraints and is probably a bit faster as well. – Theo Mar 30 '23 at 12:37
44

I spent a lot of time finding the easiest way to solve this:

import pandas as pd

df = pd.DataFrame(...)

df.to_csv('gs://bucket/path')
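If the environment is not already authenticated, a minimal variant (assuming pandas >= 1.2 with gcsfs installed; the bucket path and key-file path below are placeholders) is to pass credentials through storage_options, which pandas hands to gcsfs:

import pandas as pd

df = pd.DataFrame({'a': [1, 4], 'b': [2, 5], 'c': [3, 6]})

# storage_options is forwarded to gcsfs; 'token' can point at a service-account key file
df.to_csv('gs://bucket/path/data.csv', index=False,
          storage_options={'token': '/path/to/service-account.json'})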
Vova Pytsyuk
  • This is hilariously simple. Just make sure to also install `gcsfs` as a prerequisite (though it'll remind you anyway). If you're coming here in 2020 or later, just skip the complexity and do this. – bsplosion Dec 16 '21 at 04:59
  • Is there a way to make a saved file publicly accessible directly by passing an argument? – Danish Bansal Jan 28 '22 at 09:31
  • It is not working. I created an Ubuntu server and installed `pip install pandas fsspec gcsfs`. I am able to read a csv file using `pd.read_csv('gs://BUCKET_PATH')` but not able to write. – Shiv Krishna Jaiswal Apr 09 '22 at 03:30
  • Found the answer to my own question. It is an access issue. See this **[link](https://stackoverflow.com/a/60877896/5785058)**. – Shiv Krishna Jaiswal Apr 09 '22 at 04:06
  • @bsplosion both `gcsfs` and `fsspec` are required as dependencies. (May 2023) – 刘宇翔 May 29 '23 at 01:43
23

Try the following working example:

from datalab.context import Context
import google.datalab.storage as storage
import google.datalab.bigquery as bq
import pandas as pd

# Dataframe to write
simple_dataframe = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Create storage bucket if it does not exist
if not sample_bucket.exists():
    sample_bucket.create()

# Define BigQuery dataset and table
dataset = bq.Dataset(bigquery_dataset_name)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)

# Create BigQuery dataset
if not dataset.exists():
    dataset.create()

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(simple_dataframe)
table.create(schema = table_schema, overwrite = True)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert(simple_dataframe)

I used this example, and the _table.py file from the datalab github site as a reference. You can find other datalab source code files at this link.

Chris Meyers
Anthonios Partheniou
  • Just a note: I believe you need to execute the %%storage commands in a separate cell from the Python code? – dartdog Mar 31 '16 at 16:13
  • It depends on whether you want to execute a line magic or cell magic command. For cell magic it is %%storage, for line magic it is %storage. It's ok to use line magic commands in the same cell as other code. Cell magic commands must be in a separate cell from other code. – Anthonios Partheniou Mar 31 '16 at 16:43
  • Thanks for the clarification – dartdog Mar 31 '16 at 17:24
  • Thanks very much Anthonios... I was able to successfully create all of the objects (e.g., the table and the schema are in my Project/Dataset in BQ). However, no rows were actually written to the table and no error messages were generated. – EcoWarrior Mar 31 '16 at 19:45
  • A populated table was generated in the Jupyter Notebook after table.Insert_data(out) and this line was at the bottom of that table: (rows: 0, edw-p19090000:ClickADS2.ADS_Logit1) – EcoWarrior Mar 31 '16 at 19:52
  • I found a similar stackoverflow question related to delayed data. Please check the solution at the following link to see if you are experiencing a similar issue: http://stackoverflow.com/questions/35656910/datalab-does-not-populate-bigquery-tables/35663982#35663982 – Anthonios Partheniou Mar 31 '16 at 23:52
  • It's very slow, this direct conversion from dataframe to BigQuery. Is there any way to make it faster? – Elona Mishmika Feb 12 '18 at 11:33
  • One possible faster method: write the csv to Google Cloud Storage first, then use the command line `bq` tool to load from GCS to BigQuery; a sketch of that path is shown after these comments. You could also look into using Google Cloud Dataflow. – Anthonios Partheniou Feb 12 '18 at 12:01
  • I'm doing this now :) However, I found when I wrote it into GCS, it doesn't have the comma to separate the columns. Do you have this problem too? – Elona Mishmika Feb 14 '18 at 00:15
  • IMPORTANT NOTE: google.datalab is now deprecated as per this note. Probably not the best package to use. https://pypi.org/project/datalab/ – Diana Vazquez Romo Nov 03 '22 at 14:57
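A rough sketch of the faster path mentioned in the comments above, using the google-cloud-bigquery Python client rather than the `bq` CLI (it assumes the CSV has already been written to GCS; the bucket, dataset and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Configure a load job for a headered CSV sitting in GCS
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

load_job = client.load_table_from_uri(
    'gs://my-bucket/data.csv', 'my_dataset.my_table', job_config=job_config
)
load_job.result()  # wait for the load to finish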
17

Using the Google Cloud Datalab documentation:

import datalab.storage as gcs
gcs.Bucket('bucket-name').item('to/data.csv').write_to(simple_dataframe.to_csv(),'text/csv')
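Here simple_dataframe is just an ordinary pandas DataFrame, e.g. (the values are illustrative):

import pandas as pd

simple_dataframe = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])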
Jan Krynauw
9

Writing a Pandas DataFrame to BigQuery

An update to @Anthonios Partheniou's answer.
The code is a bit different now, as of Nov. 29, 2017.

To define a BigQuery dataset

Pass a tuple containing project_id and dataset_id to bq.Dataset.

# define a BigQuery dataset    
bigquery_dataset_name = ('project_id', 'dataset_id')
dataset = bq.Dataset(name = bigquery_dataset_name)

To define a BigQuery table

Pass a tuple containing project_id, dataset_id and the table name to bq.Table.

# define a BigQuery table    
bigquery_table_name = ('project_id', 'dataset_id', 'table_name')
table = bq.Table(bigquery_table_name)

Create the dataset/table and write to the table in BQ

# Create BigQuery dataset
if not dataset.exists():
    dataset.create()

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(dataFrame_name)
table.create(schema = table_schema, overwrite = True)

# Write the DataFrame to a BigQuery table
table.insert(dataFrame_name)
Ekaba Bisong
8

Since 2017, pandas has had a DataFrame-to-BigQuery function, pandas.DataFrame.to_gbq

The documentation has an example:

import pandas_gbq as gbq
gbq.to_gbq(df, 'my_dataset.my_table', projectid, if_exists='fail')

The if_exists parameter can be set to 'fail', 'replace' or 'append'.

See also this example.
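Equivalently, to_gbq can be called directly on the DataFrame (the dataset, table and project id below are placeholders):

import pandas as pd

df = pd.DataFrame({'a': [1, 4], 'b': [2, 5], 'c': [3, 6]})

# Requires the pandas-gbq package; 'replace' overwrites the table if it already exists
df.to_gbq('my_dataset.my_table', project_id='my-project-id', if_exists='replace')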

intotecho
5

To save a parquet file in GCS, with authentication via a service account:

df.to_parquet("gs://<bucket-name>/file.parquet",
              storage_options={"token": "<path-to-gcs-service-account-file>"})
b0lle
4

I have a slightly simpler solution for the task using Dask. You can convert your DataFrame to a Dask DataFrame, which can be written to CSV on Cloud Storage:

import dask.dataframe as dd
import pandas as pd

df  # your pandas DataFrame
ddf = dd.from_pandas(df, npartitions=1, sort=True)

# 'token' is the credentials object/path used by your authenticated gcsfs session
ddf.to_csv('gs://YOUR_BUCKET/ddf-*.csv', index=False, sep=',', header=False,
           storage_options={'token': gcs.session.credentials})
Porada Kev
1

To Google Cloud Storage:

def write_df_to_gs(df, gs_key):
    df.to_csv(gs_key)    

To BigQuery:

def upload_df_to_bq(df, project, bq_table):
    df.to_gbq(bq_table, project_id=project, if_exists='replace')
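A hypothetical usage of these helpers (the bucket path, project id and table name are made up):

import pandas as pd

df = pd.DataFrame({'a': [1, 4], 'b': [2, 5], 'c': [3, 6]})

write_df_to_gs(df, 'gs://my-bucket/data.csv')                # needs gcsfs installed
upload_df_to_bq(df, 'my-project-id', 'my_dataset.my_table')  # needs pandas-gbq installed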
Franco Piccolo
0

I think you need to load it into a plain bytes variable and use a %%storage write --variable $sample_bucketpath (see the doc) in a separate cell... I'm still figuring it out, but it is roughly the inverse of what I needed to do to read a CSV file in. I don't know if it makes a difference on write, but I had to use BytesIO to read the buffer created by the %%storage read command. Hope it helps, let me know!
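Roughly, a sketch of that idea (untested; it reuses simple_dataframe and the %storage write syntax from the Datalab answer above, and the object path is a placeholder):

# Serialize the DataFrame to a plain string variable first
csv_data = simple_dataframe.to_csv()

# Line-magic form shown here; the %%storage cell magic would need its own cell
%storage write --variable csv_data --object gs://my-bucket/data.csv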

dartdog