52

Hello and thanks for your time and consideration. I am developing a Jupyter Notebook in Google Cloud Platform / Datalab. I have created a Pandas DataFrame and would like to write this DataFrame to Google Cloud Storage (GCS) and/or BigQuery. I have a bucket in GCS and have, via the following code, created the following objects:

import gcp
import gcp.storage as storage
project = gcp.Context.default().project_id    
bucket_name = 'steve-temp'           
bucket_path  = bucket_name   
bucket = storage.Bucket(bucket_path)
bucket.exists()  

I have tried various approaches based on the Google Datalab documentation but continue to fail. Thanks

EcoWarrior
  • I've developed a python package specifically designed for transferring data from one location (e.g. a pandas.DataFrame) to another one (e.g. BigQuery or Storage): [google-pandas-load.readthedocs.io/en/latest/](https://google-pandas-load.readthedocs.io/en/latest/). Moreover it has 100% test coverage. – augustin-barillec Dec 05 '19 at 09:38

10 Answers

54

Uploading to Google Cloud Storage without writing a temporary file, using only the standard GCS module:

from google.cloud import storage
import os
import pandas as pd

# Only need this if you're running this code locally.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'/your_GCP_creds/credentials.json'

df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

client = storage.Client()
bucket = client.get_bucket('my-bucket-name')
    
bucket.blob('upload_test/test.csv').upload_from_string(df.to_csv(), 'text/csv')
Theo
  • Really appreciate this one for using no other modules and an existing bucket. – adamc Jan 15 '20 at 19:14
  • If you only want to push the file to a bucket on GCS then this is a more suitable solution. This can also be used in case you want to push out json format: bucket.blob('upload_test/test.json').upload_from_string(df.to_json(), 'text/json') – Amjad Desai Feb 22 '21 at 20:56
  • Use `df.to_csv(index=False)` if you don't want the index as a column in your file – Nermin Jun 18 '21 at 06:44
  • Hi @Theo, could you please elaborate a bit more on this: "without writing a temporary"? Thanks! – RafalK Mar 29 '23 at 09:46
  • @RafalK a few years ago when I wrote this there were a few examples of how to do this, but they always included the intermediate step of writing the data frame to disk temporarily and then uploading the file or using another package to simplify the process. Writing to disk is unnecessary and I didn't want to add another package, so this solution avoids both constraints and is probably a bit faster as well. – Theo Mar 30 '23 at 12:37
44

I spent a lot of time finding the easiest way to solve this:

import pandas as pd

df = pd.DataFrame(...)

df.to_csv('gs://bucket/path')
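If the environment is not already authenticated, a minimal variant (assuming pandas >= 1.2 with gcsfs installed; the bucket path and key-file path below are placeholders) is to pass credentials through storage_options, which pandas hands to gcsfs:

import pandas as pd

df = pd.DataFrame({'a': [1, 4], 'b': [2, 5], 'c': [3, 6]})

# storage_options is forwarded to gcsfs; 'token' can point at a service-account key file
df.to_csv('gs://bucket/path/data.csv', index=False,
          storage_options={'token': '/path/to/service-account.json'})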
Vova Pytsyuk
  • This is hilariously simple. Just make sure to also install `gcsfs` as a prerequisite (though it'll remind you anyway). If you're coming here in 2020 or later, just skip the complexity and do this. – bsplosion Dec 16 '21 at 04:59
  • Is there a way to make a saved file publicly accessible directly by passing an argument? – Danish Bansal Jan 28 '22 at 09:31
  • It is not working. I created an Ubuntu server and installed `pip install pandas fsspec gcsfs`. I am able to read a csv file using `pd.read_csv('gs://BUCKET_PATH')` but not able to write. – Shiv Krishna Jaiswal Apr 09 '22 at 03:30
  • Found the answer to my own question. It is an access issue. See this **[link](https://stackoverflow.com/a/60877896/5785058)**. – Shiv Krishna Jaiswal Apr 09 '22 at 04:06
  • @bsplosion both `gcsfs` and `fsspec` are required as dependencies. (May 2023) – 刘宇翔 May 29 '23 at 01:43
23

Try the following working example:

from datalab.context import Context
import google.datalab.storage as storage
import google.datalab.bigquery as bq
import pandas as pd

# Dataframe to write
simple_dataframe = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Create storage bucket if it does not exist
if not sample_bucket.exists():
    sample_bucket.create()

# Define BigQuery dataset and table
dataset = bq.Dataset(bigquery_dataset_name)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)

# Create BigQuery dataset
if not dataset.exists():
    dataset.create()

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(simple_dataframe)
table.create(schema = table_schema, overwrite = True)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert(simple_dataframe)

I used this example, and the _table.py file from the datalab github site as a reference. You can find other datalab source code files at this link.

Chris Meyers
Anthonios Partheniou
  • Just a note: I believe you need to execute the %%storage commands in a separate cell from the Python code? – dartdog Mar 31 '16 at 16:13
  • It depends on whether you want to execute a line magic or cell magic command. For cell magic it is %%storage, for line magic it is %storage. It's ok to use line magic commands in the same cell as other code. Cell magic commands must be in a separate cell from other code. – Anthonios Partheniou Mar 31 '16 at 16:43
  • Thanks for the clarification – dartdog Mar 31 '16 at 17:24
  • Thanks very much Anthonios... I was able to successfully create all of the objects (e.g., the table and the schema are in my Project/Dataset in BQ). However, no rows were actually written to the table and no error messages were generated. – EcoWarrior Mar 31 '16 at 19:45
  • A populated table was generated in the Jupyter Notebook after table.Insert_data(out) and this line was at the bottom of that table: (rows: 0, edw-p19090000:ClickADS2.ADS_Logit1) – EcoWarrior Mar 31 '16 at 19:52
  • I found a similar stackoverflow question related to delayed data. Please check the solution at the following link to see if you are experiencing a similar issue: http://stackoverflow.com/questions/35656910/datalab-does-not-populate-bigquery-tables/35663982#35663982 – Anthonios Partheniou Mar 31 '16 at 23:52
  • It's very slow, this direct conversion from dataframe to BigQuery. Is there any way to make it faster? – Elona Mishmika Feb 12 '18 at 11:33
  • One possible faster method: write the csv to Google Cloud Storage first, then use the command line `bq` tool to load from GCS to BigQuery; a sketch of that path is shown after these comments. You could also look into using Google Cloud Dataflow. – Anthonios Partheniou Feb 12 '18 at 12:01
  • I'm doing this now :) However, I found when I wrote it into GCS, it doesn't have the comma to separate the columns. Do you have this problem too? – Elona Mishmika Feb 14 '18 at 00:15
  • IMPORTANT NOTE: google.datalab is now deprecated as per this note. Probably not the best package to use. https://pypi.org/project/datalab/ – Diana Vazquez Romo Nov 03 '22 at 14:57
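A rough sketch of the faster path mentioned in the comments above, using the google-cloud-bigquery Python client rather than the `bq` CLI (it assumes the CSV has already been written to GCS; the bucket, dataset and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Configure a load job for a headered CSV sitting in GCS
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

load_job = client.load_table_from_uri(
    'gs://my-bucket/data.csv', 'my_dataset.my_table', job_config=job_config
)
load_job.result()  # wait for the load to finish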
17

Using the Google Cloud Datalab documentation:

import datalab.storage as gcs
gcs.Bucket('bucket-name').item('to/data.csv').write_to(simple_dataframe.to_csv(),'text/csv')
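Here simple_dataframe is just an ordinary pandas DataFrame, e.g. (the values are illustrative):

import pandas as pd

simple_dataframe = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])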
Jan Krynauw
9

Writing a Pandas DataFrame to BigQuery

An update to @Anthonios Partheniou's answer.
The code is a bit different now, as of Nov. 29, 2017.

To define a BigQuery dataset

Pass a tuple containing project_id and dataset_id to bq.Dataset.

# define a BigQuery dataset    
bigquery_dataset_name = ('project_id', 'dataset_id')
dataset = bq.Dataset(name = bigquery_dataset_name)

To define a BigQuery table

Pass a tuple containing project_id, dataset_id and the table name to bq.Table.

# define a BigQuery table    
bigquery_table_name = ('project_id', 'dataset_id', 'table_name')
table = bq.Table(bigquery_table_name)

Create the dataset/table and write to the table in BQ

# Create BigQuery dataset
if not dataset.exists():
    dataset.create()

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(dataFrame_name)
table.create(schema = table_schema, overwrite = True)

# Write the DataFrame to a BigQuery table
table.insert(dataFrame_name)
Ekaba Bisong
8

Since 2017, pandas has had a DataFrame-to-BigQuery function, pandas.DataFrame.to_gbq

The documentation has an example:

import pandas_gbq as gbq
gbq.to_gbq(df, 'my_dataset.my_table', projectid, if_exists='fail')

The if_exists parameter can be set to 'fail', 'replace' or 'append'.

See also this example.
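Equivalently, to_gbq can be called directly on the DataFrame (the dataset, table and project id below are placeholders):

import pandas as pd

df = pd.DataFrame({'a': [1, 4], 'b': [2, 5], 'c': [3, 6]})

# Requires the pandas-gbq package; 'replace' overwrites the table if it already exists
df.to_gbq('my_dataset.my_table', project_id='my-project-id', if_exists='replace')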

intotecho
5

To save a parquet file in GCS, with authentication via a service account:

df.to_parquet("gs://<bucket-name>/file.parquet",
              storage_options={"token": "<path-to-gcs-service-account-file>"})
b0lle
4

I have a slightly simpler solution for the task using Dask. You can convert your DataFrame to a Dask DataFrame, which can be written to CSV on Cloud Storage:

import dask.dataframe as dd
import pandas as pd

df  # your pandas DataFrame
ddf = dd.from_pandas(df, npartitions=1, sort=True)

# 'token' is the credentials object/path used by your authenticated gcsfs session
ddf.to_csv('gs://YOUR_BUCKET/ddf-*.csv', index=False, sep=',', header=False,
           storage_options={'token': gcs.session.credentials})
Porada Kev
1

To Google Cloud Storage:

def write_df_to_gs(df, gs_key):
    df.to_csv(gs_key)    

To BigQuery:

def upload_df_to_bq(df, project, bq_table):
    df.to_gbq(bq_table, project_id=project, if_exists='replace')
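A hypothetical usage of these helpers (the bucket path, project id and table name are made up):

import pandas as pd

df = pd.DataFrame({'a': [1, 4], 'b': [2, 5], 'c': [3, 6]})

write_df_to_gs(df, 'gs://my-bucket/data.csv')                # needs gcsfs installed
upload_df_to_bq(df, 'my-project-id', 'my_dataset.my_table')  # needs pandas-gbq installed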
Franco Piccolo
0

I think you need to load it into a plain bytes variable and use a %%storage write --variable $sample_bucketpath (see the doc) in a separate cell... I'm still figuring it out, but it is roughly the inverse of what I needed to do to read a CSV file in. I don't know if it makes a difference on write, but I had to use BytesIO to read the buffer created by the %%storage read command. Hope it helps, let me know!
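Roughly, a sketch of that idea (untested; it reuses simple_dataframe and the %storage write syntax from the Datalab answer above, and the object path is a placeholder):

# Serialize the DataFrame to a plain string variable first
csv_data = simple_dataframe.to_csv()

# Line-magic form shown here; the %%storage cell magic would need its own cell
%storage write --variable csv_data --object gs://my-bucket/data.csv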

dartdog