12

I have a pandas dataframe and want to create a BigQuery table from it. I understand that there are many posts asking this question, but all the answers I can find so far require explicitly specifying the schema of every column. For example:

from google.cloud import bigquery as bq

client = bq.Client()

dataset_ref = client.dataset('my_dataset', project='my_project')
table_ref = dataset_ref.table('my_table')

job_config = bq.LoadJobConfig(
    schema=[
        bq.SchemaField("a", bq.enums.SqlTypeNames.STRING),
        bq.SchemaField("b", bq.enums.SqlTypeNames.INT64),
        bq.SchemaField("c", bq.enums.SqlTypeNames.FLOAT64),
    ]
)

client.load_table_from_dataframe(my_df, table_ref, job_config=job_config).result()

However, sometimes I have a dataframe with many columns (for example, 100), and it's really non-trivial to specify the schema for every one of them. Is there a way to do it efficiently?

Btw, I found this post with a similar question: Efficiently write a Pandas dataframe to Google BigQuery. But it seems that bq.Schema.from_dataframe does not exist:

AttributeError: module 'google.cloud.bigquery' has no attribute 'Schema'
user2830451
  • Is writing your dataframe to a csv file and then autoloading the file into BigQuery an option? – Lak Aug 01 '20 at 04:16
  • Thanks for the question. I can do that, but I'm trying to figure out a way that can store it to a BQ table directly. – user2830451 Aug 02 '20 at 06:55

4 Answers

19

Here's a code snippet to load a DataFrame to BQ:

import pandas as pd
from google.cloud import bigquery

# Example data
df = pd.DataFrame({'a': [1,2,4], 'b': ['123', '456', '000']})

# Load client
client = bigquery.Client(project='your-project-id')

# Define table name, in format dataset.table_name
table = 'your-dataset.your-table'

# Load data to BQ
job = client.load_table_from_dataframe(df, table)

If you want to specify only a subset of the schema and still import all the columns, you can replace the last line with:

# Define a job config object, with a subset of the schema
job_config = bigquery.LoadJobConfig(schema=[bigquery.SchemaField('b', 'STRING')])

# Load data to BQ
job = client.load_table_from_dataframe(df, table, job_config=job_config)
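
As a quick check (a small sketch building on the variables above), you can wait for the job to finish and then look at the schema BigQuery inferred for the columns you did not list:

# Block until the load job completes
job.result()

# Fetch the table and inspect the auto-detected schema
table_obj = client.get_table(table)
print(table_obj.schema)
print(f"Loaded {table_obj.num_rows} rows")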
Matteo Felici
  • Thanks. But what I'm looking for is a way to import ALL columns of a dataframe WITHOUT specifying schema. – user2830451 Aug 02 '20 at 20:34
  • If you do not want to specify any schema at all, you can still use the above code without defining the `job_config` parameter (I'm editing the answer for clarification). In the above example, I'm still importing both columns even specifying the schema for only one. – Matteo Felici Aug 03 '20 at 07:21
  • Thanks for the clarification. That works. I'm curious though: why is the whole dataframe still imported when the job_config only specified one column? – user2830451 Aug 05 '20 at 05:07
  • I'm not sure, but I think it works like the `dtype` parameter in `pandas.read_csv`: you can specify a subset of the schema, but by default it imports all the data guessing all the other column types. – Matteo Felici Aug 05 '20 at 15:17
2

Here is the working code:

from google.cloud import bigquery
import pandas as pd

bigqueryClient = bigquery.Client()
tableRef = bigqueryClient.dataset("dataset-name").table("table-name")

dataFrame = pd.read_csv("file-name")

bigqueryJob = bigqueryClient.load_table_from_dataframe(dataFrame, tableRef)
bigqueryJob.result()
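
If the table already exists and you rerun the load, you may also want to control whether rows are appended or the table is overwritten. Here is a minimal sketch using the same client and table reference as above, passing a write disposition on the load job config (WRITE_TRUNCATE overwrites; WRITE_APPEND adds rows):

jobConfig = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE  # overwrite the table
)
bigqueryJob = bigqueryClient.load_table_from_dataframe(dataFrame, tableRef, job_config=jobConfig)
bigqueryJob.result()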
Soumendra Mishra
1

Now it's as easy as installing pandas-gbq==0.18.1 and then:


    from google.oauth2 import service_account

    df.to_gbq(
        destination_table="my_project_id.my_dataset.my_table",
        project_id="my_project_id",
        credentials=service_account.Credentials.from_service_account_info(
            my_service_account_info  # there are several ways to authenticate
        ),
    )

Docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_gbq.html

See the How to authenticate with Google BigQuery guide for authentication instructions.
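
If you're running somewhere Application Default Credentials are already set up (for example after gcloud auth application-default login, or on a GCP VM), pandas-gbq can usually pick those up, so the credentials argument can be omitted. A minimal sketch under that assumption, with placeholder project/dataset/table names:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

    # No explicit credentials: pandas-gbq falls back to its default authentication.
    # The schema is inferred from the DataFrame dtypes.
    df.to_gbq(
        destination_table="my_dataset.my_table",
        project_id="my_project_id",
        if_exists="replace",  # or "append" / "fail"
    )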

jeffhale
0
import pandas as pd
# Create an example DataFrame
data = {'column1': [1, 2, 3], 'column2': ['A', 'B', 'C']}
df = pd.DataFrame(data)

# Define the destination table name in BigQuery
table_id = 'dataset_name.table_name'

# Load the DataFrame into BigQuery
df.to_gbq(destination_table=table_id, project_id='your-project-id',
          if_exists='replace')
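
Note that if_exists='replace' drops and recreates the table on every run; pandas-gbq also accepts 'append' to add rows to an existing table, and 'fail' (the default) to raise an error if the table already exists.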
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Diego Borba Aug 31 '23 at 03:12