
I want to load data from hundreds of CSV files on Google Cloud Storage and append them to a single table in BigQuery on a daily basis using Cloud Dataflow (preferably using the Python SDK). Can you please let me know how I can accomplish that?

Thanks

Parth Desai
    What exactly is the question? Yes, you can use the Python SDK for Dataflow to load to a BigQuery table. Or (my preference) just load into BigQuery and then run a query to transform. – Elliott Brossard May 20 '18 at 18:50
  • Hi Elliott! I want to run a daily batch job which loads data from CSV files in Cloud Storage to a BigQuery table. I agree with your point, but even for a direct load the Dataflow code needs to convert each CSV line to JSON (key-value pairs) to be written to BigQuery; BigQuery only accepts key-value pairs when you use Apache Beam (the Dataflow SDK). – Parth Desai May 20 '18 at 19:21
  • It sounds very similar to [this answer](https://stackoverflow.com/a/45108626/6253347). – Elliott Brossard May 20 '18 at 19:25
  • Yes, but it's in Java; I am more familiar with Python. Any resources for Python to convert a PCollection to TableRow? – Parth Desai May 20 '18 at 19:46
  • https://stackoverflow.com/a/47379508/6253347 – Elliott Brossard May 20 '18 at 19:57
  • Did you have a look at the question linked by Elliott Brossard? Did this solve your issue? – arudzinska Jun 21 '18 at 15:41

1 Answer


You can do this in Python as well. The snippet below converts each CSV line into a dictionary (keyed by column name) that can be written to BigQuery:

def format_output_json(element):
    """
    :param element: a single line of the CSV file (a comma-separated string).
    :return: a list containing one dictionary, keyed by column name, with the
             corresponding values from that CSV row.

    The column names are hard-coded here, but they could also be determined
    at runtime (for example, from the CSV header or a side input).
    """
    column_names = ['time_stamp', 'product_name', 'units_sold', 'retail_price']
    # Note: a plain split(',') does not handle quoted fields containing commas;
    # use the csv module if your data may include them.
    row_data = element.split(',')
    return [dict(zip(column_names, row_data))]
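
For context, here is a minimal sketch (not part of the original answer) of how this function might be wired into a Beam pipeline that reads every CSV file under a Cloud Storage prefix and appends the rows to a BigQuery table. The project, region, bucket, dataset, table, and schema names are placeholders you would replace with your own, and the schema assumes the four-column layout used above.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Placeholder options -- replace project, region, and bucket with your own.
    # save_main_session lets Dataflow workers see functions defined in __main__.
    options = PipelineOptions(
        runner='DataflowRunner',
        project='your-project-id',
        region='us-central1',
        temp_location='gs://your-bucket/tmp',
        save_main_session=True,
    )

    # Schema matching the hard-coded column names above.
    table_schema = ('time_stamp:TIMESTAMP,product_name:STRING,'
                    'units_sold:INTEGER,retail_price:FLOAT')

    with beam.Pipeline(options=options) as p:
        (p
         # The glob matches every CSV file under the input prefix;
         # skip_header_lines drops the header row if your files have one.
         | 'ReadCSVs' >> beam.io.ReadFromText('gs://your-bucket/input/*.csv',
                                              skip_header_lines=1)
         # format_output_json returns a one-element list, so FlatMap yields dicts.
         | 'ToDicts' >> beam.FlatMap(format_output_json)
         | 'WriteToBQ' >> beam.io.WriteToBigQuery(
               'your-project-id:your_dataset.your_table',
               schema=table_schema,
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))


if __name__ == '__main__':
    run()

Launching this once per day (for example from a scheduled job that starts the Dataflow pipeline) gives the daily append behaviour described in the question, since WRITE_APPEND adds the new rows to the existing table.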
Atul Anand