6

I'm trying to migrate CSV files that were exported from BigQuery to Google Cloud Storage (GCS) into a PostgreSQL Google Cloud SQL instance using a Python script.

I was hoping to use the Google API but found this in the documentation:

Importing CSV data using the Cloud SQL Admin API is not supported for PostgreSQL instances.

As an alternative, I could use the psycopg2 library and stream the rows of the CSV file into the SQL instance. I can do this in three ways:

  • Line by line: read each line, then submit the insert command, then commit.
  • Batch stream: read each line, then submit the insert commands, committing after every 10, 100, etc. lines (a rough sketch of this variant follows the list).
  • The entire CSV: read each line and submit the insert commands, but only commit at the end of the document.
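For concreteness, here is roughly what the batch variant would look like with psycopg2; the connection string, table name and columns below are just placeholders for illustration:

import csv
import psycopg2

BATCH_SIZE = 1000  # commit after every 1000 rows

conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()

with open("users.csv", "r") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for i, row in enumerate(reader, start=1):
        cur.execute("INSERT INTO users (id, email) VALUES (%s, %s)", row)
        if i % BATCH_SIZE == 0:
            conn.commit()

conn.commit()  # commit the final partial batch
cur.close()
conn.close()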

My concern is that these CSV files could contain millions of rows, and running this process with any of the three options above seems like a bad idea to me.

What alternatives do I have? Essentially, I have some raw data in BigQuery which we preprocess before exporting to GCS in preparation for importing into the PostgreSQL instance. I need to get this preprocessed data from BigQuery into the PostgreSQL instance.

This is not a duplicate of this question, as I'm looking for a solution that exports data from BigQuery to the PostgreSQL instance, whether it be via GCS or directly.

DJ319
    Why not use Cloud Dataflow? Sounds like a good option for this. – Graham Polley Oct 03 '18 at 09:23
  • I don't have a good reason not to. This was supposed to be part of a quick and dirty test for another part of the project. I was hoping to do it without having to set up a Dataflow pipeline. I've never used Dataflow before. – DJ319 Oct 03 '18 at 09:30
  • It would be a very simple pipeline by the sounds of it. The nice thing is that it will scale for you and has native sources/sinks for BigQuery and CloudSQL. – Graham Polley Oct 03 '18 at 09:34
  • As an alternative, you can load the dataset into pandas, which has its own methods to send it over an SQL connection such as psycopg2. – Eir Nym Oct 03 '18 at 09:51
  • Possible duplicate of [How to import CSV file data into a PostgreSQL table?](https://stackoverflow.com/questions/2987433/how-to-import-csv-file-data-into-a-postgresql-table) – Eir Nym Oct 03 '18 at 09:54
  • Also, on a side note, does anyone know why importing CSVs isn't supported for PostgreSQL instances? – DJ319 Oct 03 '18 at 11:37

4 Answers

3

You can do the import process with Cloud Dataflow, as suggested by @GrahamPolley. It's true that this solution involves some extra work (getting familiar with Dataflow, setting everything up, etc.), but even so it would be the preferred solution for your situation. However, other solutions are available, and I'll explain one of them below.

To set up a migration process with Dataflow, this tutorial about exporting BigQuery to Google Datastore is a good example.
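If you do go the Dataflow route, a very rough sketch of the pipeline with the Beam Python SDK could look like the one below. It assumes a recent Beam release (for ReadFromBigQuery), that psycopg2 is installed on the workers, and that the project, query, table and connection details are placeholders to replace with your own:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class WriteToPostgres(beam.DoFn):
    """Inserts each BigQuery row (a dict) into a PostgreSQL table, one bundle per transaction."""

    def __init__(self, dsn, table):
        self.dsn = dsn
        self.table = table

    def start_bundle(self):
        import psycopg2  # imported here so it resolves on the Dataflow workers
        self.conn = psycopg2.connect(self.dsn)
        self.cur = self.conn.cursor()

    def process(self, row):
        # The column list is a placeholder; adapt it to your exported schema.
        self.cur.execute(
            "INSERT INTO {} (id, email) VALUES (%s, %s)".format(self.table),
            (row["id"], row["email"]),
        )

    def finish_bundle(self):
        self.conn.commit()
        self.cur.close()
        self.conn.close()


def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="europe-west1",
        temp_location="gs://my-bucket/tmp",
    )
    with beam.Pipeline(options=options) as p:
        (p
         | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(
               query="SELECT id, email FROM `my-project.my_dataset.users`",
               use_standard_sql=True)
         | "WriteToPostgres" >> beam.ParDo(
               WriteToPostgres("host=10.0.0.3 dbname=postgres user=postgres", "users")))

Dataflow then takes care of scaling the workers, so the number of rows is not a problem.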


Alternative solution to Cloud Dataflow

Cloud SQL for PostgreSQL doesn't support importing from a .CSV file, but it does support .SQL files:

The file type for the specified uri.
SQL: The file contains SQL statements.
CSV: The file contains CSV data. Importing CSV data using the Cloud SQL Admin API is not supported for PostgreSQL instances.

A direct solution would be to convert the .CSV files to .SQL with some tool (Google doesn't provide one that I know of, but there are many online) and then import them into PostgreSQL.

If you want to implement this solution in a more "programmatic" way, I would suggest using Cloud Functions. Here is an example of how I would try to do it:

  1. Set up a Cloud Function that triggers when a file is uploaded to a Cloud Storage bucket.
  2. Code the function to get the uploaded file and check whether it's a .CSV. If it is, use a csv-to-sql API (example of API here) to convert the file to .SQL (a rough sketch of such a function follows this list).
  3. Store the new file in Cloud Storage.
  4. Import it into the PostgreSQL instance.
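For steps 1 to 3, a background Cloud Function along these lines could even do a naive CSV-to-SQL conversion itself instead of calling an external csv-to-sql API. The bucket handling, table name and quoting below are simplified placeholders, and download_as_text assumes a reasonably recent google-cloud-storage client:

import csv
import io

from google.cloud import storage


def on_csv_upload(event, context):
    """Triggered by google.storage.object.finalize on the upload bucket."""
    bucket_name = event["bucket"]
    file_name = event["name"]
    if not file_name.lower().endswith(".csv"):
        return  # ignore anything that isn't a CSV

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    rows = csv.reader(io.StringIO(bucket.blob(file_name).download_as_text()))
    header = next(rows)  # column names come from the CSV header

    statements = []
    for row in rows:
        # Very naive quoting; a real converter should handle types and escaping properly.
        values = ", ".join("'{}'".format(v.replace("'", "''")) for v in row)
        statements.append(
            "INSERT INTO users ({}) VALUES ({});".format(", ".join(header), values)
        )

    # Step 3: store the converted file next to the original, ready for the import.
    bucket.blob(file_name[:-4] + ".sql").upload_from_string("\n".join(statements))

The resulting .SQL file can then be imported in step 4, for example with gcloud sql import sql [INSTANCE_NAME] gs://[BUCKET_NAME]/[FILE_NAME].sql --database=[DATABASE_NAME].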
Guillermo Cacheda
  • Thank you for your answer. I agree that setting up a Dataflow pipeline is the most correct way to solve this, which is why I will mark your answer as accepted. I have found another way to do it, which I will detail in an answer below; it allowed me to reuse the rest of the code I had already written. – DJ319 Oct 03 '18 at 11:12
3

Before you begin, you should make sure:

The database and table you are importing into must already exist on your Cloud SQL instance.

CSV file format requirements: CSV files must have one line for each row of data and have comma-separated fields.

Then, you can import data into the Cloud SQL instance from a CSV file in a GCS bucket by following these steps with gcloud:

  1. Describe the instance you are importing into:

gcloud sql instances describe [INSTANCE_NAME]

  2. Copy the serviceAccountEmailAddress field.

  3. Add the service account to the bucket ACL as a writer:

gsutil acl ch -u [SERVICE_ACCOUNT_ADDRESS]:W gs://[BUCKET_NAME]

  4. Add the service account to the import file as a reader:

gsutil acl ch -u [SERVICE_ACCOUNT_ADDRESS]:R gs://[BUCKET_NAME]/[IMPORT_FILE_NAME]

  5. Import the file:

gcloud sql import csv [INSTANCE_NAME] gs://[BUCKET_NAME]/[FILE_NAME] \
    --database=[DATABASE_NAME] --table=[TABLE_NAME]

  6. If you do not need to retain the permissions provided by the ACL you set previously, remove the ACL:

gsutil acl ch -d [SERVICE_ACCOUNT_ADDRESS] gs://[BUCKET_NAME]

Tiago Martins Peres
0

I found that the psycopg2 module has copy_from(), which allows loading an entire CSV file instead of streaming the rows individually. The downside of this method is that the CSV file still needs to be downloaded from GCS and stored locally.

Here are the details of using psycopg2's copy_from(). (From here)

import psycopg2

conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()
with open('user_accounts.csv', 'r') as f:
    # Notice that we don't need the `csv` module.
    next(f)  # Skip the header row.
    cur.copy_from(f, 'users', sep=',')  # a single COPY ... FROM STDIN instead of many INSERTs

conn.commit()
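Putting the two pieces together, the whole flow could look roughly like this (bucket, file and table names are placeholders, and it assumes the google-cloud-storage client library):

import psycopg2
from google.cloud import storage

# Download the exported CSV from GCS to a local temporary file first ...
client = storage.Client()
bucket = client.bucket("my-export-bucket")
bucket.blob("user_accounts.csv").download_to_filename("/tmp/user_accounts.csv")

# ... then bulk-load it with a single COPY instead of millions of INSERTs.
conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()
with open("/tmp/user_accounts.csv", "r") as f:
    next(f)  # skip the header row
    cur.copy_from(f, "users", sep=",")
conn.commit()
cur.close()
conn.close()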
DJ319
0

You could just use a class to make the text you are pulling from the internet behave like a file object, so that copy_from() can read it directly without a temporary file. I have used this several times.

import io
import sys


class IteratorFile(io.TextIOBase):
    """ given an iterator which yields strings,
    return a file like object for reading those strings """

    def __init__(self, obj):
        elements = "{}|" * len(obj[0])
        elements = (unicode(elements[:-1]).format(*x) for x in obj)
        self._it = elements
        self._f = io.cStringIO()

    def read(self, length=sys.maxsize):

        try:
            while self._f.tell() < length:
                self._f.write(next(self._it) + "\n")

        except StopIteration as e:
            # soak up StopIteration. this block is not necessary because
            # of finally, but just to be explicit
            pass

        except Exception as e:
            print("uncaught exception: {}".format(e))

        finally:
            self._f.seek(0)
            data = self._f.read(length)

            # save the remainder for next read
            remainder = self._f.read()
            self._f.seek(0)
            self._f.truncate(0)
            self._f.write(remainder)
            return data

    def readline(self):
        return next(self._it)
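A rough usage sketch with copy_from(); the rows, table name and connection string are placeholders:

import psycopg2

# e.g. rows already fetched from BigQuery or parsed from the GCS file
rows = [(1, "alice@example.com"), (2, "bob@example.com")]

conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()
cur.copy_from(IteratorFile(rows), "users", sep="|")  # matches the "|" separator used by the class
conn.commit()
cur.close()
conn.close()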
eatmeimadanish