I need to write to BigQuery from PubSub in Python. I tested some async subscriber code and it works fine. But this needs to run continuously and I am not 100% sure where to schedule this. I have been using Cloud Composer (Airflow) but it doesn't look like an ideal fit and it looks like Dataflow is the one recommended by GCP? Is that correct?

Or is there a way to run this from Cloud Composer reliably? I think I can run it once but I want to make sure it runs again in case it fails for some reason.

kee
  • To my understanding, you want something to continuously run, and when a Pub/Sub message arrives, write to BigQuery? Am I correct? – Maxim Nov 25 '18 at 12:15
  • Either is fine but knowing both would be great! @Maxim – kee Nov 25 '18 at 19:32
  • Perhaps a Cloud Function could do the job then. Take a look at a recent answer I gave to a similar question: https://stackoverflow.com/questions/53442893/how-do-i-load-a-file-from-cloud-storage-into-memory/53446007#53446007 – Maxim Nov 25 '18 at 19:44
  • How often are these messages arriving? If it's a regular stream of events, then Cloud Functions is not the right tool. Instead, use Dataflow. – Graham Polley Nov 26 '18 at 01:44
  • Depends on how frequently you want stuff to be posted to BQ. Dataflow for near real-time streaming ingests. If you want to periodically ingest at certain intervals, you can create Kubernetes cron jobs to run your ingest script from time to time. – khan Nov 26 '18 at 03:43

1 Answer

The two best ways to accomplish this would be either Cloud Functions or Cloud Dataflow. With Cloud Functions, you would set up a trigger on the Pub/Sub topic and write to BigQuery in your function code. It would look similar to the tutorial on streaming from Cloud Storage to BigQuery, except the input would be Pub/Sub messages. With Dataflow, you could use one of the Google-provided, open-source templates to write Pub/Sub messages to BigQuery.
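For the Cloud Functions route, a minimal sketch might look like the following. The table ID, row schema, and function name are placeholders you'd replace with your own; the Pub/Sub trigger delivers the message payload base64-encoded in `event["data"]`:

```python
import base64
import json

# Hypothetical destination table; replace with your own project/dataset/table.
TABLE_ID = "my-project.my_dataset.events"

def decode_event(event: dict) -> dict:
    """Decode the base64-encoded Pub/Sub payload into a BigQuery row dict."""
    payload = base64.b64decode(event["data"]).decode("utf-8")
    return json.loads(payload)

def pubsub_to_bq(event, context):
    """Background Cloud Function triggered by a message on the Pub/Sub topic."""
    # Imported here so the decoding helper above has no GCP dependencies.
    from google.cloud import bigquery

    row = decode_event(event)
    client = bigquery.Client()
    # Streaming insert; assumes the row keys match the table schema.
    errors = client.insert_rows_json(TABLE_ID, [row])
    if errors:
        # Raising makes the failure visible in Cloud Functions error reporting.
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```

You would deploy this with a `--trigger-topic` flag so it runs on every published message, with no scheduler involved.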

Cloud Dataflow would probably be better suited if your throughput is high (thousands of messages per second) and consistent. If you have low or infrequent throughput, Cloud Functions would likely be a better fit. Either of these solutions would run constantly and write the messages to BigQuery when available.
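If you did want to write the Dataflow pipeline yourself rather than use a provided template, a minimal streaming sketch in the Beam Python SDK could look like this. The project, subscription, table, and schema names are all placeholders:

```python
import json

def parse_message(payload: bytes) -> dict:
    """Decode a raw Pub/Sub payload (UTF-8 JSON) into a BigQuery row dict."""
    return json.loads(payload.decode("utf-8"))

def run():
    # Imported lazily so the parsing helper above stays dependency-free.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Launch with: --runner=DataflowRunner --project=... --region=...
    #              --temp_location=gs://... (plus --streaming via the option below)
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
               subscription="projects/my-project/subscriptions/my-sub")  # placeholder
         | "ParseJson" >> beam.Map(parse_message)
         | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
               "my-project:my_dataset.events",      # placeholder table
               schema="user:STRING,n:INTEGER",      # placeholder schema
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

if __name__ == "__main__":
    run()
```

The Google-provided Pub/Sub-to-BigQuery template does essentially this for you, so hand-rolling a pipeline is mainly worthwhile when you need custom parsing or transformation logic.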

Kamal Aboul-Hosn