
I have been trying to build a pipeline in Google Cloud Data Fusion where the data source is a 3rd-party API endpoint. I have been unable to successfully use the HTTP Plugin, but it has been suggested that I use Pub/Sub for the data ingestion.

I've been trying to follow this tutorial as a starting point, but it doesn't help me with the very first step of the process: ingesting data from the API endpoint.

Can anyone provide examples of using Pub/Sub -- or any other viable method -- to ingest data from an API endpoint and send that data down to Data Fusion for transformation and ultimately to BigQuery?

I will also need to be able to dynamically modify the URI (e.g., date filter parameters) in the GET request in this pipeline.

Korean_Of_the_Mountain
  • I found two topics in the documentation which I think might help you. One [1] explains the Data Fusion REST API, which has links to other documents. The second [2] explains the HTTP guidelines. Links: [1] https://cloud.google.com/data-fusion/docs/reference/rest/ , [2] https://cloud.google.com/apis/docs/http#working_with_wire_protocols_http – Alexandre Moraes Feb 10 '20 at 10:40
  • Correct me if I'm wrong, but those links seem to be focused on how to use Google service APIs. I'm not trying to figure out how to use an API to communicate with a Google service; I'm trying to figure out how to get a Google service to communicate with a 3rd-party API that is my data source. – Korean_Of_the_Mountain Feb 10 '20 at 16:17
  • @Korean_Of_the_Mountain did you ever solve this? I also saw your other question about Data Fusion. The real question seems to be, should we just use GCE or App Engine and manually set everything up using the BigQuery library, or is there another service that is more narrowly tailored to this use case that already has the right IAM roles and APIs enabled? – Chris Chiasson Mar 19 '21 at 14:33
  • I gave up on using DF as a pipeline tool and decided to use Composer/Airflow instead. After using Cloud Functions some more, I think that could be a possible workaround if you really want to use DF: create a CF with an HTTP trigger that calls the API and stores the API response in GCS, have the DF pipeline call the CF via HTTP, and then import the file that the CF saves in GCS and go to BQ from there (a rough sketch of that function is below). – Korean_Of_the_Mountain Mar 22 '21 at 22:16
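
For reference, a minimal sketch of that Cloud Function workaround might look like the following (Python). The API URL, bucket name, and date parameter are placeholders for illustration, not details from the question.

```python
# Hypothetical HTTP-triggered Cloud Function: call the 3rd-party API and
# stage the response in GCS so a Data Fusion pipeline can pick it up.
# API_URL, BUCKET, and the "updated_since" parameter are placeholders.
import datetime
import requests
from google.cloud import storage

API_URL = "https://api.example.com/v1/records"  # placeholder endpoint
BUCKET = "my-staging-bucket"                    # placeholder bucket


def ingest(request):
    # Build the GET request URI dynamically, e.g. with a date filter.
    date = request.args.get("date", datetime.date.today().isoformat())
    resp = requests.get(API_URL, params={"updated_since": date}, timeout=60)
    resp.raise_for_status()

    # Stage the raw response in GCS; the Data Fusion pipeline reads it from there.
    blob_name = f"api-dumps/{date}.json"
    storage.Client().bucket(BUCKET).blob(blob_name).upload_from_string(
        resp.text, content_type="application/json"
    )
    return f"gs://{BUCKET}/{blob_name}", 200
```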

1 Answer


In order to achieve the first step of the tutorial you are following:

Ingest CSV (Comma-separated values) data to BigQuery using Cloud Data Fusion.

You need to set up a functioning Pub/Sub system. This can be done via the command line, the console, or, best in your case, one of the client libraries. If you follow this tutorial you should have a functioning Pub/Sub system.
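
As a rough illustration of the client-library route, the sketch below (Python, using google-cloud-pubsub) creates a topic and publishes a 3rd-party API response to it. The project ID, topic ID, and API URL are placeholders, not values from your setup.

```python
# Minimal sketch: create a Pub/Sub topic and publish the raw API response to it.
# PROJECT_ID, TOPIC_ID, and the API URL are placeholders.
import requests
from google.api_core.exceptions import AlreadyExists
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"   # placeholder project
TOPIC_ID = "api-ingest"     # placeholder topic

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

# Create the topic once; ignore the error if it already exists.
try:
    publisher.create_topic(request={"name": topic_path})
except AlreadyExists:
    pass

# Fetch data from the 3rd-party API and publish it as a message (bytes).
payload = requests.get("https://api.example.com/v1/records", timeout=60).content
future = publisher.publish(topic_path, data=payload)
print(f"Published message {future.result()}")
```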

At that point you should be able to follow the original tutorial.

Paddy Popeye
  • So the tutorial isn't doing exactly what I want because it reads data from CSV. Assuming I already have a Pub/Sub system set up, what I'm still confused by is how to get Pub/Sub to ingest data from the API endpoint. I don't see in the docs where Pub/Sub has the functionality to generate a request URI and save the response to be used later in the pipeline. – Korean_Of_the_Mountain Feb 10 '20 at 16:15
  • I think you need to provide more information about your 3rd-party API. But, as per the documentation, data ingestion is possible via a REST API call: https://cloud.google.com/pubsub/ (see the sketch below). – Paddy Popeye Feb 11 '20 at 08:39
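
As a very rough illustration of that REST route, something like the following (Python, assuming Application Default Credentials are available) publishes a message to a topic over the Pub/Sub REST API. The topic name and payload are placeholders.

```python
# Sketch: publish a message via the Pub/Sub REST API (topics:publish).
# Assumes Application Default Credentials; the topic and payload are placeholders.
import base64
import json
import requests
import google.auth
from google.auth.transport.requests import Request

credentials, project = google.auth.default(
    scopes=["https://www.googleapis.com/auth/pubsub"]
)
credentials.refresh(Request())  # obtain an access token

topic = f"projects/{project}/topics/api-ingest"  # placeholder topic
url = f"https://pubsub.googleapis.com/v1/{topic}:publish"

body = {
    "messages": [
        {"data": base64.b64encode(b'{"hello": "world"}').decode("utf-8")}
    ]
}
resp = requests.post(
    url,
    headers={
        "Authorization": f"Bearer {credentials.token}",
        "Content-Type": "application/json",
    },
    data=json.dumps(body),
)
print(resp.json())  # contains the published messageIds on success
```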