2

I am trying to stream data into BQ from a Scala application. Looking at the samples listed at Streaming Data Into BigQuery, I see that the data needs to be passed in as a Map<String, Object> using TableDataInsertAllRequest.Rows().setJson(), as sketched after the questions below.

  1. Is this the only way to pass data in?
  2. Given that this represents data that will be streamed in as JSON by the BQ connector library, is it possible to pass data in as a JSON string instead of a Map<String, Object>? If not, is there any reason for this?
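
For reference, the pattern from those samples looks roughly like this (a minimal sketch assuming an already-authorized Bigquery client from the generated Java API; project, dataset, and table names are placeholders):

import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.TableDataInsertAllRequest;
import com.google.api.services.bigquery.model.TableDataInsertAllResponse;

import java.io.IOException;
import java.util.Collections;
import java.util.Map;

// Wraps one record (already a Map<String, Object>) in an insertAll request
// and streams it to the table via tabledata().insertAll().
static TableDataInsertAllResponse streamRow(Bigquery bigquery, Map<String, Object> rowData)
    throws IOException {
  TableDataInsertAllRequest request = new TableDataInsertAllRequest().setRows(
      Collections.singletonList(new TableDataInsertAllRequest.Rows().setJson(rowData)));
  return bigquery.tabledata()
      .insertAll("my-project", "my_dataset", "my_table", request)
      .execute();
}
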
swish41ffl
  • 137
  • 1
  • 8
  • Sure. Hoping to get more answers/insights from the BQ team. https://cloud.google.com/bigquery/bigquery-api-quickstart says that Google Engineers monitor this tag. – swish41ffl Jan 25 '16 at 07:49
  • Well, they do, but they might not be the ones who publish the connector libraries. Actually, you should edit your original question and name the connector library you are using. – Pentium10 Jan 25 '16 at 07:51

4 Answers

1

Unfortunately the generated libraries over our (or any Google Cloud Platform) API don't support directly writing out the request body. It's likely this aids in ensuring the validity of requests. That said, there is active work on the client library front, and a helper method seems like a reasonable request. The overhead would likely still be present (parse to client representation) for the aforementioned validation purposes, but the client interface would be a bit simpler for your scenario.

I'll pass on your request. In the meantime, this question's answer mentions a library that seems like it will ease your translation work:

Convert Json to Map
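
For example, a minimal sketch of that translation, assuming Jackson as the JSON-to-Map library (any library that yields a Map<String, Object> works the same way):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.api.services.bigquery.model.TableDataInsertAllRequest;

import java.io.IOException;
import java.util.Collections;
import java.util.Map;

// Parses the incoming JSON string into the Map<String, Object> the client
// library expects and sets it on an insertAll request row.
static TableDataInsertAllRequest toInsertAllRequest(String json) throws IOException {
  @SuppressWarnings("unchecked")
  Map<String, Object> row = new ObjectMapper().readValue(json, Map.class);
  return new TableDataInsertAllRequest().setRows(
      Collections.singletonList(new TableDataInsertAllRequest.Rows().setJson(row)));
}

The resulting request can then be passed to tabledata().insertAll(...) as usual.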

Sean Chen
  • 651
  • 4
  • 6
  • Oops. Just noticed you're working in Scala. Perhaps some equivalent library exists, or perhaps there is easy interop? Quick googling implies it should be: http://www.scala-lang.org/old/faq/4 – Sean Chen Jan 25 '16 at 19:14
  • My app is the last mile in a data pipeline before data is pushed to BQ. It converts incoming JSON to a model (Scala case class) if the data needs to be massaged further; otherwise it can stream it as is. I believe the content being pushed is validated (against the table schema) on the BQ side and not in the client library. Also, the BQ client library would have to serialize the Map to some form (most likely JSON if the client library uses REST) before transmission. So, I was thinking that having a method to pass in a JSON string may simplify things if the client app just acts as a conduit. Will find a way to work around! – swish41ffl Jan 25 '16 at 21:22
  • You are correct that we validate the data structure against the table's schema. However, what I meant was just general request structure validation in the client -- well formatted json, well formatted body (since there are other fields like the insert id, etc). I've mentioned adding a thin helper method to our client library team, so we'll see where that goes. However, the library I mentioned above should at least let you call into the existing library with minimal logic on your end -- use the json-to-map library to get a map, set the map on the request. – Sean Chen Jan 25 '16 at 23:14
1

I think you should be able to stream JSON content via the BigQuery API in gcloud-java by using a TableDataWriteChannel.

This means it should also be doable without gcloud-java (using the API client directly), though you may need to repeat some of the code the library does for you.

I highly recommend looking at gcloud-java, and feel free to file a feature request for also supporting JSON content in the insertAll operation.

ozarov
  • 1,051
  • 6
  • 7
1

I also suggest you look at the BigQuery API in gcloud-java. In gcloud-java you can use a TableDataWriteChannel to stream data to a BigQuery table. See the following example (where JSON_CONTENT is a string of JSON):

BigQuery bigquery = BigQueryOptions.defaultInstance().service();
TableId tableId = TableId.of("dataset", "table");
// Tell BigQuery to expect newline-delimited JSON for this load.
LoadConfiguration configuration = LoadConfiguration.builder(tableId)
    .formatOptions(FormatOptions.json())
    .build();
// Open a resumable write channel to the table and stream the JSON bytes into it.
try (TableDataWriteChannel channel = bigquery.writer(configuration)) {
  channel.write(
      ByteBuffer.wrap(JSON_CONTENT.getBytes(StandardCharsets.UTF_8)));
} catch (IOException e) {
  // handle exception
}

TableDataWriteChannel uses resumable upload to stream data to the BigQuery table, which makes it more suitable for large files.

A TableDataWriteChannel can also be used to stream local files:

int chunkSize = 8 * 256 * 1024;
BigQuery bigquery = BigQueryOptions.defaultInstance().service();
TableId tableId = TableId.of("dataset", "table");
LoadConfiguration configuration = LoadConfiguration.builder(tableId)
    .formatOptions(FormatOptions.json())
    .build();
// Read the local file in chunks and forward each chunk to the write channel.
try (FileChannel fileChannel = FileChannel.open(Paths.get("file.json"))) {
  WriteChannel writeChannel = bigquery.writer(configuration);
  long position = 0;
  long written = fileChannel.transferTo(position, chunkSize, writeChannel);
  while (written > 0) {
    position += written;
    written = fileChannel.transferTo(position, chunkSize, writeChannel);
  }
  // Closing the channel completes the upload and starts the load job.
  writeChannel.close();
}

For other examples on gcloud-java-bigquery you can have a look at BigQueryExample.

mziccard
  • 2,158
  • 9
  • 17
  • The "resumable upload" link describes the form of data load that does not use the BigQuery streaming ingestion API. It may be fine for swish41ffl's purposes. A couple things to note: 1) This will use your table's and project's daily load quota 2) Under the covers this executes a load job, and thus may take some time before the data becomes available 3) As with any load job, proper usage should involve polling the job state to verify successful completion. – Sean Chen Jan 28 '16 at 18:36
  • re: "suitability for big data" -- Data volume doesn't really apply here. The API to use really depends on requirements on freshness of data, ability to collect data into a single load (for efficiency sake), client side infrastructure, etc. Anecdotally, we have users ingest O(100s) TB of data per day via the streaming API. – Sean Chen Jan 28 '16 at 18:42
  • 1
    "suitability for big data": Maybe my statement was not clear, but quoting from the docs resumable upload link: "It is especially useful if you are transferring large files and the likelihood of a network interruption or some other transmission failure is high, for example, when uploading from a mobile client app." – mziccard Jan 28 '16 at 18:46
0
  1. That's the only way to stream data in. There is batch loading for large files, documented here, but for that you need to move the file to GCS and issue the import job from there (see the sketch after this list).

  2. Well, the answer there is that usually the BQ connector library handles the conversion; at least that's how it works in Java and PHP, so instead of a string you need to pass objects.
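
A minimal sketch of the GCS load-job route, assuming the generated Java API client (bucket, project, dataset, and table names are placeholders):

import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationLoad;
import com.google.api.services.bigquery.model.TableReference;

import java.io.IOException;
import java.util.Collections;

// Starts a load job that imports a newline-delimited JSON file already sitting in GCS.
static Job loadFromGcs(Bigquery bigquery) throws IOException {
  JobConfigurationLoad load = new JobConfigurationLoad()
      .setSourceUris(Collections.singletonList("gs://my-bucket/data.json"))
      .setSourceFormat("NEWLINE_DELIMITED_JSON")
      .setDestinationTable(new TableReference()
          .setProjectId("my-project")
          .setDatasetId("my_dataset")
          .setTableId("my_table"));
  Job job = new Job().setConfiguration(new JobConfiguration().setLoad(load));
  // The load job runs asynchronously; poll jobs().get(...) until its status is DONE.
  return bigquery.jobs().insert("my-project", job).execute();
}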

Pentium10
  • 204,586
  • 122
  • 423
  • 502
  • I understand data being represented as a Map has its advantages. But if I already have the data to be streamed as a JSON string, I feel I should be able to pass it in as is. In other words, I think the connector library additionally supporting such a method would avoid having to convert the JSON string to a Map (in the BQ client app) and then back to JSON (in the BQ connector library). – swish41ffl Jan 25 '16 at 07:35
  • Find the connector library author and issue a feature request. – Pentium10 Jan 25 '16 at 07:41
  • :) The original intent of asking the second part of the question was to see if I am overlooking something obvious, because this seems like a very basic requirement - the ability to pass a JSON string, I mean. – swish41ffl Jan 25 '16 at 07:46
  • As far as I know, none of the libraries do that. So it may just be obvious to you. – Pentium10 Jan 25 '16 at 07:50