
I am trying to execute a query over a table in BigQuery using its Java client libraries. I create a Job and then fetch the results using the job.getQueryResults().iterateAll() method.

This works, but for large results (around 600k rows) it takes around 80-120 seconds. I can see that BigQuery fetches the data in batches of 40-45k rows, each of which takes around 5-7 seconds.

I want to get the results faster. I found on the internet that if we can get the temporary table created by BigQuery from the Job and then read the data from that table in Avro or some other format, it will be really fast, but I don't see a way to do that in the BigQuery API (using version 1.124.7). Does anyone know how to do that in Java, or how to get data faster for a large number of records? Any help is appreciated.

Code to read the table (takes 20 sec)

Table table = bigQueryHelper.getBigQueryClient().getTable(TableId.of("project", "dataset", "table"));
String format = "CSV";
String gcsUrl = "gs://name/test.csv";
Job job = table.extract(format, gcsUrl);
// Wait for the job to complete
try {
  Job completedJob = job.waitFor(RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
          RetryOption.totalTimeout(Duration.ofMinutes(3)));
  if (completedJob != null && completedJob.getStatus().getError() == null) {
    log.info("job done");
    // Job completed successfully
  } else {
    log.info("job has error");
    // Handle error case
  }
} catch (InterruptedException e) {
  // Handle interrupted wait
}

Code to read the same table using a query (takes 90 sec)

Job job = bigQueryHelper.getBigQueryClient().getJob(JobId.of(jobId));
for (FieldValueList row : job.getQueryResults().iterateAll()) {
  System.out.println(row);
}
  • May I ask you to send a link to the source of the information "... I found on the internet that if we can get the temporary table created by BigQuery from the Job and then read the data from that table in Avro or some other format, it will be really fast", please? – Kyrylo Bulat Jan 07 '21 at 07:24
  • Do not confuse reading a table that is already created versus executing a query that creates a table and then reading that table. The total aggregate time will be similar. – John Hanley Jan 07 '21 at 07:26
  • @JohnHanley Actually I see a difference in time between the two approaches. The code added in my description reads the same table (takes 20 sec), whereas reading it with a query takes 90 sec. – Pankaj Bajpai Jan 07 '21 at 15:21
  • @KyryloBulat This link talks about something like this, and I found one more like it, but those use the old API, not the new one: https://stackoverflow.com/questions/14807752/how-do-i-request-paginated-bigquery-query-results-using-pagetokens-with-the-goog – Pankaj Bajpai Jan 07 '21 at 15:31

1 Answer


I tried a few approaches and, based on that, found the best way of doing it. I thought I'd post it here to help someone in the future.

1: Whether we use job.getQueryResults().iterateAll() on the job or directly on the table, it takes the same time. If we don't specify a page size, BigQuery uses a batch size of around 35-45k rows to fetch the data, so for 600k rows (~180 MB) it takes 70-100 sec. A minimal sketch with an explicit page size follows below.
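For reference, a sketch of this approach with an explicit page size (the 100000 value is just an example; bigQueryHelper is the same wrapper used in the question, and all classes come from com.google.cloud.bigquery, as in the snippets above):

// Option 1: page through the query results. A bigger page size means fewer
// round trips, but the total wall time stays roughly the same.
Job job = bigQueryHelper.getBigQueryClient().getJob(JobId.of(jobId));
TableResult result =
    job.getQueryResults(BigQuery.QueryResultsOption.pageSize(100000)); // example page size
for (FieldValueList row : result.iterateAll()) {
  // process each row
}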

2: We can take the temporary (destination) table details from the completed query job and use the table's extract-job feature to write the result to GCS; this is faster and takes around 30-35 sec. Note that this approach does not download the data locally; for that we would again need to use ...iterateAll() on the temp table, which takes the same time as option 1.

Example code:

try {
      Job job = bigQueryHelper.getBigQueryClient().getJob(JobId.of(jobId));
      long start = System.currentTimeMillis();

      Job completedJob =
          job.waitFor(
              RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
              RetryOption.totalTimeout(Duration.ofMinutes(3)));
      if (completedJob != null && completedJob.getStatus().getError() == null) {
        log.info("job done");
        String gcsUrl = "gs://bucketname/test";
        //getting the temp table information of the Job
        TableId destinationTableInfo =
            ((QueryJobConfiguration) job.getConfiguration()).getDestinationTable();
        log.info("Total time taken in getting schema ::{}", (System.currentTimeMillis() - start));
        log.info("DestinationTableInfo :: {}", destinationTableInfo);
        //Using an extract job to write the data to GCS as tab-delimited CSV.
        //Table.extract(format, gcsUrl) only takes a format name, so a custom
        //field delimiter needs an ExtractJobConfiguration instead.
        ExtractJobConfiguration extractConfig =
            ExtractJobConfiguration.newBuilder(destinationTableInfo, gcsUrl)
                .setFormat("CSV")
                .setFieldDelimiter("\t")
                .build();
        Job newJob1 = bigQueryHelper.getBigQueryClient().create(JobInfo.of(extractConfig));
        Job completedJob1 =
            newJob1.waitFor(
                RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
                RetryOption.totalTimeout(Duration.ofMinutes(3)));
        if (completedJob1 != null && completedJob1.getStatus().getError() == null) {
          log.info("job done");
        } else {
          log.info("job has error");
        }
      } else {
        log.info("job has error");
      }
    } catch (InterruptedException e) {
      e.printStackTrace();
    }
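Note: the temporary destination table created for query results is only retained for a limited time (roughly 24 hours), so the extract job must run before it expires.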

3: This is the best way, and the one I wanted. It downloads/writes the result to a local file much faster, in around 20 sec. It uses the newer BigQuery Storage Read API, and you can get started with its Java client library via the link below:

https://cloud.google.com/bigquery/docs/reference/storage/libraries#client-libraries-install-java
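For completeness, a minimal sketch of reading a table with the Storage Read API, based on the client-library docs linked above. It needs the separate google-cloud-bigquerystorage dependency; the "project"/"dataset"/"table" ids are placeholders, and for query results you would point it at the temp table from destinationTableInfo above:

import com.google.cloud.bigquery.storage.v1.BigQueryReadClient;
import com.google.cloud.bigquery.storage.v1.CreateReadSessionRequest;
import com.google.cloud.bigquery.storage.v1.DataFormat;
import com.google.cloud.bigquery.storage.v1.ReadRowsRequest;
import com.google.cloud.bigquery.storage.v1.ReadRowsResponse;
import com.google.cloud.bigquery.storage.v1.ReadSession;

try (BigQueryReadClient client = BigQueryReadClient.create()) {
  // Placeholder ids; for query results, use the temp table ids instead.
  String srcTable = String.format(
      "projects/%s/datasets/%s/tables/%s", "project", "dataset", "table");

  CreateReadSessionRequest request =
      CreateReadSessionRequest.newBuilder()
          .setParent("projects/project")
          .setReadSession(
              ReadSession.newBuilder()
                  .setTable(srcTable)
                  .setDataFormat(DataFormat.AVRO)) // row batches arrive Avro-encoded
          .setMaxStreamCount(1) // raise this to read multiple streams in parallel
          .build();
  ReadSession session = client.createReadSession(request);

  // Pull every row batch from the single stream.
  ReadRowsRequest readRowsRequest =
      ReadRowsRequest.newBuilder()
          .setReadStream(session.getStreams(0).getName())
          .build();
  for (ReadRowsResponse response : client.readRowsCallable().call(readRowsRequest)) {
    // Decode getAvroRows().getSerializedBinaryRows() with the Avro schema in
    // session.getAvroSchema().getSchema(), or stream it straight to a local file.
    System.out.println("Received a batch of " + response.getRowCount() + " rows");
  }
}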
