
I am having trouble joining on a repeated nested field while still preserving the original row structure in BigQuery.

For my example I'll call the two tables being joined A and B.

Records in table A look something like:

{
  "url":"some url",
  "repeated_nested": [
    {"key":"some key","property":"some property"}
  ]
}

and records in table B look something like:

{
  "key":"some key",
  "property2": "another property"
}

I am hoping to find a way to join this data together to generate a row that looks like:

{
  "url":"some url",
  "repeated_nested": [
    {
      "key":"some key",
      "property":"some property",
      "property2":"another property"
    }
  ]
}

The very first query I tried was:

SELECT 
  url, repeated_nested.key, repeated_nested.property, repeated_nested.property2
FROM A
  AS lefttable
LEFT OUTER JOIN B
  AS righttable
  ON lefttable.repeated_nested.key = righttable.key

This doesn't work because BQ can't join on repeated nested fields, and there is no unique identifier for each row. If I were to FLATTEN on repeated_nested, I'm not sure how to get the original rows put back together correctly.
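For reference, the flattened version of the join would look roughly like this (a sketch in legacy SQL using the example names above, with flat_a just an alias I've made up for the flattened table; I haven't verified the exact syntax):

SELECT
  flat_a.url,
  flat_a.repeated_nested.key,
  flat_a.repeated_nested.property,
  righttable.property2
FROM FLATTEN(A, repeated_nested) AS flat_a
LEFT OUTER JOIN B
  AS righttable
  ON flat_a.repeated_nested.key = righttable.key

Each element of repeated_nested becomes its own output row there, so the one-row-per-url structure I need to preserve is lost.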

The data is such that a given url always appears with the same repeated_nested value. Because of that, I was able to make a workaround using a UDF that rolls this repeated nested object up into a JSON string and then unrolls it again:

SELECT url, repeated_nested.key, repeated_nested.property, repeated_nested.property2

FROM
JS(
   (
    SELECT basetable.url as url, repeated_nested_json
    FROM A as basetable

    LEFT JOIN (
      SELECT url, CONCAT("[", GROUP_CONCAT_UNQUOTED(repeated_nested_json, ","), "]") as repeated_nested_json
      FROM
      (
        SELECT
          url,
          CONCAT(
            '{"key": "', repeated_nested.key, '",',
            ' "property": "', repeated_nested.property, '",',
            ' "property2": "', mapping_table.property2, '"',
            '}'
          ) as repeated_nested_json
        FROM (
          SELECT
            url, repeated_nested.key, repeated_nested.property
          FROM A
          GROUP BY url, repeated_nested.key, repeated_nested.property
        ) as urltable

         LEFT OUTER JOIN B
            AS mapping_table
            ON urltable.repeated_nested.key=mapping_table.key
      )
      GROUP BY url
    ) as companytable
    ON basetable.url = companytable.url
  ),

  // input columns:
  url, repeated_nested_json,

  // output schema:
  "[{'name': 'url', 'type': 'string'},
   {'name': 'repeated_nested', 'type': 'RECORD', 'mode':'REPEATED', 'fields':
    [ { 'name': 'key', 'type':'string' },
      { 'name': 'property', 'type':'string' },
      { 'name': 'property2', 'type':'string' }]
   }]",

  // UDF:
  "function(row, emit) {
    var parsed_repeated_nested = [];
    try {
      if ( row.repeated_nested_json != null ) {
          parsed_repeated_nested = JSON.parse(row.repeated_nested_json);
      }
    } catch (ex) { }

    emit({
      url: row.url,
      repeated_nested: parsed_repeated_nested
    });
  }"
)

This solution works fine for small tables. But the real-life tables I'm working with have many more columns than the example above, and when there are fields other than url and repeated_nested_json they all have to be passed through the UDF. Tables around the 50 GB range are fine, but when I apply the UDF and query to tables in the 500-1000 GB range, I get an Internal Server Error from BQ.

In the end I just need all of the data in newline-delimited JSON format in GCS. As a last-ditch effort I tried concatenating all of the fields into a JSON string (so that I only had one column), in the hopes that I could export it as CSV and have what I need. However, the export process escapes the double quotes and adds double quotes around the JSON string. According to the BQ docs on jobs (https://cloud.google.com/bigquery/docs/reference/v2/jobs) there is a property, configuration.query.tableDefinitions.(key).csvOptions.quote, that could help me, but I can't figure out how to make it work.

Does anybody have advice on how they have dealt with this sort of situation?

Andrew Backes

2 Answers


I have never had to do this, but you should be able to use flatten, then join, then use nest to get repeated fields again.

The docs state that BigQuery always flattens query results, but that appears to be false: you can choose to not have results flattened if you set a destination table. You should then be able to export that table as JSON to Storage.

See also this answer for how to get nest to work.
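Something along these lines might work (an untested sketch in legacy SQL, using the field names from the question; the aliases like rn_key and keys are just placeholders). NEST is an aggregation function, so it needs a GROUP BY, and the query would have to write to a destination table with result flattening disabled so the repeated fields survive:

SELECT
  url,
  NEST(rn_key) AS keys,
  NEST(rn_property) AS properties,
  NEST(property2) AS property2s
FROM (
  SELECT
    a.url AS url,
    a.repeated_nested.key AS rn_key,
    a.repeated_nested.property AS rn_property,
    b.property2 AS property2
  FROM FLATTEN(A, repeated_nested) AS a
  LEFT OUTER JOIN B AS b
    ON a.repeated_nested.key = b.key
)
GROUP BY url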

oulenz
  • One tricky detail: There's no clean way to use `NEST` to get back a repeated record--you can only use `NEST` on the leaf fields to get individual repeated records back. To work around this, you might consider smooshing the record into a string (as mentioned in the question), using `NEST` to turn it back into a repeated field, and then using a UDF to unpack the record. – Jeremy Condit Mar 11 '16 at 16:59
  • BTW, we have improvements to our SQL dialect coming soon that will make all of this a lot easier. Follow https://code.google.com/p/google-bigquery/issues/detail?id=448 to get notified. – Jeremy Condit Mar 11 '16 at 17:00
  • Also, `NEST` has to be used with a `GROUP BY`. In my case, I haven't been able to concoct a field to use a `GROUP BY` on. – Andrew Backes Mar 11 '16 at 18:23
  • Jeremy - The pack/unpack json with a UDF method is actually the closest thing I have to a working solution. However, when I use it on my 500gb-1000gb table BQ responds with: "Internal Server Error" – Andrew Backes Mar 11 '16 at 18:27
  • Andrew, what about `url`? If that is not unique per row, you could (before you `flatten`) create a unique column using `row_number() over ()`. – oulenz Mar 12 '16 at 13:59
  • I'm looking into the UDF issue. We've gotten a few reports of UDFs OOMing on large input tables; your case is odd in that you're getting an internal error instead of an OOM or timeout. I'm trying to repro now. – thomaspark Mar 15 '16 at 20:11
  • I reran the query to repro and got a timeout, which makes more sense. I had to remove a LEFT JOIN from the input query to the JS function to get it to run since one of the tables has been dropped; the input query without the join produces over 500MM rows, which may run afoul of the time limits in https://cloud.google.com/bigquery/user-defined-functions#limits If you want to regenerate the filter table, I can retry. – thomaspark Mar 15 '16 at 20:35

@AndrewBackes - we rolled out some fixes for UDF memory-related issues this week; there are some details on the root cause here https://stackoverflow.com/a/36275462/5265394 and here https://stackoverflow.com/a/35563562/5265394.

The UDF version of your query is now working for me; could you verify on your side?

thomaspark