4

Currently streaming Change Data Capture events from MongoDB into snowflake, would like to apply them to the raw data that is already there.

Let's say I have a table like this:

+---------------------+-----------------+-----------+
|         key         |      value      | document  |
+---------------------+-----------------+-----------+
| foo.bar             | "changed value" | <variant> |
| foo.stuff.anArray.1 | 1000            | <variant> |
| ...                 | ...             | ...       |
+---------------------+-----------------+-----------+

Where variant contains a very heavily nested JSON ex:

{
    "foo": {
        "bar": "Some info",
        "baz": "Other info",
        "stuff": {
            "anArray": [1, 2, 3],
            "things": "More nested info"
        }
    }
}

I would like to use OBJECT_DELETE and OBJECT_INSERT functions to update this nested variant data in snowflake.

Tried making a js UDF but eval() is not supported.

Other approaches like writing a UDF that does key.split(".") and then recursively walking the structure and updating the field seem to take a long time and fail with JavaScript out of memory error: UDF thread memory limit exceeded in some cases.

Looking for a bit more efficient way to solve this issue.

Unnat
  • 398
  • 2
  • 12
ElvisElvis
  • 160
  • 1
  • 12
  • Can you be more specific/examples on how the object would be updated (in terms of data)? – Felipe Hoffa Mar 05 '21 at 20:59
  • 1
    @FelipeHoffa Sure, the key mentioned in the table is the path in json. In the first row "foo.bar" would update the "Some info" value in the json to "changed value" and in the second row the expected values in "anArray" would be `[1, 1000, 3]`. I the Idea was for it to function the same as `OBJECT_INSERT` but with the ability to go deeper into json. – ElvisElvis Mar 08 '21 at 09:20
  • do you mean, can someone write snowflake SQL to handling this: https://docs.mongodb.com/manual/reference/change-events/#change-stream-output – Simeon Pilgrim Mar 08 '21 at 21:10
  • I really want to help with this - I'm just having trouble understanding exactly what you want. – Felipe Hoffa Mar 09 '21 at 06:12
  • @SimeonPilgrim Yes indeed the idea is to apply the changes coming in the change stream to data already in the warehouse. The `variant` is the existing data and the `key, value` is the parsed information coming from mongodb CDC. – ElvisElvis Mar 09 '21 at 12:42
  • @FelipeHoffa Are you familiar how paths are notated in mongo db? Because the idea is simple just like the answer from Sergiu but in a more programmatic/dynamic way. Where the paths (keys) and the new values are not determined beforehand but come in as data from mongo cdc. They need to be applied to existing raw data that is stored in a variant form. – ElvisElvis Mar 09 '21 at 12:46

2 Answers2

1

I've faced a similar problem and used a generic UDF to solve it. Here is an example of a UDF implementation that will solve what you need:

create or replace function edit_nested_entity("variant_object" variant, "path" string, "value" string)
returns variant
language javascript
as
$$
// https://stackoverflow.com/questions/6491463/accessing-nested-javascript-objects-and-arrays-by-string-path?page=1&tab=votes#tab-top
    Object.byString = function(o, s) {
        s = s.replace(/\[(\w+)\]/g, '.$1'); // convert indexes to properties
        s = s.replace(/^\./, '');           // strip a leading dot
        var a = s.split('.');
        for (var i = 0, n = a.length; i < n; ++i) {
            var k = a[i];
            if (k in o) {
                o = o[k];
            } else {
                return;
            }
        }
        return o;
   }
   // get the entity base
   nested_entity = Object.byString(variant_object, path)
   // update the value
   nested_entity = value
   return variant_object;
$$;

And now you'll need to run the following SQL command to achieve what you need:

UPDATE t1
SET document = edit_nested_entity(document, key, value) 

You'll maybe do some fine-tuning for this UDF to be more generic (or to use different UDF for different data types), but this will work.

Shahar Glazner
  • 377
  • 4
  • 10
0

There is a way using OBJECT_INSERT but it's not pretty. Unfortunately I don't see a way to specify a nested key in a single OBJECT_INSERT. So:

create or replace table test2 (document variant);
insert into test2 select object_construct('foo',object_construct('bar','Some info', 'baz', 'Other info','stuff', object_construct('anArray', array_construct(1, 2, 3), 'things', 'More nested info')));
select * from test2;

I get:

{
      "foo": {
                "bar": "Some info",
                "baz": "Other info",
                "stuff": {
                          "anArray": [1,2,3],
                          "things": "More nested info"
                }
      }

}

Now, I want to update foo.bar with "Changed info" so I can do (remember to set the flag to TRUE so you get an update rather than insert):

update test2 set document = OBJECT_INSERT(document, 'foo', OBJECT_INSERT(document:foo::VARIANT, 'bar', 'Changed value', TRUE), TRUE) WHERE document:foo.bar::VARCHAR = 'Some info';

I get back:

{
          "foo": {
                    "bar": "Changed value",
                    "baz": "Other info",
                    "stuff": {
                              "anArray": [1,2,3],
                              "things": "More nested info"
                    }
          }
}

You can also use Javascript UDF as mentioned here.

Sergiu
  • 4,039
  • 1
  • 13
  • 21
  • Yeah, this works when we have specific keys to update, but unfortunately the keys are dynamic as are the values and variants. – ElvisElvis Mar 09 '21 at 12:40
  • 1
    If the keys are dynamic, can't you use a Javascript stored procedure and inject the keys in such an UPDATE statement? I imagine you don't want to rewrite the UPDATE statement every time.The only challenge I see is indeed how deep is the nested JSON, as you would need an OBJECT_INSERT per level, which is not very nice either. – Sergiu Mar 09 '21 at 12:48
  • Yeah that is the issue, I have a JS UDF for that, but sometimes it fails with `JavaScript out of memory error: UDF thread memory limit exceeded` so the execution is not consistent – ElvisElvis Mar 09 '21 at 12:52
  • 1
    Would you be able to post a sample of your JS UDF to see if we can recommend any improvement? – Sergiu Mar 09 '21 at 13:00
  • I've added a sample of UDF that uses [another concept](https://stackoverflow.com/questions/6491463/accessing-nested-javascript-objects-and-arrays-by-string-path?page=1&tab=votes#tab-top) to achieve dynamic nested variant update, and that's what we actually use in our system so you can make it work for you. – Shahar Glazner Mar 15 '21 at 07:59