1

I currently have a function that is supposed to update every dictionary key where the value is the old _id to be the value of a new uuid.

This function works as intended but isn't efficient and takes too long to execute with my current dataset.

What has to happen:

  1. Each dict['_id']['$oid'] has to be changed to a new uuid.
  2. Every $oid that matches the old value must be changed to new uuid.

The problem is that there are many $oid keys in every dictionary and I want to make sure this works correctly while being more efficient

As of right now the script takes 45-55 seconds to execute with my current dataset and it's making 45 million calls to the recursive function recursive_id_replace

How can I optimize the current function to get this script to run faster?

The data gets passed into this functions as a dictionary that looks like:

file_data = {
    'some_file_name.json': [{...}],
    # ...
}

Here's an example of one of the dictionaries in the dataset:

[
  {
    "_id": {
      "$oid": "4e0286ed2f1d40f78a037c41"
    },
    "code": "HOME",
    "name": "Home Address",
    "description": "Home Address",
    "entity_id": {
      "$oid": "58f4eb19736c128b640d5844"
    },
    "is_master": true,
    "is_active": true,
    "company_id": {
      "$oid": "591082232801952100d9f0c7"
    },
    "created_at": {
      "$date": "2017-05-08T14:35:19.958Z"
    },
    "created_by": {
      "$oid": "590b7fd32801952100d9f051"
    },
    "updated_at": {
      "$date": "2017-05-08T14:35:19.958Z"
    },
    "updated_by": {
      "$oid": "590b7fd32801952100d9f051"
    }
  },
  {
    "_id": {
      "$oid": "01c593700e704f29be1e3e23"
    },
    "code": "MAIL",
    "name": "Mailing Address",
    "description": "Mailing Address",
    "entity_id": {
      "$oid": "58f4eb1b736c128b640d5845"
    },
    "is_master": true,
    "is_active": true,
    "company_id": {
      "$oid": "591082232801952100d9f0c7"
    },
    "created_at": {
      "$date": "2017-05-08T14:35:19.980Z"
    },
    "created_by": {
      "$oid": "590b7fd32801952100d9f051"
    },
    "updated_at": {
      "$date": "2017-05-08T14:35:19.980Z"
    },
    "updated_by": {
      "$oid": "590b7fd32801952100d9f051"
    }
  }
]

Heres the function:

def id_replace(_ids: set, uuids: list, file_data: dict) -> dict:
    """ Replaces all old _id's with new uuid. 
        checks all data keys for old referenced _id and updates it to new uuid value.
    """

    def recursive_id_replace(prev_val, new_val, data: Any):
        """
        Replaces _ids recursively.
        """
        data_type = type(data)

        if data_type == list:
            for item in data:
                recursive_id_replace(prev_val, new_val, item)

        if data_type == dict:
            for key, val in data.items():

                val_type = type(val)

                if key == '$oid' and val == prev_val:
                    data[key] = new_val

                elif val_type == dict:
                    recursive_id_replace(prev_val, new_val, val)

                elif val_type == list:
                    for item in val:
                        recursive_id_replace(prev_val, new_val, item)

    for i, _id in enumerate(_ids):
        recursive_id_replace(_id, uuids[i], file_data)

    return file_data
Sean Parsons
  • 724
  • 4
  • 12
  • 1
    Does this help? https://stackoverflow.com/questions/45253984/how-to-replace-all-instances-of-a-keyword-in-arbitrarily-structured-data/45254036#45254036 – cs95 Jul 22 '17 at 18:53
  • Suggest you first profile your code to determine where to try optimizing it. See [**_How can you profile a script?_**](https://stackoverflow.com/questions/582336/how-can-you-profile-a-script) – martineau Jul 22 '17 at 19:08
  • @martineau I've profilied this script already, that's how I know the function is being called 45 million. – Sean Parsons Jul 22 '17 at 19:35
  • @cᴏʟᴅsᴘᴇᴇᴅ I don't know if it's the best way. Although it's a different way of solving this problem and i'm going to attempt to tweak the code to deal with this problem as strings just like in your answer in the link. – Sean Parsons Jul 22 '17 at 19:36
  • It's possible the overhead of recursion is the issue, but that's not all the function does. There's also a module called the [`line_profiler`](https://pypi.python.org/pypi/line_profiler/2.0) that will give you more insight into the inner workings function's code, which I also suggest using, it's fairly easy to use. Also, by inspection I think I see something _very_ inefficient (as well as being a potential bug) going on, but need to run the code to make sure. Could you add a simple testcase that has a minimal number of `_ids` and `uuids` that correspond to the what in the sample json data? – martineau Jul 22 '17 at 19:41
  • Another question: Why are you passing the `file_data` argument supplied to `id_replace()` on to `recursive_id_replace()`? The function doesn't do anything with it except pass it to itself (recursively)? – martineau Jul 22 '17 at 19:51
  • 1
    @cᴏʟᴅsᴘᴇᴇᴅ I tried the way you showed in that ticket by marshalling data into a str and using .replace and I gained a 50X speed up in execution. I was getting between 45-55 seconds and now the script executes in 1 second. If you want to create an answer, I'll make sure to select it and upvote it as well. – Sean Parsons Jul 23 '17 at 03:30
  • 1
    Alright done :) – cs95 Jul 23 '17 at 04:07

1 Answers1

1

Here's one technique that involves stringifying your data with json and then doing the replacement.

In [415]: old_uid = "590b7fd32801952100d9f051"

In [416]: new_uid = "test123"

In [417]: json.loads(json.dumps(data).replace('"$oid": "%s"' %old_uid, '"$oid": "%s"' %new_uid))
Out[417]: 
[{'_id': {'$oid': '4e0286ed2f1d40f78a037c41'},
  'code': 'HOME',
  'company_id': {'$oid': '591082232801952100d9f0c7'},
  'created_at': {'$date': '2017-05-08T14:35:19.958Z'},
  'created_by': {'$oid': 'test123'},
  'description': 'Home Address',
  'entity_id': {'$oid': '58f4eb19736c128b640d5844'},
  'is_active': True,
  'is_master': True,
  'name': 'Home Address',
  'updated_at': {'$date': '2017-05-08T14:35:19.958Z'},
  'updated_by': {'$oid': 'test123'}},
 {'_id': {'$oid': '01c593700e704f29be1e3e23'},
  'code': 'MAIL',
  'company_id': {'$oid': '591082232801952100d9f0c7'},
  'created_at': {'$date': '2017-05-08T14:35:19.980Z'},
  'created_by': {'$oid': 'test123'},
  'description': 'Mailing Address',
  'entity_id': {'$oid': '58f4eb1b736c128b640d5845'},
  'is_active': True,
  'is_master': True,
  'name': 'Mailing Address',
  'updated_at': {'$date': '2017-05-08T14:35:19.980Z'},
  'updated_by': {'$oid': 'test123'}}]

Just remember, only works with lists and dictionaries. Other data structures will either be implicitly converted or an error thrown.

More methods outlined here.

cs95
  • 379,657
  • 97
  • 704
  • 746