
I have a set of 11 million documents in Elasticsearch, each with an array of identifiers. Each identifier is a dict that contains a type, value, and date. Here is a sample record:

{
  "name": "Bob",
  "identifiers": [
    {
      "date": "2019-01-01",
      "type": "a",
      "value": "abcd"
    },
    {
      "date": "2019-01-01",
      "type": "b",
      "value": "efgh"
    }
  ]
}

I need to transfer these records nightly to a Parquet data store where only the values of the identifiers are kept in an array, like this:

{
  "name": "Bob",
  "identifiers": ["abcd", "efgh"]
}

I am doing this by looping through all of the records and flattening the identifiers. Here is my flattening transformer:

    def _transform_identifier_values(self, identifiers: List[dict]):
        ret = [
            identifier['value']
            for identifier in identifiers
        ]
        return ret

This works but it's slow. Is there a quicker way to do this? Possibly a native implementation I can tap into?
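
For reference, the nightly flow looks roughly like this. This is only a simplified sketch, not the actual job: the elasticsearch-py scan helper and pyarrow are assumptions about tooling, export_index is a made-up name, and in practice the rows would be written in batches rather than held in memory all at once:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
import pyarrow as pa
import pyarrow.parquet as pq


def export_index(es: Elasticsearch, index: str, out_path: str) -> None:
    # export_index is a hypothetical helper, just for illustration.
    # Scan the whole index (scan, unlike a plain search, yields every document)
    # and flatten each document's identifiers down to their values.
    rows = []
    for hit in scan(es, index=index, query={'query': {'match_all': {}}}):
        source = hit['_source']
        rows.append({
            'name': source['name'],
            'identifiers': [i['value'] for i in source['identifiers']],
        })

    # Build an Arrow table from the flattened rows and write it out as Parquet.
    table = pa.Table.from_pylist(rows)  # requires pyarrow >= 7.0
    pq.write_table(table, out_path)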

EDIT:

Tried the suggestions from Sunny. I was surprised to find that the original actually performs the best. My assumption was that itemgetter would have been more performant.

Here is how I tested it:

import time
from functools import partial
from operator import itemgetter


def main():

    docs = []
    for i in range(10_000_000):
        docs.append({
            'name': 'Bob',
            'identifiers': [
                {
                    'date': '2019-01-01',
                    'type': 'a',
                    'value': 'abcd'
                },
                {
                    'date': '2019-01-01',
                    'type': 'b',
                    'value': 'efgh'
                }
            ]
        })

    start = time.time()
    for doc in docs:
        _transform_identifier_values_original(doc['identifiers'])
    end = time.time()

    print(f'Original took {end-start} seconds')

    start = time.time()
    for doc in docs:
        _transform_identifier_values_getter(doc['identifiers'])
    end = time.time()

    print(f'Item getter took {end-start} seconds')

    start = time.time()
    for doc in docs:
        _transform_identifier_values_partial_lambda(doc['identifiers'])
    end = time.time()

    print(f'Lambda partial took {end-start} seconds')

    start = time.time()
    for doc in docs:
        _transform_identifier_values_partial(doc['identifiers'])
    end = time.time()

    print(f'Partial took {end-start} seconds')


def _transform_identifier_values_original(identifiers):
    ret = [
        identifier['value']
        for identifier in identifiers
    ]
    return ret


def _transform_identifier_values_getter(identifiers):
    return list(map(itemgetter('value'), identifiers))


def _transform_identifier_values_partial_lambda(identifiers):
    flatten_ids = partial(lambda o: list(map(itemgetter('value'), o)))
    return flatten_ids(identifiers)


def _transform_identifier_values_partial(identifiers):
    flatten = partial(map, itemgetter('value'))
    return list(flatten(identifiers))

if __name__ == '__main__':
    main()

Results:

Original took 4.6204328536987305 seconds

Item getter took 7.186180114746094 seconds

Lambda partial took 10.534514904022217 seconds

Partial took 9.07079291343689 seconds
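
(Side note: time.time() around a full pass over 10 million dicts is fairly noisy; a timeit comparison of just the flattening step, sketched below on a single sample list, should give more stable numbers. The variable names here are placeholders.)

import timeit
from operator import itemgetter

# One sample identifiers list matching the records above.
identifiers = [
    {'date': '2019-01-01', 'type': 'a', 'value': 'abcd'},
    {'date': '2019-01-01', 'type': 'b', 'value': 'efgh'},
]

# Time just the flattening step and keep the best of several runs
# to reduce noise from everything else running on the machine.
comprehension = min(timeit.repeat(
    lambda: [i['value'] for i in identifiers], number=1_000_000, repeat=5))
getter = min(timeit.repeat(
    lambda: list(map(itemgetter('value'), identifiers)), number=1_000_000, repeat=5))
print(f'comprehension: {comprehension:.3f}s  itemgetter: {getter:.3f}s')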

micah
  • Take care, usually Elasticsearch doesn't return 100% of documents! – Wonka Dec 30 '19 at 17:49
  • @Wonka If you're using a scan it will return 100% of documents. A non-scan query will only return the top N documents. I'm scraping the entire index. – micah Dec 30 '19 at 18:17

2 Answers


Well, here is a solution I came up with:

def changeJSON(dictionary):
    new_dict = {'name': dictionary['name'], 'identifiers': []}
    for i in dictionary['identifiers']:
        new_dict['identifiers'].append(i['value'])
    return new_dict

This function takes in a single dictionary and returns the dictionary in your new desired format. You can then use the json.dump() function from the built-in json library, which takes the list of converted dictionaries and writes it out to a JSON file.
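
For example, a small sketch (the record is the sample from the question and output.json is a placeholder path):

import json

record = {
    'name': 'Bob',
    'identifiers': [
        {'date': '2019-01-01', 'type': 'a', 'value': 'abcd'},
        {'date': '2019-01-01', 'type': 'b', 'value': 'efgh'},
    ],
}

converted = [changeJSON(record)]  # a list with one converted record, for illustration
with open('output.json', 'w') as f:
    json.dump(converted, f)
# output.json now contains: [{"name": "Bob", "identifiers": ["abcd", "efgh"]}]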

SafeDev

You could try and make use of operator.itemgetter.

from operator import itemgetter
from typing import List

def _transform_identifier_values(self, identifiers: List[dict]):
    return list(map(itemgetter('value'), identifiers))

Maybe even make a partial function out of it:

from operator import itemgetter
from functools import partial
flatten_ids = partial(lambda o: list(map(itemgetter('value'), o['identifiers'])))
print(flatten_ids(obj))

If you want to avoid the lambda, you can try:

flatten = partial(map, itemgetter('value'))
print(list(flatten(obj['identifiers'])))
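
For example, with obj bound to the sample record from the question, both variants produce the same flat list:

obj = {
    'name': 'Bob',
    'identifiers': [
        {'date': '2019-01-01', 'type': 'a', 'value': 'abcd'},
        {'date': '2019-01-01', 'type': 'b', 'value': 'efgh'},
    ],
}

print(flatten_ids(obj))                   # ['abcd', 'efgh']
print(list(flatten(obj['identifiers'])))  # ['abcd', 'efgh']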

I'm curious to see the results of these.

Sunny Patel