I have a set of 11 million documents in Elasticsearch, each with an array of identifiers. Each identifier is a dict that contains a type, a value, and a date. Here is a sample record:
{
  "name": "Bob",
  "identifiers": [
    {
      "date": "2019-01-01",
      "type": "a",
      "value": "abcd"
    },
    {
      "date": "2019-01-01",
      "type": "b",
      "value": "efgh"
    }
  ]
}
I need to transfer these records nightly to a Parquet data store where only the values of the identifiers are kept, as an array. For example:
{
  "name": "Bob",
  "identifiers": ["abcd", "efgh"]
}
I am doing this by looping through all of the records and flattening the identifiers. Here is my flattening transformer:
def _transform_identifier_values(self, identifiers: List[dict]):
    ret = [
        identifier['value']
        for identifier in identifiers
    ]
    return ret
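To make the expected input and output concrete, here is the same transformation written as a standalone function and applied to the sample record. This is only an illustrative sketch: the name `flatten_identifier_values` and the `record` variable are made up for this example and are not part of my actual pipeline.

```python
from typing import List


def flatten_identifier_values(identifiers: List[dict]) -> List[str]:
    # Keep only the 'value' field of each identifier dict.
    return [identifier['value'] for identifier in identifiers]


# The sample record from above, as a Python dict.
record = {
    'name': 'Bob',
    'identifiers': [
        {'date': '2019-01-01', 'type': 'a', 'value': 'abcd'},
        {'date': '2019-01-01', 'type': 'b', 'value': 'efgh'},
    ],
}

flattened = {
    'name': record['name'],
    'identifiers': flatten_identifier_values(record['identifiers']),
}
print(flattened)  # {'name': 'Bob', 'identifiers': ['abcd', 'efgh']}
```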
This works, but it's slow. Is there a quicker way to do this? Possibly a native implementation I can tap into?
EDIT:
Tried the suggestions from Sunny. I was surprised to find that the original actually performs the best. My assumption was that itemgetter would have been more performant.
Here is how I tested it:
import time
from functools import partial
from operator import itemgetter


def main():
    # Build 10 million copies of the sample record.
    docs = []
    for i in range(10_000_000):
        docs.append({
            'name': 'Bob',
            'identifiers': [
                {
                    'date': '2019-01-01',
                    'type': 'a',
                    'value': 'abcd'
                },
                {
                    'date': '2019-01-01',
                    'type': 'b',
                    'value': 'efgh'
                }
            ]
        })

    start = time.time()
    for doc in docs:
        _transform_identifier_values_original(doc['identifiers'])
    end = time.time()
    print(f'Original took {end-start} seconds')

    start = time.time()
    for doc in docs:
        _transform_identifier_values_getter(doc['identifiers'])
    end = time.time()
    print(f'Item getter took {end-start} seconds')

    start = time.time()
    for doc in docs:
        _transform_identifier_values_partial_lambda(doc['identifiers'])
    end = time.time()
    print(f'Lambda partial took {end-start} seconds')

    start = time.time()
    for doc in docs:
        _transform_identifier_values_partial(doc['identifiers'])
    end = time.time()
    print(f'Partial took {end-start} seconds')


def _transform_identifier_values_original(identifiers):
    ret = [
        identifier['value']
        for identifier in identifiers
    ]
    return ret


def _transform_identifier_values_getter(identifiers):
    return list(map(itemgetter('value'), identifiers))


def _transform_identifier_values_partial_lambda(identifiers):
    flatten_ids = partial(lambda o: list(map(itemgetter('value'), o)))
    return flatten_ids(identifiers)


def _transform_identifier_values_partial(identifiers):
    flatten = partial(map, itemgetter('value'))
    return list(flatten(identifiers))


if __name__ == '__main__':
    main()
Results:
Original took 4.6204328536987305 seconds
Item getter took 7.186180114746094 seconds
Lambda partial took 10.534514904022217 seconds
Partial took 9.07079291343689 seconds
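One caveat about the test above: the itemgetter and partial variants build a new `itemgetter('value')` (and a new `partial`) on every call, so with only two identifiers per document that setup cost may account for much of the difference. A fairer comparison would probably create the getter once and reuse it. Below is a minimal sketch of such a comparison using `timeit`; the names `GET_VALUE`, `flatten_comprehension`, `flatten_hoisted_getter`, and `compare` are made up for illustration, and I have not included timings for it.

```python
import timeit
from operator import itemgetter

# Assumption for this sketch: build the getter once at import time
# instead of once per call.
GET_VALUE = itemgetter('value')


def flatten_comprehension(identifiers):
    # Same logic as the original list comprehension.
    return [identifier['value'] for identifier in identifiers]


def flatten_hoisted_getter(identifiers):
    # map() with the pre-built getter; no per-call object construction.
    return list(map(GET_VALUE, identifiers))


identifiers = [
    {'date': '2019-01-01', 'type': 'a', 'value': 'abcd'},
    {'date': '2019-01-01', 'type': 'b', 'value': 'efgh'},
]


def compare(number=1_000_000):
    # Time each variant on the same two-element identifier list.
    for func in (flatten_comprehension, flatten_hoisted_getter):
        elapsed = timeit.timeit(lambda: func(identifiers), number=number)
        print(f'{func.__name__} took {elapsed:.3f} seconds')


if __name__ == '__main__':
    compare()
```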