-1

I have a list of dictionaries. From each dictionary, I want to extract the values for a subset of keys that I saved in a list beforehand. I can do this with a for loop, but the list contains 15,504,603 items, so it takes a very long time to process. I am looking for faster alternatives.

My list of dictionaries (in reality it is a query_set.QuerySet):

data = [
{'name': 'Alex', 'employee_id': 1110, 'age': 38, 'rank': 'CEO', 'salary': 'unknown'},
{'name': 'Monty', 'employee_id': 1111, 'age': 33, 'rank': 'EO', 'salary': 2400},
{'name': 'John', 'employee_id': 1114, 'age': 32, 'rank': 'EO', 'salary': 2200},
{'name': 'Max', 'employee_id': 1120, 'age': 26, 'rank': 'OA', 'salary': 1200},
{'name': 'Ginee', 'employee_id': 1130, 'age': 28, 'rank': 'OA', 'salary': 1200},
{'name': 'Adam', 'employee_id': None, 'age': 18, 'rank': 'summer_intern', 'salary': None}
]

The keys I want to extract are 'name', 'age' and 'rank', so I put them in a list beforehand:

info = ['name', 'age', 'rank']

I can do the task with a for loop:

result = []
result.append(info)
for i in range(len(data)):
    output = [data[i][x] for x in info]
    result.append(output)

and finally

for item in result:
    print("\t".join(map(str,(item))))

and the result looks like:

name    age     rank
Alex    38      CEO
Monty   33      EO
John    32      EO
Max     26      OA
Ginee   28      OA
Adam    18      summer_intern

In reality, my list contains 15,504,603 dictionaries with 43 key-value pairs each, and it takes a very long time to process: only 22,661 of the 15,504,603 were done after about 2 hours of running.

What would be an ideal, time-saving way of doing this?

gmds
Ahsan

3 Answers

0

If you would like to use pandas:

import pandas as pd
df = pd.DataFrame(data)
df1 = df.loc[:,['name', 'age', 'rank']]
Vasu Devan
  • It worked. Additionally, I used `df = pd.DataFrame(list(queryset))` because my data is a `QuerySet` ([found here](https://stackoverflow.com/a/55055351/9960542)). However, do you have any idea how to add a progress bar, e.g. with `tqdm`, for this operation? – Ahsan May 14 '19 at 07:33
  • I guess this answer https://stackoverflow.com/a/34365537/5684634 might help you. Appreciate if you can upvote my answer too. Two cheers if you can accept my answer. – Vasu Devan May 14 '19 at 07:52
  • I am a new user, so I cannot upvote yet. :( I had already found https://stackoverflow.com/a/34365537/5684634 myself, but it does not help a newbie like me. :( – Ahsan May 14 '19 at 12:50
0

Try operator.itemgetter:

import operator

list(map(operator.itemgetter(*info), data))

Output:

[('Alex', 38, 'CEO'),
 ('Monty', 33, 'EO'),
 ('John', 32, 'EO'),
 ('Max', 26, 'OA'),
 ('Ginee', 28, 'OA'),
 ('Adam', 18, 'summer_intern')]

This is about 5 times faster than the original loop:

test = data * 10000
# 60,000 dicts

%%timeit

result = []
result.append(info)
for i in range(len(test)):
    output = [test[i][x] for x in info]
    result.append(output)
# 36.6 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit list(map(operator.itemgetter(*info), test))
# 6.92 ms ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
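
Outside IPython, where `%timeit` is not available, the comparison can be approximated with the standard-library `timeit` module. The following is a minimal sketch, not a reproduction of the measurements above: the helper names `with_loop` and `with_itemgetter` are made up for illustration, the sample is shortened, and the timings will vary by machine.

```python
import operator
import timeit

# Shortened sample in the shape of the question's data.
data = [
    {'name': 'Alex', 'age': 38, 'rank': 'CEO', 'salary': 'unknown'},
    {'name': 'Monty', 'age': 33, 'rank': 'EO', 'salary': 2400},
]
info = ['name', 'age', 'rank']
test = data * 10000  # 20,000 dicts, smaller than the benchmark above


def with_loop(records, keys):
    """Index-based loop from the question, without the header row."""
    result = []
    for i in range(len(records)):
        result.append([records[i][k] for k in keys])
    return result


def with_itemgetter(records, keys):
    """One itemgetter call per record; produces tuples, not lists."""
    return list(map(operator.itemgetter(*keys), records))


# Run each variant a few times and report the total wall-clock time.
loop_time = timeit.timeit(lambda: with_loop(test, info), number=5)
getter_time = timeit.timeit(lambda: with_itemgetter(test, info), number=5)
print(f"loop: {loop_time:.3f}s  itemgetter: {getter_time:.3f}s")
```

One detail to keep in mind: with a single key, `operator.itemgetter` returns the bare value rather than a 1-tuple, so code that expects a sequence per row needs at least two keys.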
Chris
  • This one worked perfectly, along with the [answer from](https://stackoverflow.com/a/56122863/9960542) @Vasu Devan. However, in a for loop I can add a progress bar with `tqdm`; how should I add one in this case? – Ahsan May 14 '19 at 07:37
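
One possible way to get a progress bar with the `map` approach is to wrap the input iterable instead of the loop. This is only a sketch under the assumption that the third-party `tqdm` package is installed; the fallback function exists purely so the snippet still runs without it, and the sample data is shortened.

```python
import operator

try:
    from tqdm import tqdm  # third-party package, assumed to be installed
except ImportError:
    def tqdm(iterable, **kwargs):
        """Fallback: no progress bar, just pass the iterable through."""
        return iterable

# Shortened sample in the shape of the question's data.
data = [
    {'name': 'Alex', 'employee_id': 1110, 'age': 38, 'rank': 'CEO'},
    {'name': 'Monty', 'employee_id': 1111, 'age': 33, 'rank': 'EO'},
    {'name': 'Adam', 'employee_id': None, 'age': 18, 'rank': 'summer_intern'},
]
info = ['name', 'age', 'rank']

getter = operator.itemgetter(*info)

# tqdm advances each time map() pulls the next record from it;
# total= lets it show a percentage for a sized input.
rows = list(map(getter, tqdm(data, total=len(data))))
```

Because the bar is attached to the iterable, the same wrapping should work for any consumer that iterates over `data`, not only `map`.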
0

What's making your code slow is mainly the fact that you're building a huge, memory-hogging list just to be iterated over. You should directly print the output line by line as you iterate over the list of dicts instead:

print(*info, sep='\t')
for record in data:
    print(*(record[key] for key in info), sep='\t')
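
The same streaming idea can be combined with `operator.itemgetter` and the standard-library `csv` module, so that rows go straight to a file-like object without an intermediate list. This is a sketch rather than part of the answer above: `write_tsv` is a hypothetical helper name, and it assumes at least two keys, since `itemgetter` returns a bare value for a single key.

```python
import csv
import io
import operator


def write_tsv(records, keys, out):
    """Stream the selected keys of each record to `out` as tab-separated rows."""
    writer = csv.writer(out, delimiter='\t', lineterminator='\n')
    writer.writerow(keys)  # header row
    # map() is lazy, so only one row is held in memory at a time.
    writer.writerows(map(operator.itemgetter(*keys), records))


# Shortened sample in the shape of the question's data.
data = [
    {'name': 'Alex', 'employee_id': 1110, 'age': 38, 'rank': 'CEO'},
    {'name': 'Adam', 'employee_id': None, 'age': 18, 'rank': 'summer_intern'},
]
info = ['name', 'age', 'rank']

buffer = io.StringIO()  # stands in for sys.stdout or an open file
write_tsv(data, info, buffer)
output = buffer.getvalue()
print(output, end='')
```

Note that `csv.writer` writes `None` as an empty field, whereas `print` would show the text `None`. To write to disk, an open file (opened with `newline=''`) can be passed in place of `buffer`.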
blhsing