
I have code that reads data from the database and puts it into a dictionary. My for loop over about 200,000 rows currently takes about 2 hours, and I am wondering what will happen once I have 2 million rows.

Here is my for loop example (sorry for the naming of the variables, this is just sample code):

# Get the data from the database
data_list = self.my_service.get_database_list()

my_dict_list = {}

for item in data_list:
    primary_key = item.primarykey
    value = item.name + item.address + item.age

    my_dict_list[primary_key] = value

This is my model/db get code:

def get_database_list(self):
    return self.session.query(
        self.mapper.primarykey,
        self.mapper.name,
        self.mapper.address,
        self.mapper.age,
        )

My database engine is InnoDB. Is there a way to optimize this, or to loop through the data faster? Thank you for sharing.

Bazinga
  • I'm guessing that the `my_service.get_database_list()` isn't returning an iterator? – Henrik Andersson Nov 25 '13 at 06:00
  • @limelights Hi, what do you mean by an iterator? – Bazinga Nov 25 '13 at 06:01
  • 2
    I think he means a generator - an object that can be iterated over item per item, without first having to build a huge list in-memory. That's your most likely culprit; the `for` loop looks about as tight as you can make it. – Tim Pietzcker Nov 25 '13 at 06:02
  • You can try `threading`. –  Nov 25 '13 at 06:02
  • What is `sys.getsizeof(item)`? Maybe your db has a hard time fetching and returning all your rows? – alko Nov 25 '13 at 06:03
  • @TimPietzcker Sorry, I am kind of new to Python. Would you mind showing some sample data? I also updated the question to show how it gets the data out of the database. – Bazinga Nov 25 '13 at 06:05
  • @e.campver I haven't tried the map function, because I don't know how to use it with my code. – Bazinga Nov 25 '13 at 06:06
  • 1
    And please add info on your db engine. What is self.mapper? – alko Nov 25 '13 at 06:07
  • Here's a fantastic explanation about `iterators`, i.e. the `yield` expression in python. http://stackoverflow.com/questions/231767/the-python-yield-keyword-explained – shengy Nov 25 '13 at 06:09
  • @alko I have included my db engine. It is declared in our settings, connecting to the database that is being used. – Bazinga Nov 25 '13 at 06:10
  • @Superman to see if it is using iterators rather than precomputing lists, try checking `type(self.my_service.get_database_list())`. If it is `list`, it is precomputing things and likely using a lot of memory; if it is `generator`, it isn't; if it ends in 'iterator', it *probably* isn't; if it is something else, then all bets are off without more information. It may also help to know which Python module you're using to communicate with your database, and how `self.mapper` and `self.session` are created. – lvc Nov 25 '13 at 06:32

3 Answers


First, I doubt your bottleneck (several hours) lies in the Python part. You can get some improvement with generators and dict comprehensions, but by how much? Here is a sample with 200,000 rows:

import base64
import os
def random_ascii_string(str_len):
    # Random URL-safe ASCII string of the requested length
    return base64.urlsafe_b64encode(os.urandom(3 * str_len))[0:str_len]

>>> data = [{'id': x, 'name': random_ascii_string(10), 'age':'%s' % x,
             'address': random_ascii_string(20)} for x in xrange(2*10**5)]

Your approach

>>> timeit.timeit("""
... from __main__ import data
... my_dict_list = {}
... for item in data:
...     my_dict_list[item['id']] = item['name'] + item['address'] + item['age']""",
...         number = 100)
16.727806467023015

Dict comprehension

>>> timeit.timeit("from __main__ import data; "
...    "my_dict_list = { d['id']: d['name']+d['address']+d['age'] for d in data}",
...     number = 100)
14.474646358685249

I doubt you can recover two hours with those optimisations. So your first task is to find the bottleneck. I advise you to have a look at the MySQL part of your job, and probably redesign it to:

  • use a separate InnoDB file per table (innodb_file_per_table)
  • use indexes if you retrieve only a smaller part of the data
  • do some evaluation on the db side, such as name + address + age (see the sketch after this list)
  • do not process the whole data set; retrieve only the part you need (e.g. the first few rows)
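A minimal sketch of the db-side evaluation idea, assuming the question's self.session/self.mapper names and that the service is built on SQLAlchemy (as the question's query code suggests). func.concat renders as MySQL's CONCAT, so each row arrives with the combined string already built:

from sqlalchemy import func

# Hypothetical variant of get_database_list: the concatenation of
# name + address + age happens inside MySQL rather than in the Python loop.
def get_database_list(self):
    return self.session.query(
        self.mapper.primarykey,
        func.concat(self.mapper.name, self.mapper.address, self.mapper.age),
    )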
alko

It's hard to just guess where your code spends the most time. The best thing to do is to run it using cProfile, and examine the results.

python -m cProfile -o prof <your_script> <args...>

This outputs a file named prof, which you can examine in various ways, the coolest of which is using runsnakerun.
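If you would rather stay in the terminal, the standard-library pstats module can read the same file; a minimal sketch:

import pstats

# Load the profile written by cProfile and print the 10 entries
# with the highest cumulative time.
stats = pstats.Stats('prof')
stats.sort_stats('cumulative').print_stats(10)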

Other than that, off the top of my head, a dict comprehension is often faster than the alternatives:

my_dict_list = { item.primarykey: item.name + item.address + item.age for item in data_list }

Also, it is not exactly clear what item.name + item.address + item.age does (are they all strings?), but if you can change your data structure and store item itself instead of that combined value, it might help further.
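For instance, a minimal sketch of that alternative:

# Key each row object by its primary key; build the combined string
# later, only for the rows that actually need it.
my_dict_list = { item.primarykey: item for item in data_list }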

shx2

Agreed with the comments above on iterators; a streaming sketch follows. You could also try a dictionary comprehension in place of the loop, as in the timing comparison below.
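A minimal sketch of the iterator idea, assuming the question's service is built on SQLAlchemy (as its session/query code suggests); Query.yield_per streams rows in batches instead of materializing the whole result set in memory:

def get_database_list(self):
    # Hypothetical variant: fetch rows 1000 at a time rather than
    # loading all 200,000+ rows into memory before the loop starts.
    return self.session.query(
        self.mapper.primarykey,
        self.mapper.name,
        self.mapper.address,
        self.mapper.age,
    ).yield_per(1000)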

import uuid
import time

# Stand-in for a database row object
class mock:
    def __init__(self):
        self.name = "foo"
        self.address = "address"
        self.age = "age"
        self.primarykey = uuid.uuid4()

# Build 2,000,000 mock rows
data_list = [mock() for x in range(2000000)]

# Time the explicit for loop
my_dict_list = {}
t1 = time.time()
for item in data_list:
    primary_key = item.primarykey
    value = item.name + item.address + item.age
    my_dict_list[primary_key] = value
print(time.time() - t1)

# Time the equivalent dict comprehension
t2 = time.time()
new_dict = { item.primarykey: item.name + item.address + item.age for item in data_list }
print(time.time() - t2)
Tim Wilder