
I have a database containing employee information for 4000 employees. Each employee has a unique identification number. I fetch the information for each employee from the database with a Python script. For 1 employee, fetching the info takes about 1 second, so for 4000 employees that's 4000 seconds (about 67 minutes; who wants to wait that long?). The employee info should end up in a dictionary in the following format:

infos = {"ID1": ["info for employee 1"], "ID2": ["info for employee 2"], ... }

I'm thinking of doing the following to reduce the execution time:

  1. Get employee id numbers
  2. Divide the id numbers into 10 groups
  3. Start 10 threads simultaneously
  4. Make each thread take one of the 10 ID groups and fetch those employees' info from the database into its own dictionary
  5. In the end, combine those 10 dictionaries (rough sketch below)
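
Something like this, as a sketch only (the connection details, table name, and column name are placeholders, not my real schema, and `fetch_employee_info` stands in for my real per-employee query):

```python
import threading

import MySQLdb  # the driver I already use; connection details below are made up


def fetch_employee_info(cursor, emp_id):
    # Placeholder for my real per-employee query.
    cursor.execute("SELECT info FROM employees WHERE id = %s", (emp_id,))
    return [row[0] for row in cursor.fetchall()]


def worker(id_group, result_dict):
    # Step 4: one connection and one partial dictionary per thread.
    conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="company")
    cursor = conn.cursor()
    for emp_id in id_group:
        result_dict[emp_id] = fetch_employee_info(cursor, emp_id)
    conn.close()


def fetch_all(employee_ids, n_threads=10):
    groups = [employee_ids[i::n_threads] for i in range(n_threads)]  # step 2
    partials = [{} for _ in groups]
    threads = [threading.Thread(target=worker, args=(group, part))   # step 3
               for group, part in zip(groups, partials)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    infos = {}  # step 5: combine the partial dictionaries
    for part in partials:
        infos.update(part)
    return infos
```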

Is it possible? Would this method reduce the execution time by a factor of 10?

alwbtc
  • 76 minutes is an incredibly long time. Are you connecting to the database 4000 times, or do you connect just once? – John Gordon Sep 15 '16 at 16:17
  • I connect just once, but if I start 10 threads, I think I will have to use 10 separate connectors, right? – alwbtc Sep 15 '16 at 16:21
  • Can you not just load the entire Employee data in a pandas dataframe in one go? And then do the remainder of the processing in Python? – Alex Sep 15 '16 at 16:22
  • Excuse me, what is pandas? The employee info is in a MySQL database. – alwbtc Sep 15 '16 at 16:27
  • Are you _sure_ you're not connecting 4000 times? If you have a module or function that connects to the database, queries one employee ID, and then disconnects, and you call that module 4000 times... – John Gordon Sep 15 '16 at 16:34
  • @JohnGordon no, I use `MySQLdb` module, it has a `Connection` class. I create this object once in the beginning, and then pass SQL queries to it. It uses these queries to fetch employee infos. It stays connected all the time, ie it doesn't get disconnected after a query. – alwbtc Sep 15 '16 at 16:40
  • It will be a lot faster to issue one query that returns all employees, and then just discard any rows you don't care about. – John Gordon Sep 15 '16 at 16:45
  • @JohnGordon hmm, interesting. I will give it a try. But I'm disappointed with `threading` module – alwbtc Sep 15 '16 at 19:23
  • @JohnGordon you were right, I fetched whole data from database to an array first, then discard the items I don't want. Execution time is reduced to 10 seconds. Thanks. – alwbtc Sep 17 '16 at 10:51
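
For reference, a minimal sketch of that single-query approach (assuming the same `MySQLdb` connection style; the connection details, table name, and column name are illustrative, and `employee_ids` is the list of IDs collected beforehand):

```python
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="company")
cursor = conn.cursor()

# One round trip instead of 4000: fetch every employee, then filter in Python.
cursor.execute("SELECT id, info FROM employees")
rows = cursor.fetchall()
conn.close()

wanted_ids = set(employee_ids)  # the IDs we actually care about
infos = {emp_id: [info] for emp_id, info in rows if emp_id in wanted_ids}
```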

1 Answer


I think you're confusing threading with parallelism.

Threading simply means using multiple threads of execution. It doesn't mean multiple actions happen simultaneously, though... your processor still has to switch between the threads. This technique is useful when you expect to wait a long time for an operation to complete (like reading a large file) and want other things to happen in the meantime (printing a progress bar, for example).

Parallelism is when those threads actually run at the same time, because different threads get assigned to different cores in the processor. This is, of course, dependent on the hardware (you need a multi-core processor to accomplish it).

Python's threading module allows multiple threads to be created, sure; and in an ideal world those threads would run in parallel. But the default Python interpreter, CPython, does not allow threads to execute Python bytecode in parallel because of the "Global Interpreter Lock". (See this page just above the section 17.1.3 heading and this SO post for more information about the GIL.) In fact, using threading the way you suggest may even make your execution slower (see here).

I've not done parallel processing in Python myself, but I would advise you to check out the multiprocessing module as a possible solution.
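
A rough, untested sketch of that direction (the connection parameters, table name, and column name are placeholders; each worker process opens its own database connection, since a connection can't be shared across processes):

```python
import multiprocessing

import MySQLdb  # connection details, table, and column names below are illustrative


def fetch_group(id_group):
    # Each worker process opens its own connection.
    conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="company")
    cursor = conn.cursor()
    partial = {}
    for emp_id in id_group:
        cursor.execute("SELECT info FROM employees WHERE id = %s", (emp_id,))
        partial[emp_id] = [row[0] for row in cursor.fetchall()]
    conn.close()
    return partial


def fetch_all(employee_ids, n_workers=10):
    # Split the IDs into n_workers groups and fetch each group in its own process.
    groups = [employee_ids[i::n_workers] for i in range(n_workers)]
    pool = multiprocessing.Pool(n_workers)
    try:
        partials = pool.map(fetch_group, groups)
    finally:
        pool.close()
        pool.join()

    # Combine the partial dictionaries into one.
    infos = {}
    for partial in partials:
        infos.update(partial)
    return infos


if __name__ == "__main__":  # this guard is required for multiprocessing on Windows
    print(fetch_all(["ID1", "ID2"]))
```

Whether this actually helps depends on where the time is going; if the database itself is the bottleneck, more workers won't fix it.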

Pierce Darragh
  • So you're saying it is not possible on a Windows 7 machine using Intel Core I5 CPU? Each thread will fetch info into a separate dictionary (ie they won't use the same memory location), and will use separate database connector objects, is it still not possible? – alwbtc Sep 15 '16 at 16:34
  • @alwbtc `threading` in the default Python interpreter will not allow for concurrent execution. The OS doesn't matter, and the number of cores available does not matter. Look into `multiprocessing` or a different implementation of Python other than CPython if you desperately want concurrent execution, but I think there's something else wrong in your code if that query takes over an hour to complete. – Pierce Darragh Sep 15 '16 at 16:37
  • It depends where the slowdown is happening. If the database can only respond to one query per second, then even if you were able to _send_ the queries more quickly, it wouldn't help. – John Gordon Sep 15 '16 at 16:38