2

I'm fairly new to Python, and I'm trying to build some huge lists (filled with random letters). It currently takes around 75-80 seconds on my machine for 2,000,000 lines.

import timeit
import random, string

global_tab     = []
global_nb_loop = 2000000

print("Generate %d lines" % global_nb_loop)
global_tab = []
for x in range(global_nb_loop):
    global_tab.append(("".join( [random.choice(string.ascii_letters) for i in range(15)] ), "".join( [random.choice(string.digits) for i in range(2)])))
print("%d lines generated" % len(global_tab))

And the result with the Linux `time` command:

$ time python3 DEV/PyETL/generateList.py 
Generate 2000000 lines
2000000 lines generated

real    1m16.844s
user    1m16.609s
sys 0m0.203s

I was surprised, when monitoring system resources, to see that only one core was at 100%, instead of 4 as on a Windows machine where I also tested this.

Of course I've tried using threads, but I'm facing a problem: the threaded version takes more time than running on a single core. Maybe threads are not the solution, or maybe I'm using them wrong.

Here is the new code:

import random, string
import threading

global_tab         = []
global_nb_threads  = 4
global_nb_loop     = 2000000


threadLock         = threading.Lock()

class generateList(threading.Thread):
    def __init__(self, name):
        threading.Thread.__init__(self)
        self.name = name

    def run(self):
        global global_tab
        self.tab = []

        print("[%s] Generate %d lines" % (self.name, int(global_nb_loop/global_nb_threads)))
        # divide the desired number of lines among the threads
        for x in range(int(global_nb_loop/global_nb_threads)):
            self.tab.append(("".join( [random.choice(string.ascii_letters) for i in range(15)] ), "".join( [random.choice(string.digits) for i in range(2)])))

        threadLock.acquire()
        global_tab += self.tab
        threadLock.release()
        del self.tab
        print("[%s] %d lines in list" % (self.name, len(global_tab)))


threads = []
for i in range(global_nb_threads):
    # Create and start threads, keeping a reference to each one
    t = generateList("Thread-" + str(i))
    t.start()
    threads.append(t)

for t in threads:
    # Wait for every thread to finish (not only the last one created)
    t.join()

And the execution:

$ time python3 DEV/PyETL/generateListThreads.py 
[Thread-0] Generate 500000 lines
[Thread-1] Generate 500000 lines
[Thread-2] Generate 500000 lines
[Thread-3] Generate 500000 lines
[Thread-3] 500000 lines in list
[Thread-0] 1000000 lines in list
[Thread-2] 1500000 lines in list
[Thread-1] 2000000 lines in list

real    1m40.858s
user    1m41.208s
sys 0m0.916s

That's 32 seconds more than the single-core run at 100%, but monitoring shows that all 8 cores were at 20-40% load at the same time.

Since all the threads run at the same time, each generating fewer rows and synchronizing only to update a global variable, shouldn't the execution time be lower than with a single core?

aviel
3 Answers

2

I am pretty sure your lock is not necessary and is slowing you down. (Edit: actually, I just noticed the lock is only taken after the majority of the work is done, so it isn't really relevant.)

`global_tab += self.tab` is (I think) atomic thanks to the Python GIL. (Actually, the reference I found only makes that claim for `list.extend()`, so use that instead. Here's another reference: Are lists thread safe?)
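
To illustrate the claim, here is a minimal sketch (not part of the posted code): several threads call `extend()` on a shared list with no lock, and no items are lost.

import threading

shared = []

def worker():
    # extend() with a plain list argument should complete as a single
    # operation under the GIL, so other threads cannot interleave with it
    local = [0] * 100000
    shared.extend(local)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared))  # expected: 800000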

Alternatively, I would try `multiprocessing.imap_unordered` with a large chunksize. The downside is that the results have to be sent back to the parent process, but the random string generation might overshadow that cost.

import multiprocessing
import random
import string

def randomword(x):
    return ''.join(random.choice(string.ascii_letters) for i in range(15))

pool = multiprocessing.Pool(8)
results = pool.imap_unordered(randomword, range(100))
print([r for r in results])

For 2 million strings (I changed it to print the length):

$ time python r.py                                                                 
2000000

real    0m38.305s
user    1m31.717s
sys     0m25.853s

I also tried cleaning up your version a bit and got:

$ time python rr.py 
[Thread-0] Generate 250000 lines
[Thread-1] Generate 250000 lines
[Thread-2] Generate 250000 lines
[Thread-3] Generate 250000 lines
[Thread-4] Generate 250000 lines
[Thread-5] Generate 250000 lines
[Thread-6] Generate 250000 lines
[Thread-7] Generate 250000 lines
[Thread-4] 250000 lines in list
[Thread-1] 500000 lines in list
[Thread-7] 750000 lines in list
[Thread-0] 1000000 lines in list
[Thread-6] 1250000 lines in list
[Thread-2] 1500000 lines in list
[Thread-3] 1750000 lines in list
[Thread-5] 2000000 lines in list

real    0m22.113s
user    0m24.969s
sys     0m5.537s

A couple of significant changes:

  • use xrange() on the large ranges (ah, Python 3's range() already does this.)
  • remove the threadlock
  • use extend() on the global.

(my results were about the same when just appending to the global_tab, btw, and leaving out the temporary list.)

import random, string
import threading

global_tab         = []
global_nb_threads  = 8
global_nb_loop     = 2000000

class generateList(threading.Thread):
    def __init__(self, name):
        threading.Thread.__init__(self)
        self.name = name

    def run(self):
        global global_tab
        self.tab = []

        print("[%s] Generate %d lines" % (self.name, int(global_nb_loop/global_nb_threads)))
        for x in range(int(global_nb_loop/global_nb_threads)):
            self.tab.append(("".join( [random.choice(string.ascii_letters) for i in range(15)] ), "".join( [random.choice(string.digits) for i in range(2)])))

        global_tab.extend(self.tab)
        print("[%s] %d lines in list" % (self.name, len(global_tab)))


threads = []
for i in range(global_nb_threads):
    t = generateList("Thread-" + str(i))
    t.start()
    threads.append(t)

# join every thread, not just the last one created
for t in threads:
    t.join()

...but single-threaded is still slightly faster, at 16 seconds.

If I tune multiprocessing, I can get it down to 6 seconds:

size = 2000000
processes = 8
pool = multiprocessing.Pool(processes)
results = [r for r in pool.imap_unordered(randomword, range(size), chunksize=int(size/processes))]
print(len(results))

output:

$ time python r.py                                                                 
2000000

real    0m5.713s
user    0m35.594s
sys     0m0.546s

...so I think that's my final answer: Use multiprocessing.

rrauenza
  • Really amazing: your first solution was really slow (2,000,000 in 3m8s, with 8 cores at 50% load), but your multiprocessing tuning is really impressive: 2,000,000 in only 17s with 8 cores fully loaded! I'll do some more tests trying the various map and apply variants as Bi Rico suggested, but I don't think I'll reduce the time more than you did. Thanks. – aviel Jun 18 '16 at 15:12
  • You can also (for the heck of it) try `multiprocessing.dummy`, which uses threads; see the sketch after these comments. – rrauenza Jun 18 '16 at 15:16
  • I'll try this evening, and compare. Thanks again – aviel Jun 18 '16 at 15:17
  • The downside to `multiprocessing` (just glancing at the source) is that I think it uses pickle and sockets to communicate. That adds a bit of overhead. – rrauenza Jun 18 '16 at 15:31
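
For reference, a minimal sketch of the `multiprocessing.dummy` variant mentioned in the comments (an illustration reusing the `randomword` idea from above, not a benchmarked solution): the `Pool` interface is the same, but it is backed by threads, so CPU-bound work like this stays subject to the GIL and should not be expected to speed up.

from multiprocessing.dummy import Pool as ThreadPool
import random
import string

def randomword(x):
    return ''.join(random.choice(string.ascii_letters) for i in range(15))

# same Pool API as multiprocessing, but the workers are threads
pool = ThreadPool(8)
results = pool.map(randomword, range(2000000))
print(len(results))
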
2

From the Python threading docs:

CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.

Basically this means that threads in Python don't improve performance unless the threads are mostly waiting for something to happen. Multiprocessing works pretty well in Python, but because the processes don't share any objects or global state, the programming model is a little different. Here is an example of how you could use multiprocessing:

import multiprocessing
import random
import string

def randomData(i):
    data = ("".join(random.sample(string.ascii_letters, 15)),
            "".join(random.sample(string.digits, 2)))
    return data

global_nb_loop = 2000000
pool = multiprocessing.Pool(8)
results = pool.imap(randomData, range(global_nb_loop))
global_tab = list(results)
print(len(global_tab))

The multiprocessing module has a lot of versions of map and apply, i.e. imap, map_async, and so on. Look through the docs to find the one that's best for your problem.
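
A quick sketch of a few of those variants side by side (illustrative only, with a small input size; `randomData` is the same function as above):

import multiprocessing
import random
import string

def randomData(i):
    return ("".join(random.sample(string.ascii_letters, 15)),
            "".join(random.sample(string.digits, 2)))

if __name__ == '__main__':
    with multiprocessing.Pool(8) as pool:
        # map: blocks until all results are ready, returned as an ordered list
        ordered = pool.map(randomData, range(10000))

        # imap: lazy iterator, still ordered; chunksize reduces IPC overhead
        lazy = list(pool.imap(randomData, range(10000), chunksize=1000))

        # map_async: returns immediately; call .get() when the results are needed
        async_res = pool.map_async(randomData, range(10000))
        print(len(ordered), len(lazy), len(async_res.get()))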

Bi Rico
  • Thanks for the reply. I thought about using multiprocessing too but didn't test it before asking the question; I stopped at the fact that threads were no better than a single core at 100% load. I'll run some tests and see, but rrauenza already has a very good optimisation in his answer, which generates my list in 17 seconds on the first run. – aviel Jun 18 '16 at 15:03
  • This is very useful on Linux, but when testing on Windows I had to add these instructions: `if __name__ == '__main__': multiprocessing.freeze_support()` – aviel Jun 20 '16 at 08:53
  • On a better machine than mine, it takes 1m15s to run, while numpy runs in less than 4s. I'll do better tests at home tonight, but I need the code to be efficient on all platforms. It seems that Windows will cause performance issues. – aviel Jun 20 '16 at 09:31
1

Since you are working with large amounts of data, I recommend taking a look at numpy. numpy arrays are generally slower than plain lists for element-by-element work, but they are more memory efficient and really good for many vectorized operations. You can always go down the multiprocessing route, even with numpy.

Here is a version that runs about 3x faster than the original code (for reference, the original version ran in 30.3 s on my machine).

import numpy as np


def numpy_test(N=2000000):
    global_nb_loop = N 
    global_tab     = []
    asc_list = list('abcdefghijklmnopqrstuvwxyz')

    print("Generate %d lines" % global_nb_loop)
    global_tab = [(u.tostring(),str(v)) for u,v in zip( np.random.choice(asc_list, (N, 15)), np.random.randint(10, 100, N) )]
    print("%d lines generated" % len(global_tab))


In [306]: %timeit numpy_test()
Generate 2000000 lines
2000000 lines generated
Generate 2000000 lines
2000000 lines generated
Generate 2000000 lines
2000000 lines generated
Generate 2000000 lines
2000000 lines generated
1 loop, best of 3: 11.1 s per loop
clocker
  • Thanks for this reply; as far as I can see it runs faster than the multiprocessing solution given by rrauenza (14.3s vs 21.8s). Since it only uses 1 core at 100% load, I'll try to combine numpy with multiprocessing and see if I get better results. I have a little problem with the generated output: it contains tuples of binary and string, while I was expecting tuples of two strings, or of a string and an int. I'll look into how to convert the binary data to a string. – aviel Jun 19 '16 at 18:51
  • You're welcome. I was under the impression you needed something that looks like this in global_tab: `[('pkiwgbxjqfrwamf', '26'), ('ronabuyobwmnffb', '39'), ('ekharacqatebpnz', '52')]`. As a side note, there is a really good essay on optimization in the standard Python docs under [link](https://www.python.org/doc/essays/list2str). The speedup factors don't match what I observe on my machine, but the ranking of the functions in terms of speed does. – clocker Jun 19 '16 at 19:38
  • Thanks for the link, I'll look at it tomorrow (time for bed now). Your impression was right, I need something like what you described in global_tab, but when I print your answer's output I get tuples like this one: [(b'v\x00\x00\x00n\x00\x00\x00m\x00\x00\x00f\x00\x00\x00s\x00\x00\x00k\x00\x00\x00v\x00\x00\x00m\x00\x00\x00s\x00\x00\x00j\x00\x00\x00l\x00\x00\x00v\x00\x00\x00t\x00\x00\x00h\x00\x00\x00c\x00\x00\x00', '42')]. If I remove all the \x00, I can find the randomized word. Any ideas? – aviel Jun 19 '16 at 20:55
  • Ah, the issue might be to do with Python 3 vs Python 2. The solution I posted uses Python 2, not Python 3, so it needs to be adapted accordingly. My experience with Python 3 is minimal, but I think strings have a different representation; my guess is the asc_list shortcut needs modification (see the sketch below). Sorry for the confusion. – clocker Jun 19 '16 at 23:04
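
For what it's worth, a possible Python 3 adaptation of the numpy version above (a rough sketch, not benchmarked): joining the character rows directly yields plain strings, instead of the raw UTF-32 bytes that tostring() returns under Python 3.

import numpy as np

def numpy_test_py3(N=2000000):
    # Same idea as the numpy version above, but build plain str values by
    # joining each character row instead of calling tostring(), which is what
    # produced the b'v\x00\x00\x00...' output under Python 3.
    asc_list = list('abcdefghijklmnopqrstuvwxyz')
    letters = np.random.choice(asc_list, (N, 15))
    numbers = np.random.randint(10, 100, N)
    return [("".join(row), str(v)) for row, v in zip(letters, numbers)]

print(len(numpy_test_py3(10000)))  # small run as a quick sanity check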