
I've been calculating millions and millions of different combinations of the string below, but I'm only getting through roughly 1,750 combinations a second, which isn't anywhere near the speed I need. So how would I reshape this so that multiple processes of the same thing calculate different parts, without recalculating parts that have already been done, while still keeping fast speeds? The code below is partially what I've been using. Any examples would be appreciated!

from itertools import product
for chars in product("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890!@#$%^&*?,()-=+[]/;", repeat=4):
    print chars

2 Answers


One way to break the product up into parts is to split on its first component, so that each independent job handles all the elements starting with a certain set of first letters. For example:

import string
import multiprocessing as mp
import itertools

alphabet = string.ascii_letters+string.digits+"!@#$%^&*?,()-=+[]/;"
num_parts = 4
part_size = len(alphabet) // num_parts

def do_job(first_bits):
    for x in itertools.product(first_bits, alphabet, alphabet, alphabet):
        print(x)

if __name__ == "__main__":
    pool = mp.Pool()
    results = []
    for i in xrange(num_parts):
        if i == num_parts - 1:
            first_bit = alphabet[part_size * i :]
        else:
            first_bit = alphabet[part_size * i : part_size * (i+1)]
        # Pass the function and its argument separately; calling do_job(first_bit)
        # here would run all the work in the parent process instead of the pool.
        results.append(pool.apply_async(do_job, (first_bit,)))

    pool.close()
    pool.join()

(where obviously you'd only use results if do_job actually returned something).
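
For instance, here's a minimal sketch of what using the results could look like; the count_containing function is just a made-up stand-in for whatever per-combination work you actually need:

import string
import multiprocessing as mp
import itertools

alphabet = string.ascii_letters + string.digits + "!@#$%^&*?,()-=+[]/;"
num_parts = 4
part_size = len(alphabet) // num_parts

def count_containing(first_bits):
    # Stand-in for real work: count how many combinations in this slice contain "a".
    return sum("a" in x for x in
               itertools.product(first_bits, alphabet, alphabet, alphabet))

if __name__ == "__main__":
    pool = mp.Pool(processes=4)
    results = []
    for i in range(num_parts):
        if i == num_parts - 1:
            first_bit = alphabet[part_size * i :]
        else:
            first_bit = alphabet[part_size * i : part_size * (i+1)]
        # Pass the function and its arguments separately so the work runs in a worker.
        results.append(pool.apply_async(count_containing, (first_bit,)))
    pool.close()
    pool.join()
    print(sum(r.get() for r in results))  # combine the per-slice results

apply_async returns an AsyncResult, so each worker's return value is collected with .get() once the pool has finished.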

  • Okay, I did this and it completed itself in about 10 seconds. The bad thing is that all my computer's resources were taken up and it almost had a heart attack. How can I limit the number of threads? I had my task manager up and saw about 80 python processes running at once. – Noah R Apr 21 '12 at 19:29
  • :) That's what the `...` I left out in `multiprocessing.Pool` is for. Try e.g. `Pool(processes=4)`. See [the multiprocessing docs](http://docs.python.org/library/multiprocessing.html) for more. – Danica Apr 21 '12 at 19:31
  • Well I switched it to 4 processes and it got even worse: there were a couple hundred running at once, and then my computer crashed. – Noah R Apr 21 '12 at 19:38
  • If you say 4 processes, then `multiprocessing` should only be starting 4 processes -- are you doing something else in your code that launches Python processes? – Danica Apr 21 '12 at 19:39
  • No I literally copied the code above and made a new Python project with it and ran the code switching the pool to 4 processes. – Noah R Apr 21 '12 at 19:41
  • I just updated the question with the exact code I ran (by putting it in `test_multi.py` and running `python test_multi.py`, which is Python 2.7.2), which started four processes and quite happily zipped along printing out tuples. There were five processes running (one master plus four in the pool), and everything was fine except my terminal used 100% CPU trying to keep up with all the output. Is this not what you're seeing? How are you running it? – Danica Apr 21 '12 at 19:47
  • Alright, I ran the script with 2 processes, got 80% CPU, and probably 400+ processes all using about 1200KB of memory each on average. It starts out with roughly 5 but then they multiply into the hundreds. I tried taking a screenshot but my computer crashed again. I checked the updated code and it matches mine, except I only put 2 processes this third time around. Really didn't help. – Noah R Apr 21 '12 at 19:58
  • What's more interesting is that I used one of the documentation's examples and it worked creating the specified processes. So I think it's something in your code. – Noah R Apr 21 '12 at 20:07
  • @NoahR I bet you're on Windows, right? On Windows, `multiprocessing` re-imports the module in each child process (unlike Unix OSes like mine, where that doesn't happen), so each element of the pool was spawning its own pool. Try the new version with the `__main__` check. – Danica Apr 21 '12 at 20:19
  • @Dougal Thanks for this awesome example code you have written. I am trying to do something similar... so let's say that instead of printing x you are looking for a match. How do you exit all processes when a match is found? – user2109254 Jun 07 '16 at 11:58
  • @user2109254 You'd probably want to iterate over [`imap_unordered`](https://docs.python.org/3.5/library/multiprocessing.html#multiprocessing.pool.Pool.imap_unordered), having `do_work` return when it sees a result, and then when you get a match call [`pool.terminate()`](https://docs.python.org/3.5/library/multiprocessing.html#multiprocessing.pool.Pool.terminate) (and then `join()`). – Danica Jun 07 '16 at 15:28
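
A minimal sketch of the early-exit approach described in the last comment, using `imap_unordered` and `pool.terminate()`; the `target` string and `find_match` helper here are made up purely for illustration:

import string
import multiprocessing as mp
import itertools

alphabet = string.ascii_letters + string.digits + "!@#$%^&*?,()-=+[]/;"
target = "Ab1!"  # hypothetical string we're searching for

def find_match(first_bits):
    # Scan this worker's slice of the product; return the match, or None if absent.
    for x in itertools.product(first_bits, alphabet, alphabet, alphabet):
        if "".join(x) == target:
            return x
    return None

if __name__ == "__main__":
    num_parts = 8
    part_size = -(-len(alphabet) // num_parts)  # ceiling division
    chunks = [alphabet[i:i + part_size] for i in range(0, len(alphabet), part_size)]

    pool = mp.Pool(processes=4)
    # imap_unordered yields each worker's result as soon as that worker finishes.
    for result in pool.imap_unordered(find_match, chunks):
        if result is not None:
            print("found: " + "".join(result))
            pool.terminate()  # stop the remaining workers immediately
            break
    else:
        pool.close()  # no match anywhere; let the workers finish normally
    pool.join()

Note that a worker only reports back after finishing its whole chunk, so smaller chunks let you stop sooner at the cost of a bit more task overhead.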

Are you sure you're only getting 1750 combinations per second? I'm getting about 10 million.

import time
from itertools import product

def test(n):
    start = time.time()
    count = 0
    for chars in product("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890!@#$%^&*?,()-=+[]/;", repeat=4):
        count += 1
        if count == n:
            break
    return time.time() - start

>>> test(10000)
0.03300023078918457
>>> test(1000000)
0.15799999237060547
>>> test(10000000)
1.0469999313354492

I don't think my computer is that much faster than yours.

note: I posted this as an answer because I wanted to show code. It's really more of a comment. So please, no upvotes or downvotes.

  • The difference is probably that the OP is `print`ing while you're just looping through; I/O is slow. – Danica Apr 21 '12 at 19:27
  • Well my main script is actually saving to a database as it's calculating different combinations so that slows it down a lot more too. – Noah R Apr 21 '12 at 19:31
  • @Dougal: Agreed, I/O is slow. His question identifies `itertools.product` as the bottleneck. If he's doing his timing with I/O, this should clue him in. – Steven Rumbalski Apr 21 '12 at 19:33
  • @Noah R: Based on your comment, I'd say you've misdiagnosed the real bottleneck. It doesn't matter how many processes you chop this up into; I/O will likely remain your bottleneck. Distributing the I/O to multiple processes, each with its own connection to your database, would probably be even slower. – Steven Rumbalski Apr 21 '12 at 19:36
  • Well, since he said it completed quickly using an enormous `multiprocessing.Pool`, clearly parallelism is helping him out... but yes, `itertools.product` is also clearly not the actual bottleneck. – Danica Apr 21 '12 at 19:37
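
To see how much the output itself costs, here is a rough sketch comparing a bare loop against one that writes each combination to stdout; the smaller alphabet and the timings you get back are illustrative and will vary by machine and terminal:

import sys
import time
from itertools import product

alphabet = "abcdefghijklmnopqrstuvwxyz"  # smaller alphabet just to keep the demo quick

def time_product(do_print, n=200000):
    # Time n iterations of the product, optionally writing each one out.
    start = time.time()
    for count, chars in enumerate(product(alphabet, repeat=4)):
        if do_print:
            sys.stdout.write("".join(chars) + "\n")
        if count == n:
            break
    return time.time() - start

if __name__ == "__main__":
    print("loop only : %.3fs" % time_product(False))
    print("with print: %.3fs" % time_product(True))

If the bare loop is orders of magnitude faster, the bottleneck is the output (or the database writes), not itertools.product.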