I'm creating a word list using Python that hits every combination of characters, which is a monster of a calculation past 94^4. Before you ask where I'm getting 94: 94 covers ASCII characters 32 to 125. Understandably this function runs super slow; I'm curious if there's a way to make it more efficient.

This is the meat and potatoes of my code:

import itertools

def CreateTable(name,ASCIIList,size):
    f = open(name + '.txt','w')
    # Every possible string of length `size` over the given character set.
    combo = itertools.product(ASCIIList, repeat = size)
    for x in combo:
        passwords = ''.join(x)
        f.write(passwords + '\n')
    f.close()

I'm using this so that I can make lists to use in a brute force where I don't know the length of the password or what characters it contains. Using a list like this, I hit every possible combination of words, so I'm sure to hit the right one eventually. As stated earlier, this is a slow program; it is also slow to read back in, so it won't be my first choice for a brute force. It's more or less a last-ditch effort.

To give you an idea of how long that piece of code runs: creating all the combinations of size 5 ran for 3 hours, ending at a little over 50GB.

  • How long does it take? How much faster do you need it to be? – Blorgbeard Nov 21 '17 at 21:31
  • Note that you are writing a ~450MB file to disk here. Adding one more character takes it to ~50 **GB**. This is the problem with brute-force. – Blorgbeard Nov 21 '17 at 21:35
  • To give you an idea of how long that piece of code runs. I was creating all the combinations of size 5 and ran for 3 hours ending at a little over 50GB. I know that brute force is a slow and painful process but sometimes it's the only way. I just need a way to handle `combo` in smaller chunks which @pookie gave me the ground work for. – AwesomeBob2341 Nov 22 '17 at 23:32
  • I imagine that the bottleneck here is (should be) disk-write speed. So you should be able to do a lot better than 3 hours for 50GB. I would suggest looking at [buffering](https://stackoverflow.com/questions/3167494/how-often-does-python-flush-to-a-file) rather than multithreading. I hope you don't want to go much higher than 5 characters though - size 6 would take about 5.5TB. Size 7 would be about 580TB. Size 8: 60PB! – Blorgbeard Nov 22 '17 at 23:53
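
For anyone trying the buffering route from that last comment, here is a minimal sketch, assuming Python 3's open() with an explicit buffering argument (the 1 MiB buffer size is an arbitrary example, not a figure from the thread):

import itertools

# Buffering sketch: a large explicit buffer means Python flushes to
# disk in big blocks instead of once per tiny write.
def CreateTableBuffered(name, ASCIIList, size):
    with open(name + '.txt', 'w', buffering=1024 * 1024) as f:
        for x in itertools.product(ASCIIList, repeat=size):
            f.write(''.join(x) + '\n')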

2 Answers


Warning: I have not tested this code.

I would convert `combo` to a list: `combo_list = list(combo)`. I would then break it into chunks:

# https://stackoverflow.com/a/312464/596841
def get_chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

# Change 1000 to whatever works.
chunks = get_chunks(combo_list, 1000)
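
One caveat: `list(combo)` materializes every combination in memory at once, which is billions of tuples at size 5. A lazier sketch of the same chunking idea, using itertools.islice on the generator directly rather than slicing a list (my variant, not from the original answer):

import itertools

def get_chunks_lazy(iterable, n):
    """Yield successive n-sized chunks without building a full list."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk

# Works directly on the itertools.product generator:
# chunks = get_chunks_lazy(combo, 1000)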

Next, I would use multithreading to process each chunk:

import threading

class myThread(threading.Thread):
    def __init__(self, chunk_id, chunk):
        threading.Thread.__init__(self)
        self.chunk_id = chunk_id
        self.chunk = chunk

    def run(self):
        print("Starting " + str(self.chunk_id))
        process_data(self.chunk_id, self.chunk)
        print("Exiting " + str(self.chunk_id))

def process_data(chunk_id, chunk):
    # Each chunk is written to its own numbered file.
    f = open(str(chunk_id) + '.txt', 'w')
    for item in chunk:
        passwords = ''.join(item)
        f.write(passwords + '\n')
    f.close()

I would then do something like this:

threads = []
for i, chunk in enumerate(chunks):
    thread = myThread(i, chunk)
    thread.start()
    threads.append(thread)

# Wait for all threads to complete
for t in threads:
    t.join()

You could then write another script to merge all the output files, if you need.
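
If you do want that merge step, a minimal sketch, assuming the numbered chunk files (0.txt, 1.txt, ...) produced above:

import shutil

def merge_files(num_chunks, out_name):
    # Concatenate 0.txt .. (num_chunks-1).txt into one wordlist.
    with open(out_name, 'wb') as out:
        for i in range(num_chunks):
            with open(str(i) + '.txt', 'rb') as part:
                shutil.copyfileobj(part, out)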

pookie
  • That's actually not a bad idea. I had thought about splitting the output into smaller chunks, and I also had the idea to print it to multiple files, but wasn't sure where to start. I'll be playing around with this code for a while. Thanks! – AwesomeBob2341 Nov 22 '17 at 23:21
  • Great, glad I could help. Good luck and have fun! – pookie Nov 22 '17 at 23:44

I did some testing on this, and I think the main problem is that you're writing in text mode.

Binary mode is faster, and you're only dealing with ASCII, so you might as well just spit out bytes rather than strings.

Here's my code:

import itertools
import time

def CreateTable(name,ASCIIList,size):
    f = open(name + '.txt','w')
    combo = itertools.product(ASCIIList, repeat = size)
    for x in combo:
        passwords = ''.join(x)
        f.write(str(passwords) + '\n')
    f.close()

def CreateTableBinary(name,ASCIIList,size):
    # Same loop, but writing bytes instead of text.
    f = open(name + '.txt', 'wb')
    combo = itertools.product(ASCIIList, repeat = size)
    for x in combo:
        # product over a bytes object yields tuples of ints,
        # which bytes() packs straight back into a bytes string.
        passwords = bytes(x)
        f.write(passwords)
        f.write(b'\n')
    f.close()

def CreateTableBinaryFast(name,first,last,size):
    # Manual product: treat the bytearray like an odometer,
    # incrementing the last byte and carrying leftwards.
    f = open(name + '.txt', 'wb')
    x = bytearray(chr(first) * size, 'ASCII')
    while True:
        f.write(x)
        f.write(b'\n')

        # Carry: reset trailing bytes that have hit `last`.
        i = size - 1
        while (x[i] == last) and (i > 0):
            x[i] = first
            i -= 1
        if i == 0 and x[i] == last:
            break  # every position has rolled over: we're done
        x[i] += 1
    f.close()

def CreateTableTheoreticalMax(name,ASCIIList,size):
    # Writes a constant dummy line the same number of times,
    # to measure pure disk throughput with no string building.
    f = open(name + '.txt', 'wb')
    combo = range(0, len(ASCIIList)**size)
    passwords = b'A' * size
    for x in combo:
        f.write(passwords)
        f.write(b'\n')
    f.close()

print("writing real file in text mode")
start = time.time()
chars = [chr(x) for x in range(32, 126)]
CreateTable("c:/temp/output", chars, 4)
print("that took ", time.time() - start, "seconds.")

print("writing real file in binary mode")
start = time.time()
chars = bytes(range(32, 126))
CreateTableBinary("c:/temp/output", chars, 4)
print("that took ", time.time() - start, "seconds.")

print("writing real file in fast binary mode")
start = time.time()
CreateTableBinaryFast("c:/temp/output", 32, 125, 4)
print("that took ", time.time() - start, "seconds.")

print("writing fake file at max speed")
start = time.time()
chars = [chr(x) for x in range(32, 126)]
CreateTableTheoreticalMax("c:/temp/output", chars, 4)
print("that took ", time.time() - start, "seconds.")

Output:

writing real file in text mode
that took  101.5869083404541 seconds.
writing real file in binary mode
that took  40.960529804229736 seconds.
writing real file in fast binary mode
that took  35.54869604110718 seconds.
writing fake file at max speed
that took  26.43029284477234 seconds.

So you can see a pretty big improvement just by switching to binary mode.

Also, there still seems to be some slack to take up, since omitting the itertools.product and writing hard-coded bytes is even faster. Maybe you could write your own version of product that directly outputs bytes-like objects. Not sure about that.

Edit: I had a go at a manual itertools.product working directly on a bytearray. It's a bit faster - see "fast binary mode" in the code.

Blorgbeard