
Does anyone know how I can optimize this code to handle larger files? It works with smaller inputs, but I need it to run on a file with over 200,000 words. Any suggestions?

Thank you.

import random
import re

def quick_sort(a,i,n):
    if n <= 1:
        return
    mid = (len(a)) // 2
    x = a[random.randint(0,len(a)-1)]
    p = i - 1
    j = i
    q = i + n
    while j < q:
        if a[j] < x:
            p = p + 1
            a[j],a[p] = a[p],a[j]
            j = j + 1
        elif a[j] > x:
            q = q - 1
            a[j],a[q] = a[q],a[j]
        else:
            j = j + 1
    quick_sort(a,i,p-i+1)
    quick_sort(a,q,n-(q-i))

file_name = input("Enter file name: ")
my_list = []
with open(file_name,'r') as f:     
    for line in f:                     
        line = re.sub('[!#?,.:";\']', '', line).lower()
        token = line.split()    
        for t in token:
            my_list.append(t)

a = my_list
quick_sort(a,0,len(my_list))
print("List After Calling Quick Sort: ",a)
  • used sorted() or sort() – Joshua Varghese May 01 '20 at 02:38
  • Possibly related: https://stackoverflow.com/questions/7918060/how-do-i-sort-very-large-files – xdhmoore May 01 '20 at 02:38
  • @xdhmoore the thread you linked belongs to the Java category – Joshua Varghese May 01 '20 at 02:41
  • @Joshua Varghese It does, but it's a similar problem algorithmically. – xdhmoore May 01 '20 at 02:43
  • Is there any solution to do this with quicksort in particular? My merge sort code works, but my quicksort isn't as successful. Also, I can't use functions like sort(). – john1999 May 01 '20 at 02:48
  • There seems to be a description of "external quicksort" here: https://en.wikipedia.org/wiki/Quicksort#Variants – xdhmoore May 01 '20 at 02:54
  • Decorate your function with `@functools.lru_cache()` (put it on the line above `def quick_sort():` and put `import functools` at the top of the file). For recursion-intensive functions like yours it may help. Run it and see; if it doesn't help, it only takes 30 seconds to undo the change :) – Niloct May 01 '20 at 02:57
  • @functools.lru_cache() gives me a TypeError: "unhashable type: 'list'". I'll try the external quicksort and give you guys an update if it works. – john1999 May 01 '20 at 03:10
  • @Niloct: Caching is inappropriate here. To start with, quicksort never recurses on the same arguments unless you call it more than once. And it modifies the list in place, so two calls with the same arguments may need to do different things depending on the contents of the list each time (if the list was changed between the calls, the second one needs to run; you can't rely on the cached result). – Blckknght May 01 '20 at 03:14
  • @john1999: Can you elaborate on what is going wrong with your code? Is it raising an exception? Giving the wrong results? Just taking too long? – Blckknght May 01 '20 at 03:16
  • So I keep getting this error every time I try to run the larger file through it: "RecursionError: maximum recursion depth exceeded while calling a Python object" – john1999 May 01 '20 at 03:20
  • @Niloct: That's not what answers are for. And caching is not useful for a sorting program. – Blckknght May 01 '20 at 03:24
  • This issue isn't really related to the size of the input; a larger size just lets the bug (which I haven't fully identified yet) cause recursion errors. The real issue is something related to the `n` value you pass to later calls. Often you end up sorting the same sub-list a whole bunch of times. – Blckknght May 01 '20 at 03:27
  • Yeah, so after some testing that is literally the issue. For some reason it keeps sorting the same sub-list even when it is already sorted. – john1999 May 01 '20 at 03:53
  • Must you use a quicksort? If you can use a heap or PriorityQueue, see my answer. – smci May 01 '20 at 04:12

2 Answers


Your random selection of an index to use for your pivot x uses the whole size of the input list a, not just the part you're supposed to be sorting on the current call. This means that very often your pivot won't be in the current section at all, so you can't usefully reduce your problem (all of the values end up on the same side of the pivot). That leads to lots and lots of recursion, and for larger inputs you'll almost always hit the recursion limit.

The fix is simple, just change how you get x:

x = a[random.randrange(i, i+n)]

I like randrange a lot better than randint, but you could use randint(i, i+n-1) if you prefer it.
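
Applying that one-line change to the function from the question gives the following sketch (everything else is the original three-way partition, with the unused `mid` line dropped):

import random

def quick_sort(a, i, n):
    # Sorts the n-element slice of a that starts at index i, in place.
    if n <= 1:
        return
    x = a[random.randrange(i, i + n)]  # pivot drawn from the current slice only
    p = i - 1
    j = i
    q = i + n
    while j < q:
        if a[j] < x:
            p += 1
            a[j], a[p] = a[p], a[j]
            j += 1
        elif a[j] > x:
            q -= 1
            a[j], a[q] = a[q], a[j]
        else:
            j += 1
    quick_sort(a, i, p - i + 1)    # recurse on the elements smaller than the pivot
    quick_sort(a, q, n - (q - i))  # recurse on the elements larger than the pivot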

Blckknght

Must you use a quicksort? If you can use a heapq or a PriorityQueue, the .get() (or heappop()) calls return the items in sorted order:

import sys
from queue import PriorityQueue

pq = PriorityQueue()

# Read words, one per line, from stdin
inp = open(sys.stdin.fileno(), newline='\n')

#inp = ['dag', 'Rug', 'gob', 'kex', 'mog', 'Wes', 'pox', 'sec', 'ego', 'wah'] # for testing
for word in inp:
    word = word.rstrip('\n')
    pq.put(word)        # the queue keeps its contents heap-ordered

while not pq.empty():
    print(pq.get())     # get() returns the smallest remaining word

Then test with some large random word input or file, e.g.:

shuf /usr/share/dict/words | ./word_pq.py

where shuf is GNU shuf (from coreutils).

smci
  • The `queue` module is for synchronized communication between multiple threads. For a single-threaded application like this one, you probably should use `heapq` directly (it's what is used by the `queue` module to implement `PriorityQueue`). – Blckknght May 01 '20 at 23:23
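
For reference, here is a minimal sketch of the heapq-based variant that comment suggests (single-threaded, no locking overhead); the word-splitting is illustrative and assumes whitespace-separated input on stdin:

import sys
import heapq

heap = []
for line in sys.stdin:
    for word in line.split():
        heapq.heappush(heap, word)  # maintain the heap invariant as words arrive

while heap:
    print(heapq.heappop(heap))      # heappop always returns the smallest remaining word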