
I'm having some issues with a query I wrote in Python (I have to use it for TensorFlow): it works just fine but is way too slow, since the input dataset is quite big. It can take over 5 minutes for the query to complete, and checking the Task Manager I can confirm it's indeed running on a single core.

Here's the code:

# Assume words is a list of strings
for i, pair in enumerate(sorted(
    ((word, words.count(word))          # Map each word to itself and its count
        for word in set(words)),        # Set of unique words (remove duplicates)
    key=lambda p: p[1],                 # Order by the frequency of each word
    reverse=True)):                     # Descending order - less frequent words last

    pass  # Do stuff with each sorted pair

What I'm doing here is just taking the input list `words`, removing the duplicates, then sorting the unique words in descending order based on their frequency in the input text.
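
To make the expected result concrete, here's a tiny made-up input (illustrative only, the real input is much bigger) and the pairs the loop above iterates over:

words = ["the", "cat", "the", "dog", "the", "cat"]

# The loop above yields these (index, (word, count)) pairs:
# (0, ('the', 3)), (1, ('cat', 2)), (2, ('dog', 1))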

If I were to write this in C# using PLINQ, I'd do something like this:

var query = words.AsParallel().Distinct()
            .OrderByDescending(w => words.Count(s => s.Equals(w)))
            .Select((w, i) => (w, i));

I couldn't find an easy way to write a parallel implementation of this in Python using the built-in libraries. I saw some guides about multiprocessing.Pool, but that looks like it's just an equivalent of the parallel Select operation, so I'd still be missing how to implement the Distinct and OrderByDescending operations in parallel in Python.

Is it possible to do this with built-in libraries, or are there commonly used 3rd party libraries to do this?

Thanks!

Sergio0694

1 Answer

The problem with your current approach is mostly the `words.count(word)` call inside the loop: it iterates the entire list once for every unique word in `set(words)`, just to count a single word each time. Instead, you can use `Counter` and make a single pass over your list. The `Counter` object is a dictionary, which you can use as the key in sorting, with O(1) frequency lookups. Even for 1000 "words" in my example, the speed-up is dramatic... for longer inputs, I got bored waiting for timeit to finish :)

import string
from collections import Counter
import numpy as np # Just to create fake data

# Create some fake data
letters = list(string.ascii_lowercase)
new_words = [''.join(list(np.random.choice(letters, 3, replace=True))) 
             for x in range(1000)]


def original(search_list):
    """ Your current approach """
    for i, pair in enumerate(sorted(
            ((word, search_list.count(word))
             for word in set(search_list)),
            key=lambda p: p[1],
            reverse=True)):
        pass


def new_approach(search_list):
    """ Count once with Counter, then de-duplicate while preserving the sort order """
    freq = Counter(search_list)
    search_list = sorted(search_list, key=lambda x: freq[x], reverse=True)
    new_list = []
    checked = set()
    for item in search_list:
        if item not in checked:      # Keep only the first occurrence of each word
            new_list.append(item)
            checked.add(item)

For a list of 1000 "words":

%timeit original(new_words)
26.6 ms ± 289 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


%timeit new_approach(new_words)
833 µs ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

You should see whether this new approach is suitable for your needs before reaching for something like multiprocessing, as that may add extra code complexity that isn't necessary once you fix the time-complexity issue.
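
That said, if the counting step ever does become the bottleneck again, the Counter approach parallelizes naturally: split the list into chunks, count each chunk in a worker process, and merge the partial Counters. This is only a rough sketch assuming Python 3 and nothing beyond the standard library; the function names, chunk size and process count are made up for illustration:

from collections import Counter
from functools import reduce
from multiprocessing import Pool
import operator


def count_chunk(chunk):
    """ Count word frequencies in one chunk of the list """
    return Counter(chunk)


def parallel_counts(search_list, processes=4, chunk_size=10000):
    """ Count chunks in worker processes, then merge the partial Counters """
    chunks = [search_list[i:i + chunk_size]
              for i in range(0, len(search_list), chunk_size)]
    with Pool(processes) as pool:
        partial_counts = pool.map(count_chunk, chunks)
    return reduce(operator.add, partial_counts, Counter())


if __name__ == '__main__':   # Guard required by multiprocessing on Windows
    freq = parallel_counts(new_words)   # new_words from the fake data above
    for i, word in enumerate(sorted(freq, key=lambda x: freq[x], reverse=True)):
        pass   # Do stuff with each (index, word) pair

Note that only the counting runs in parallel here; the final sort is still sequential.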

EDIT:

As pointed out by the OP, we can skip the intermediate list and set by simply sorting the Counter object:

def new_approach(search_list):
    freq = Counter(search_list)
    search_list = enumerate(sorted(freq, key=lambda x: freq[x], reverse=True))

New timing:

%timeit new_approach(new_words)
438 µs ± 6.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
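
As a side note, `Counter` also has a built-in `most_common()` method: called without arguments it returns all the (word, count) pairs already sorted by descending count, so the whole loop can be written as, for example:

def with_most_common(search_list):
    for i, pair in enumerate(Counter(search_list).most_common()):
        pass   # pair is already a (word, count) tuple, most frequent first
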
roganjosh
  • Thanks, the `Counter` class is perfect to solve this issue! I don't understand the code you posted though, I mean, why do you call `sorted` on the original `search_list`, which contains duplicates? Wouldn't it be better to just call `sorted` on the `freq` object (since a dictionary can be iterated), and then just `enumerate` on that sorted list? After all, I just need to iterate those unique words in descending order, I'm not sure why you're creating those two additional `new_list` and `checked` variables. – Sergio0694 Apr 08 '18 at 16:16
  • @Sergio0694 good question! First see [this](https://stackoverflow.com/questions/15479928/why-is-the-order-in-dictionaries-and-sets-arbitrary/15479974)... I then looked into implementations of ordered sets and it adds a _lot_ of complexity I didn't think was necessary. The latest versions of Python are preserving order as an implementation detail, not something to rely on... that's changing in the releases around now (3.6/3.7) – roganjosh Apr 08 '18 at 16:20
  • I wasn't talking about the sort order of the dictionary (as dictionaries, by definition, don't have a specific key order), what I mean is: why don't you just call `enumerate(sorted(freq, key=lambda x: freq[x], reverse=True))` instead of iterating the original list, and then creating those two additional lists/sets later on? The sorted list returned by `sorted` should already be what I was looking for (regardless of the specific order of the dictionary, which depends on the specific implementation, as you said), right? – Sergio0694 Apr 08 '18 at 16:23
  • @Sergio0694 I think you might well be correct on that, I can't test in 2.7 right now. Until I can be sure of that, I'd still have to generate a list of keys from the dictionary and then sort that, so I'm not sure you gain much. I'm reluctant to update answer till I can test. – roganjosh Apr 08 '18 at 16:33
  • 1
    @Sergio0694 I can't see how it would differ in P 2.7 actually, so I will update the answer with timings. Thanks for pointing it out. – roganjosh Apr 08 '18 at 16:38