I'm having some issues with a query I wrote in Python (have to use it for TensorFlow), which works just fine but is way too slow, since the input dataset is quite big. It can take over 5 minutes for the query to complete, and checking the Task Manager I can confirm it's indeed running on a single core.
Here's the code:
# Assume words is a list of strings
for i, pair in enumerate(sorted(
        ((word, words.count(word))  # Map each word to itself and its count
         for word in set(words)),   # Set of unique words (remove duplicates)
        key=lambda p: p[1],         # Order by the frequency of each word
        reverse=True)):             # Descending order - less frequent words last
    # Do stuff with each sorted pair
    pass
What I'm doing here is just taking the input list words, getting rid of the duplicates, then sorting the words in descending order of their frequency in the input text.
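For comparison, the same ordering can be sketched with collections.Counter from the standard library. It tallies every word in a single pass, whereas words.count(word) rescans the whole list once per unique word. The sample words list and the name ranked are placeholders for illustration only, and words with equal counts may come out in a different order than with set(words).

```python
from collections import Counter

# Hypothetical sample input; the real list is much larger
words = ["b", "a", "b", "c", "a", "b"]

# Counter tallies all words in one pass; most_common() returns
# (word, count) pairs sorted by count, highest first
ranked = Counter(words).most_common()

for i, pair in enumerate(ranked):
    # Do stuff with each sorted pair
    print(i, pair)
```

This stays on a single core, so it only addresses the repeated counting, not the parallelism.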
If I were to write this in C# using PLINQ, I'd do something like this:
var query = words.AsParallel().Distinct()
    .OrderByDescending(w => words.Count(s => s.Equals(w)))
    .Select((w, i) => (w, i));
I couldn't find an easy way to rewrite the parallel implementation in Python, ideally using only built-in libraries. I saw some guides about the Pool extension, but that looks like it's just an equivalent of the parallel Select operation, so I still don't see how to implement the Distinct and OrderByDescending operations in Python, in parallel.
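For illustration, one way the Distinct and OrderByDescending steps could be mapped onto multiprocessing.Pool is to split the list into chunks, count each chunk in a worker process, merge the partial counts, and sort once at the end. This is only a sketch under assumptions: the helper names split_into_chunks, merge_counts and parallel_ranking are hypothetical, the worker count of 4 is arbitrary, and the sample input is made up.

```python
from collections import Counter
from multiprocessing import Pool


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous slices of similar size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def merge_counts(partial_counts):
    """Add up the per-chunk Counters into one total."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def parallel_ranking(words, workers=4):
    """Count each chunk in a separate process, then merge and sort."""
    chunks = split_into_chunks(words, workers)
    with Pool(workers) as pool:
        # Each worker builds a Counter for one chunk
        partial_counts = pool.map(Counter, chunks)
    return merge_counts(partial_counts).most_common()


if __name__ == "__main__":  # needed on Windows, where workers re-import this module
    words = ["b", "a", "b", "c", "a", "b"] * 1000  # hypothetical sample input
    for i, pair in enumerate(parallel_ranking(words)):
        print(i, pair)
```

Whether this beats a single-process count depends on the data, since every chunk has to be pickled and sent to its worker before any counting happens.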
Is it possible to do this with built-in libraries, or are there commonly used 3rd party libraries to do this?
Thanks!