
I have a file with about 3 million sentences. Each sentence has around 60 words. I want to combine all the words and find the unique words among them.

I tried the following code:

    import nltk
    from nltk.corpus import stopwords

    # sentence_list holds the ~3 million sentences
    final_list = list()
    for sentence in sentence_list:
        words_list = nltk.word_tokenize(sentence)
        words = [word for word in words_list if word not in stopwords.words('english')]
        final_list = final_list + list(set(words))

This code gives unique words, but it's taking too long to process: around 50k sentences per hour, so it might take about 3 days to process the whole file.

I tried with a lambda function too:

    final_list = list(map(lambda x: list(set([word for word in sentence])), sentence_list))

But there is no significant improvement in execution time. Please suggest a faster way to do this; parallel processing suggestions are welcome.

Bharath kumar k
  • If words are already the elements of sentence, why do you need a list comprehension `[word for word in sentence]`? Why not just run `set(sentence)` directly? – dmh Dec 07 '18 at 08:41
  • Because each sentence is a continuous string, I have to tokenize it first. I also have a condition to apply before sending the words to the list. – Bharath kumar k Dec 07 '18 at 08:43
  • Ah, thanks for updating the example :) – dmh Dec 07 '18 at 08:50

1 Answer


You need to do it all lazily and with as few intermediate lists as possible (reducing allocations and processing time). All unique words from a file:

    import itertools

    def unique_words_from_file(fpath):
        with open(fpath, "r") as f:
            # Lazily split each line into words, flatten the result,
            # and let set() keep one copy of each word.
            return set(itertools.chain.from_iterable(map(str.split, f)))

Let's explain the ideas here.

File objects are iterable objects, which means that you can iterate over the lines of a file!
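
For example, if our file were called sentences.txt (a placeholder name, just for illustration), we could walk it line by line without ever loading the whole thing into memory:

    # "sentences.txt" is a hypothetical file name used only for illustration.
    with open("sentences.txt", "r") as f:
        for line in f:                # lines are read lazily, one at a time
            print(len(line.split()))  # e.g. count the words on this line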

Then we want the words from each line, which means splitting each line. Here we use map in Python 3 (or itertools.imap in Python 2) to create an object that performs that computation over our file's lines. map and imap are also lazy, which means that no intermediate list is allocated by default, and that is great because we will not spend any resources on something we don't need!
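
A small sketch of that laziness, again assuming the hypothetical sentences.txt:

    with open("sentences.txt", "r") as f:
        words_per_line = map(str.split, f)       # nothing is read yet: map just wraps the file
        first_line_words = next(words_per_line)  # only now is the first line read and split
        print(first_line_words)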

Since str.split returns a list, our map result would be a succession of lists of strings, but we need to iterate over each of those strings. There is no need to build another list for that; we can use itertools.chain.from_iterable to flatten the result!
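
A tiny standalone example of what that flattening does (the word lists are made up):

    import itertools

    lists_of_words = [["a", "quick", "fox"], ["a", "lazy", "dog"]]
    flat = itertools.chain.from_iterable(lists_of_words)
    print(list(flat))  # ['a', 'quick', 'fox', 'a', 'lazy', 'dog']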

Finally, we call set, which will iterate over those words and keep just a single copy of each. Voilà!
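
Continuing that toy list of words, set collapses the duplicates:

    print(set(["a", "quick", "fox", "a", "lazy", "dog"]))
    # e.g. {'a', 'quick', 'fox', 'lazy', 'dog'}  (set order is arbitrary)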

Let's make an improvement! Can we make str.split lazy as well? Yes! Check this SO answer:

    import itertools
    import re

    def split_iter(string):
        # Generator expression: yields one word at a time, no intermediate list.
        return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

    def unique_words_from_file(fpath):
        with open(fpath, "r") as f:
            return set(itertools.chain.from_iterable(map(split_iter, f)))
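
Usage is the same for either version; for example, with a hypothetical path:

    unique_words = unique_words_from_file("sentences.txt")  # hypothetical path
    print(len(unique_words), "unique words")
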
Netwave