I have a file with roughly 3 million sentences, each around 60 words long. I want to combine all the words and find the unique words among them.
I tried the following code:
import nltk
from nltk.corpus import stopwords

final_list = list()
for sentence in sentence_list:
    words_list = nltk.word_tokenize(sentence)
    # drop English stopwords, then append this sentence's unique words
    words = [word for word in words_list if word not in stopwords.words('english')]
    final_list = final_list + list(set(words))
This code gives unique words, but it's taking too long to process: around 50k sentences per hour, so the full 3 million sentences would need roughly 60 hours, i.e. close to 3 days.
I also tried a lambda-based version:
final_list = list(map(lambda sentence: list(set([word for word in nltk.word_tokenize(sentence)])), sentence_list))
But there is no significant improvement in execution time. Please suggest a more time-efficient solution; parallel-processing suggestions are welcome.
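For the parallel route, this is roughly the direction I was thinking of (just a rough sketch, untested at full scale; it assumes sentence_list is already loaded in memory and that the NLTK punkt and stopwords data are downloaded):

from multiprocessing import Pool
import nltk
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words('english'))  # build the stopword set once

def unique_words(sentence):
    # tokenize one sentence and drop stopwords
    return {w for w in nltk.word_tokenize(sentence) if w not in STOP_WORDS}

# sentence_list is assumed to be loaded above (e.g. read from the file)

if __name__ == '__main__':
    with Pool() as pool:
        per_sentence = pool.map(unique_words, sentence_list, chunksize=10000)
    # union the per-sentence sets into one global set of unique words
    all_unique = set().union(*per_sentence)

Is this roughly the right direction, or is there a better way to merge the per-sentence results?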