
I have a list of lists, in which each inner list is a tokenized text, so its length is the number of words in that text.

corpus = [['this', 'is', 'text', 'one'], ['this', 'is', 'text', 'two']]

Now, I want to create a set that contains all unique tokens from the corpus. For the above example, the desired output would be:

{'this', 'is', 'text', 'one', 'two'}

Currently, I have:

from itertools import chain

all_texts_list = list(chain(*corpus))
vocabulary = set(all_texts_list)

But this seems like a memory-inefficient way of doing it, since it builds an intermediate list of every token first.

Is there a more efficient way to obtain this set?


I found this link, but it is about finding the set of unique lists, not the set of unique elements across the lists.

Emil

1 Answer


You can use a simple for loop with the set's update() method.

vocabulary = set()

# add every token of each inner list to the vocabulary in place
for tokens in corpus:
    vocabulary.update(tokens)

Output:

{'this', 'one', 'text', 'two', 'is'}
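
If you prefer a one-liner, a set comprehension (or set with itertools.chain.from_iterable) gives the same result without materializing an intermediate list. A minimal sketch, using the example corpus from the question:

from itertools import chain

corpus = [['this', 'is', 'text', 'one'], ['this', 'is', 'text', 'two']]

# Set comprehension: iterates over the tokens lazily, no intermediate list.
vocabulary = {token for tokens in corpus for token in tokens}

# Equivalent alternative: chain.from_iterable also streams the tokens.
vocabulary_alt = set(chain.from_iterable(corpus))

print(vocabulary == vocabulary_alt)  # True

All of these approaches are O(total number of tokens); the difference is only whether a temporary list of all tokens is created along the way.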
Vishal Singh