
I have a corpus of 30000 messages.

corpus = [
    "hello world", 
    "i like mars", 
    "a planet called venus", 
    ... , 
    "it's all pcj500"]

I have tokenized them and formed a word_set that contains all unique words.

word_lists = [text.split(" ") for text in corpus]
>>> [['hello', 'world'],
    ['i', 'like', 'mars'],
    ['a', 'planet', 'called', 'venus'],
    ...,
    ["it's", 'all', 'pcj500']]

word_set = set().union(*word_lists)
>>> {'hello', 'world', 'i', 'like', ..., 'pcj500'}
  1. I am trying to create a list of dictionaries, one per message, with the words in word_set as keys and an initial count of 0 as the values.
  2. Then, if a word in word_set appears in a message's word_list, the appropriate count becomes the value (see the sketch after this list).
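
To make the goal concrete, the desired structure for the first two messages would look roughly like this (illustrative values only; in reality every word in word_set would appear as a key):

# illustrative sketch -- keys truncated here for readability
word_dicts = [
    {'hello': 1, 'world': 1, 'i': 0, 'like': 0, 'mars': 0},  # "hello world"
    {'hello': 0, 'world': 0, 'i': 1, 'like': 1, 'mars': 1},  # "i like mars"
]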

For step 1, I am doing it this way:

# truncated to the first 10 texts and the first 30 words while testing
tmp = corpus[:10]
word_dicts = []
for i in range(len(tmp)):
    word_dicts.append(dict.fromkeys(list(word_set)[:30], 0))

word_dicts
>>> [{'hello': 0,
  'world': 0,
  'mars': 0,
  'venus': 0,
  'explore': 0,
  'space': 0,
  ...},
 ...]

Problem:

How can I perform the dict.fromkeys operation for every text in the corpus against all the items in word_set? When I run it for the whole corpus, I run out of memory. There should be a better way to do this, but I am not able to find it on my own.
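
As a rough back-of-envelope check (the vocabulary size here is an assumed figure, not given in the question): with 30000 messages and, say, 10,000 unique words, pre-filling a full dict per message means 30,000 × 10,000 = 300,000,000 entries, almost all zeros, and at tens of bytes per dict entry that is on the order of 10 GB.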

Ken

1 Answer


You could use defaultdict or Counter from collections. Neither needs the keys up front: a Counter stores only the words that actually occur in a given message, and any word it has never seen counts as 0 when looked up. Example:

from collections import Counter

# One Counter per message; only the words present in that message are stored.
word_dicts = [Counter(words_list) for words_list in word_lists]
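
A Counter behaves like a regular dict, except that looking up a word it has never seen returns 0 instead of raising KeyError, so the zero pre-fill step is unnecessary:

word_dicts[0]['hello']
>>> 1
word_dicts[0]['mars']
>>> 0

If you prefer the defaultdict route, a minimal equivalent sketch is:

from collections import defaultdict

word_dicts = []
for words_list in word_lists:
    counts = defaultdict(int)  # missing keys start at 0
    for word in words_list:
        counts[word] += 1
    word_dicts.append(counts)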
Marat