I have a corpus of 30000 messages.
corpus = [
"hello world",
"i like mars",
"a planet called venus",
... ,
"it's all pcj500"]
I have tokenized them and formed a word_set
that contains all unique words.
word_lists = [text.split(" ") for text in corpus]
>>> [['hello', 'world'],
['i', 'like', 'mars'],
['a', 'planet', 'called', 'venus'],
...,
["it's", 'all', 'pcj500']]
word_set = set().union(*word_lists)
>>> {'hello', 'world', 'i', 'like', ..., 'pcj500'}
I am trying to:
- create a list of dictionaries, one per message, with every word in word_set as a key and an initial value of 0 for the count, and
- if a word in word_set appears in a word_list in word_lists, set the corresponding count as that word's value.
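To make the target concrete, this is the output I'm after on a toy three-message corpus (collections.Counter is used here only to fill in the counts for illustration; my actual step-1 attempt is below):

```python
from collections import Counter

corpus = ["hello world", "i like mars", "hello mars"]  # toy stand-in corpus
word_lists = [text.split(" ") for text in corpus]
word_set = set().union(*word_lists)

# Desired result: one dict per message, keyed by every word in word_set,
# with the per-message occurrence counts as values (0 if absent).
word_dicts = []
for words in word_lists:
    counts = dict.fromkeys(word_set, 0)  # step 1: all keys initialized to 0
    counts.update(Counter(words))        # step 2: fill in actual counts
    word_dicts.append(counts)

# e.g. word_dicts[0] == {'hello': 1, 'world': 1, 'i': 0, 'like': 0, 'mars': 0}
```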
For step 1, I am doing it this way:
tmp = corpus[:10]
word_dicts = []
for _ in tmp:
    word_dicts.append(dict.fromkeys(list(word_set)[:30], 0))
word_dicts
>>> [{'hello': 0,
'world': 0,
'mars': 0,
'venus': 0,
'explore': 0,
'space': 0,
...},
...]
Problem:
How can I perform the dict.fromkeys operation for every text in the corpus against all the items in word_set? For the whole corpus, I run out of memory. There must be a better way to do this, but I am not able to find it on my own.
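Concretely, the step-1 operation scaled to the whole corpus is the sketch below: one dict per document, each carrying every key in word_set, which is what exhausts memory at 30000 messages with a large vocabulary (a toy corpus stands in for the real one):

```python
corpus = ["hello world", "i like mars"]  # stand-in; real corpus has 30000 messages
word_lists = [text.split(" ") for text in corpus]
word_set = set().union(*word_lists)

# Every dict repeats all of word_set's keys with value 0, so the total number
# of stored entries grows as len(corpus) * len(word_set) -- the source of the
# memory blow-up at full scale.
word_dicts = [dict.fromkeys(word_set, 0) for _ in corpus]
```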