I have a vocabulary of the form dic = {'a':30, 'the':29,....}, where each key is a word and each value is that word's count.
I have some sentences, like:
"this is a test"
"an apple"
....
To tokenize the sentences, I want to encode each sentence as a list of word indices from the dictionary. If a word in the sentence exists in the dictionary, use that word's index; otherwise use 0.
For example, if I set the sentence dimension to 6 and a sentence has fewer than 6 words, it should be padded with 0s up to length 6.
"this is the test" ----> [2, 0, 2, 4, 0, 0]
"an apple" ----> [5, 0, 0, 0, 0, 0]
Here is my sample code:
words = ['the', 'a', 'an']  # use this list as my dictionary
X = []
with open('test.csv', 'r') as infile:
    for line in infile:
        for word in line:
            if word in words:
                X.append(words.index(word))
            else:
                X.append(0)
My code has a problem: the output is not correct. In addition, I don't know how to set the sentence dimension or how to do the padding.
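To make the intended behaviour concrete, here is a minimal sketch of the encoding and padding described above. It rests on several assumptions that are not settled in the question: words are separated by whitespace, indices start at 1 so that 0 stays reserved for unknown words and padding (with `words.index`, 'the' would itself map to 0), and sentences longer than the dimension are truncated. The names `build_index`, `encode` and `max_len` are hypothetical, and the sentences are kept in a list instead of being read from test.csv.

```python
def build_index(vocab):
    # Assumption: 1-based indices, so 0 can mean "unknown word" or "padding".
    return {word: i for i, word in enumerate(vocab, start=1)}


def encode(sentence, word_to_index, max_len=6):
    # Split on whitespace to get words (iterating over the string itself
    # would yield single characters, not words).
    ids = [word_to_index.get(word, 0) for word in sentence.split()]
    # Assumption: sentences longer than max_len are truncated.
    ids = ids[:max_len]
    # Pad with 0s on the right up to max_len.
    return ids + [0] * (max_len - len(ids))


words = ['the', 'a', 'an']  # stand-in for the real vocabulary
word_to_index = build_index(words)

sentences = ["this is a test", "an apple"]
# One list per sentence, instead of appending everything to a single flat list.
X = [encode(s, word_to_index) for s in sentences]
print(X)  # [[0, 0, 2, 0, 0, 0], [3, 0, 0, 0, 0, 0]]
```

With the real dictionary, `build_index(dic)` would number the words in the dictionary's iteration order; whether that order matches the intended indices is something this sketch does not decide.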