Python sentence tokenizing according to the word index of a dictionary

Question

I have a vocabulary of the form dic = {'a':30, 'the':29,....}, where the key is the word and the value is its word count.

I have some sentences, like:

"this is a test"

"an apple"

....

To tokenize the sentences, each sentence should be encoded as the word indices from the dictionary. If a word in a sentence also exists in the dictionary, use that word's index; otherwise set the value to 0.

For example, if I set the sentence dimension to 6 and a sentence is shorter than 6 words, it should be padded with 0s to make it 6-dimensional.

"this is the test" ----> [2, 0, 2, 4, 0, 0]

"an apple" ----> [5, 0, 0, 0, 0, 0]

Here is my sample code:

words=['the','a','an'] #use this list as my dictionary
X=[]

with open('test.csv','r') as infile:
    for line in infile:
        for word in line:
            if word in words:
                X.append(words.index(word))
            else: X.append(0)

My code has a problem: the output is not correct. In addition, I have no idea how to set the sentence dimension or how to pad.

You've explained what you are trying to solve. What is the problem you are facing? Have you started implementing this yet? — idjaw, Oct 18 '16 at 01:27

score 1 · Answer 1 · answered Oct 18 '16 at 02:46


There are a couple of issues with your code:

You're not tokenizing on words but on characters: iterating over a line yields one character at a time, so you need to split each line into words.
You're appending everything to one large list instead of building a list of lists, one per sentence/line.
Like you said, you don't limit the size of the list.
I also don't understand why you're using a list as a dictionary.

I edited your code below, and I think it aligns better with your specifications:

words = {'the': 2, 'a': 1, 'an': 3}
X = []

with open('test.csv', 'r') as infile:
    for line in infile:
        # Inits the sublist to [0, 0, 0, 0, 0, 0]
        sub_X = [0] * 6

        # Enumerates each word in the list with an index
        # split() splits a string by whitespace if no arg is given
        for idx, word in enumerate(line.split()):
            if word in words:
                # Check if the idx is within bounds before accessing
                if idx < 6:
                    sub_X[idx] = words[word]

        # X represents the overall list and sub_X the sentence
        X.append(sub_X)
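
The same per-line logic can also be pulled into a small function, which makes the padding and truncation steps easier to test without a file. This is only a sketch: the name encode_sentence and the max_len parameter are made up for illustration, and the mapping is the same toy dictionary as above.

```python
def encode_sentence(sentence, word_index, max_len=6):
    """Map each word to its index (0 if unknown), then pad or truncate to max_len."""
    # split() with no argument splits on any run of whitespace
    ids = [word_index.get(word, 0) for word in sentence.split()]
    # Truncate long sentences, then right-pad short ones with zeros
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


word_index = {'the': 2, 'a': 1, 'an': 3}  # same toy mapping as above
print(encode_sentence("this is a test", word_index))  # [0, 0, 1, 0, 0, 0]
print(encode_sentence("an apple", word_index))        # [3, 0, 0, 0, 0, 0]
```

Using dict.get with a default of 0 replaces the explicit membership check, and slicing before padding means sentences longer than max_len are cut off rather than raising an IndexError.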

ajoseps


Thanks, it is working. Could you give me a hint on how to handle a specific column of the CSV file? For example, if I only want to read the second column. – Kun Oct 18 '16 at 03:11
If this is part of a larger project in which you're parsing many CSV files or large, complex ones, I would recommend using the [pandas](http://pandas.pydata.org/) module. It's great for accessing specific columns in a CSV. You can look at this [related Stack Overflow question](http://stackoverflow.com/questions/16503560/read-specific-columns-from-csv-file-with-python-csv) as well. – ajoseps Oct 18 '16 at 03:21
I used pandas earlier, but the process was killed when reading the whole file into memory. That's why I am using the regular way to read and process data. – Kun Oct 18 '16 at 12:23
Have you tried using a generator and reading one line at a time instead of the entire file in memory? – ajoseps Oct 18 '16 at 16:06
I read the file line by line in the regular way, but I have no idea whether pandas supports reading data line by line. – Kun Oct 18 '16 at 17:08
Use [chunksize](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv). Here is a relevant [Stack Overflow question](http://stackoverflow.com/questions/29334463/pandas-read-csv-file-line-by-line). – ajoseps Oct 18 '16 at 18:02
Thanks, it is really helpful – Kun Oct 18 '16 at 19:46
Hi, I ran your code. Why did it generate the word count instead of the word index? It is supposed to generate [0, 0, 1, 3, ...600], but right now it gives [87092, 9822, 0, 221212,....] – Kun Oct 20 '16 at 14:15
Do you have a sample input? I thought your intent was to create an n-dimensional list where each index represents the count of the word that was listed in the dictionary. – ajoseps Oct 20 '16 at 14:39
I think I fixed that by finding a way to get the index from the dictionary. – Kun Oct 20 '16 at 14:41
Yes, use sub_X[idx] = order.keys().index(word), where order is generated by collections.OrderedDict(). – Kun Oct 20 '16 at 14:45
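
For later readers, the two follow-ups from the comments (reading only the second column without loading the whole file, and mapping words to indices rather than counts) might be sketched with the standard library alone. Everything named here is an assumption for illustration: build_word_index and encode_column are hypothetical helpers, the index is taken to be the word's rank by descending count (the thread does not say how the OrderedDict was ordered), and the count for 'an' is invented. Precomputing a word-to-index dict also avoids a linear keys().index(word) lookup per word, which on Python 3 would additionally need list(order.keys()).

```python
import csv
import io


def build_word_index(counts):
    """Rank words by descending count; index 0 is reserved for unknown words."""
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ranked, start=1)}


def encode_column(rows, word_index, column=1, max_len=6):
    """Yield one fixed-length index list per CSV row, reading lazily."""
    for row in csv.reader(rows):
        if len(row) <= column:
            continue  # skip rows that lack the requested column
        ids = [word_index.get(w, 0) for w in row[column].split()][:max_len]
        yield ids + [0] * (max_len - len(ids))


counts = {'a': 30, 'the': 29, 'an': 5}  # hypothetical counts
word_index = build_word_index(counts)   # {'a': 1, 'the': 2, 'an': 3}

# io.StringIO stands in for open('test.csv', newline='')
sample = io.StringIO("1,this is the test\n2,an apple\n")
for encoded in encode_column(sample, word_index):
    print(encoded)
# [0, 0, 2, 0, 0, 0]
# [3, 0, 0, 0, 0, 0]
```

Because encode_column is a generator over csv.reader, only one row is held in memory at a time, which is the point of the line-by-line suggestion above.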
