
I have a vocabulary of the form dic = {'a': 30, 'the': 29, ...}, where each key is a word and each value is that word's count.

I have some sentences, like:

"this is a test"

"an apple"

....

To tokenize the sentences, I want to encode each sentence as a list of word indices from the dictionary. If a word in the sentence exists in the dictionary, use that word's index; otherwise use 0.

For example, with the sentence dimension set to 6, a sentence shorter than 6 words is padded with 0s to length 6.

"this is the test" ----> [2, 0, 2, 4, 0, 0]

"an apple" ----> [5, 0, 0, 0, 0, 0]

Here is my sample code:

words=['the','a','an'] #use this list as my dictionary
X=[]

with open('test.csv','r') as infile:
    for line in infile:
        for word in line:
            if word in words:
                X.append(words.index(word))
            else: X.append(0)

My code has a problem: the output is not correct. In addition, I have no idea how to set the sentence dimension or how to pad.

Kun

1 Answer

There are a couple of issues with your code:

  1. Iterating over a line yields characters, not words, so you're tokenizing by character. You need to split each line into words

  2. You're appending into one large list, instead of a list of lists representing each sentence/line

  3. Like you said, you don't limit the size of the list

  4. I also don't understand why you're using a list as a dictionary

I edited your code below, and I think it aligns better with your specifications:

words = {'the': 2, 'a': 1, 'an': 3}
X = []

with open('test.csv', 'r') as infile:
    for line in infile:
        # Init the sublist to [0, 0, 0, 0, 0, 0]
        sub_X = [0] * 6

        # enumerate() pairs each word with its position in the line;
        # split() splits on whitespace when no argument is given
        for idx, word in enumerate(line.split()):
            # Only write positions that fit in the 6-slot sublist
            if word in words and idx < 6:
                sub_X[idx] = words[word]

        # X is the overall list; sub_X is one encoded sentence
        X.append(sub_X)
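
If you want to check the padding and truncation without a file, the same logic can be sketched as a small function. The names encode_sentence, max_len, and unknown are hypothetical and not part of the original code; this is only an illustration that assumes whitespace tokenization and 0 for both unknown words and padding:

```python
def encode_sentence(sentence, vocab, max_len=6, unknown=0):
    """Map each word to vocab[word], then pad or truncate to max_len."""
    # Keep at most max_len whitespace-separated tokens
    tokens = sentence.split()[:max_len]
    # Words missing from vocab get the `unknown` value
    encoded = [vocab.get(token, unknown) for token in tokens]
    # Pad short sentences on the right
    encoded += [unknown] * (max_len - len(encoded))
    return encoded


vocab = {'the': 2, 'a': 1, 'an': 3}
print(encode_sentence("this is a test", vocab))  # [0, 0, 1, 0, 0, 0]
print(encode_sentence("an apple", vocab))        # [3, 0, 0, 0, 0, 0]
```

Slicing the token list up front means a sentence longer than max_len is cut off rather than raising an IndexError.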
ajoseps
  • Thanks, it is working. Could you give me a hint on how to handle a specific column of the CSV file? For example, if I only want to read the second column. – Kun Oct 18 '16 at 03:11
  • If this is part of a larger project in which you're parsing many CSV files or large, complex ones, I would recommend using the [pandas](http://pandas.pydata.org/) module. It's great for accessing specific columns in a CSV. You can look at this [related stackoverflow](http://stackoverflow.com/questions/16503560/read-specific-columns-from-csv-file-with-python-csv) as well. – ajoseps Oct 18 '16 at 03:21
  • I used pandas earlier, but the process was killed when reading the whole file into memory. That's why I am using the regular way to read and process data. – Kun Oct 18 '16 at 12:23
  • Have you tried using a generator and reading one line at a time instead of the entire file in memory? – ajoseps Oct 18 '16 at 16:06
  • I read the file line by line in the regular way, but I don't know whether pandas supports reading data line by line – Kun Oct 18 '16 at 17:08
  • Use [chunksize](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv). Here is a relevant [stackoverflow](http://stackoverflow.com/questions/29334463/pandas-read-csv-file-line-by-line) – ajoseps Oct 18 '16 at 18:02
  • Thanks, it is really helpful – Kun Oct 18 '16 at 19:46
  • Hi, I ran your code. Why did it generate the word count instead of the word index? It is supposed to generate [0, 0, 1, 3, ...600], but right now it gives [87092, 9822, 0, 221212,....] – Kun Oct 20 '16 at 14:15
  • Do you have a sample input? I thought your intent was to create an n-dimensional list where each index represents the count of the word that was listed in the dictionary – ajoseps Oct 20 '16 at 14:39
  • I think I fixed that by using a way to get the index of the dictionary – Kun Oct 20 '16 at 14:41
  • Yes, use sub_X[idx] = order.keys().index(word), where order is created with collections.OrderedDict() – Kun Oct 20 '16 at 14:45
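
A note on the last two points in the thread: order.keys().index(word) relies on keys() returning a list, which holds in Python 2 but not in Python 3, and it rescans the keys on every lookup. One possible alternative is to precompute a word-to-position lookup once. The sketch below also reads only the second CSV column one row at a time with the standard csv module. The helper names (build_index, second_column, encode), the sample data, and the choice to start positions at 1 so that 0 stays reserved for unknown words and padding are all assumptions for illustration, not something the thread specifies:

```python
import csv
import io
from collections import OrderedDict


def build_index(counts):
    """Map each word to its 1-based position in `counts` (0 stays reserved)."""
    return {word: pos for pos, word in enumerate(counts, start=1)}


def second_column(rows):
    """Yield the second field of each CSV row, one row at a time."""
    for row in csv.reader(rows):
        if len(row) > 1:  # skip rows without a second column
            yield row[1]


def encode(sentence, index, max_len=6):
    """Encode a sentence as word positions, padded or truncated to max_len."""
    ids = [index.get(word, 0) for word in sentence.split()[:max_len]]
    return ids + [0] * (max_len - len(ids))


# Made-up sample counts; a real vocabulary would come from your data
counts = OrderedDict([('a', 30), ('the', 29), ('an', 5)])
index = build_index(counts)

# io.StringIO stands in for an open file so the sketch is self-contained
sample = io.StringIO("1,this is a test\n2,an apple\n")
X = [encode(text, index) for text in second_column(sample)]
print(X)  # [[0, 0, 1, 0, 0, 0], [3, 0, 0, 0, 0, 0]]
```

With a real file, passing the object returned by open('test.csv', newline='') to second_column should behave the same way, and only one row is held in memory at a time.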