
I have a text dataset. It consists of many lines, and each line contains two sentences separated by a tab, like this:

this is string 1, first sentence.    this is string 2, first sentence.
this is string 1, second sentence.    this is string 2, second sentence.

I then split the text with this code:

# file readdata.py
from globalvariable import *
import os


class readdata:
    def dataAyat(self):
        global kalimatayat
        # raw string so the backslash in the Windows path is not read as an escape
        with open(os.path.join(r'E:\dataset', 'dataset.txt'), 'r') as fo:
            for line in fo:
                # split the line into its two tab-separated sentences
                datatxt = line.rstrip('\n').split('\t')
                # wrap each sentence in its own one-element list
                newdatatxt = [x.split('\t') for x in datatxt]
                kalimatayat.append(newdatatxt)
                print(newdatatxt)

readdata().dataAyat()

It works, and the output is:

[['this is string 1, first sentence.'],['this is string 2, first sentence.']]
[['this is string 1, second sentence.'],['this is string 2, second sentence.']]

Now I want to tokenize those lists with NLTK's word_tokenize, and the output I expect looks like this:

[['this' , 'is' , 'string' , '1' , ',' , 'first' , 'sentence' , '.'],['this' , 'is' , 'string' , '2' , ',' , 'first' , 'sentence' , '.']]
[['this' , 'is' , 'string' , '1' , ',' , 'second' , 'sentence' , '.'],['this' , 'is' , 'string' , '2' , ',' , 'second' , 'sentence' , '.']]

Does anybody know how to tokenize it into the output above? I want to write the tokenize function in "tokenizer.py" and call it all from "mainfile.py".


1 Answer

To tokenize the list of sentences, iterate over it and store the results in a list:

import nltk

data = [[['this is string 1, first sentence.'], ['this is string 2, first sentence.']],
        [['this is string 1, second sentence.'], ['this is string 2, second sentence.']]]
results = []
for sentence in data:
    sentence_results = []
    for s in sentence:
        # each s is a one-element list, so tokenize the string inside it
        sentence_results.append(nltk.word_tokenize(s[0]))
    results.append(sentence_results)

`results` will be something like:

[[['this' , 'is' , 'string' , '1' , ',' , 'first' , 'sentence' , '.'],  
  ['this' , 'is' , 'string' , '2' , ',' , 'first' , 'sentence' , '.']], 
[['this' , 'is' , 'string' , '1' , ',' , 'second' , 'sentence' , '.'],
  ['this' , 'is' , 'string' , '2' , ',' , 'second' , 'sentence' , '.']]]
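
Note that `nltk.word_tokenize` relies on NLTK's Punkt tokenizer models; if they are not installed yet, running `nltk.download('punkt')` once will fetch them.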
  • Pardon me, sir, I want to ask again: if readdata and tokenizer live in two separate .py files (readdata.py and tokenizer.py) and I want to combine them in a main file (mainfile.py), how should I write tokenizer.py and mainfile.py so they tokenize like the results above? – sang Mar 08 '17 at 09:23
  • Uhm I'm not sure I understand your question. Firstly, you need to get the text as something that can be handled. In your `readData` function you seem to store it in a global `kalimatayat` list (which should _definitely_ be replaced by a class member). That variable is basically what I called `data` in the example I gave in my answer. – GPhilo Mar 08 '17 at 09:30
  • Hmmm, so "data" is "kalimatayat". OK sir, I'll try. Anyway, thanks for the answer :) – sang Mar 08 '17 at 09:36
  • I have edited the question, sir; you may check it again. – sang Mar 08 '17 at 09:43
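
Following up on the comments, here is a minimal sketch of how tokenizer.py and mainfile.py could be wired together. It assumes globalvariable.py defines an empty kalimatayat list (as the question's code implies); the function name tokenize and the overall layout are illustrative, not the asker's actual files:

# file tokenizer.py
import nltk

def tokenize(data):
    # data is a list of [['sentence 1 ...'], ['sentence 2 ...']] pairs,
    # i.e. the structure readdata builds up in kalimatayat
    results = []
    for pair in data:
        results.append([nltk.word_tokenize(s[0]) for s in pair])
    return results

# file mainfile.py
from globalvariable import kalimatayat
from readdata import readdata
from tokenizer import tokenize

readdata().dataAyat()        # fills the shared kalimatayat list
print(tokenize(kalimatayat))

This works because readdata appends to the same list object that globalvariable owns, so mainfile sees the filled list after dataAyat() runs; as GPhilo notes, storing it as a class member instead of a global would be cleaner.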