
I have a file of 65,000 docs and their contents. I have split this file into two data sets, training and test. I want to break the training data set into small files by number of lines and train my model on them, but the code keeps producing only the first chunk over and over. Most probably I am consuming an already-used generator each time. I have posted the code below for reference. Any improvement or spotted logical error will be much appreciated. Thanks. Code to create the training and test data sets:

import itertools

fo = open('desc_py_output.txt', 'rb')

def generate_train_test(doc_iter, size):
    while True:
        data = [line for line in itertools.islice(doc_iter, size)]
        if not data:
            break
        yield data

for i, line in enumerate(generate_train_test(fo, 50000)):
    if i == 0:
        training_data = line
    else:
        test_data = line

Now I am trying to create small files of 5000 docs using the following code:

def generate_in_chunks(doc_iter, size):
    while True:
        data = [line for line in itertools.islice(doc_iter, size)]
        if not data:
            break
        yield data

for i, line in enumerate(generate_in_chunks(training_data, 5000)):
    x = [member.split('^')[2] for member in line]
    y = [member.split('^')[1] for member in line]
    print(x[0])

This is printing the same documents again and again.


1 Answer


The generate_train_test function yields lists, so in your generate_in_chunks function doc_iter is a list, not an iterator. A list does not get consumed, so the islice always starts again from the beginning. Make sure doc_iter is an iterator and it will work. Also, it looks like you can use the same function for both steps.
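
To make that concrete, here is a small illustrative snippet (mine, not part of the original answer) comparing islice over a list with islice over an iterator:

import itertools

data = ['a', 'b', 'c', 'd']

# a list is not consumed: every islice starts from the beginning again
print(list(itertools.islice(data, 2)))   # ['a', 'b']
print(list(itertools.islice(data, 2)))   # ['a', 'b'] again

# an iterator is consumed: each islice continues where the previous one stopped
it = iter(data)
print(list(itertools.islice(it, 2)))     # ['a', 'b']
print(list(itertools.islice(it, 2)))     # ['c', 'd']

With doc_iter turned into a real iterator, one function can serve both the train/test split and the 5,000-line chunks: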

def chunkify(doc_iter, size):
    doc_iter = iter(doc_iter) # make sure doc_iter really is an iterator
    while True:
        data = [line for line in itertools.islice(doc_iter, size)]
        if not data:
            break
        yield data
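
For completeness, here is a sketch of how the fixed function could be used end to end. The file name and the '^' field layout are taken from the question; opening the file in text mode and using print() are my assumptions so the snippet also runs on Python 3:

import itertools

with open('desc_py_output.txt') as fo:
    splits = chunkify(fo, 50000)
    training_data = next(splits)      # first 50,000 lines
    test_data = next(splits, [])      # remaining lines, if any

for chunk in chunkify(training_data, 5000):
    x = [member.split('^')[2] for member in chunk]
    y = [member.split('^')[1] for member in chunk]
    print(x[0])                       # a different document for each chunk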

Alternatively, you could yield a generator instead of a list, but this will only work if you consume each chunk before requesting the next one (otherwise you'd get into an infinite loop). In that case, you could use something like the sketch below.
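
The answer's second snippet is not reproduced here, so the following is only an assumed reconstruction of that idea; the helper name chunkify_lazy and the next/chain peek trick are my own choices:

import itertools

def chunkify_lazy(doc_iter, size):
    doc_iter = iter(doc_iter)
    while True:
        # peek at one item to detect exhaustion without materialising a list
        try:
            first = next(doc_iter)
        except StopIteration:
            break
        # yield a generator over this item plus the next size-1 items;
        # it must be fully consumed before asking for the next chunk
        yield itertools.chain([first], itertools.islice(doc_iter, size - 1))

Because every yielded chunk reads from the same underlying iterator, a chunk that is not fully consumed simply leaves its remaining lines for the next chunk.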
