I have a file of 65,000 documents and their contents, which I have split into a training set and a test set. I want to break the training set into smaller chunks by number of lines and train my model on each chunk, but my code produces only the first chunk and keeps repeating it. Most probably I am consuming an already-used generator every time. I have posted the code below for reference. Any improvements or pointers to the logical error would be greatly appreciated. Thanks.

Code to create the training and test data sets:
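import itertools

fo = open('desc_py_output.txt','rb')

def generate_train_test(doc_iter,size):
    # Yield successive lists of up to `size` lines from doc_iter.
    while True:
        data = [line for line in itertools.islice(doc_iter, size)]
        if not data:
            break
        yield data

# First chunk (50,000 lines) is the training set, the remainder is the test set.
for i,line in enumerate(generate_train_test(fo,50000)):
    if(i==0):
        training_data = line
    else:
        test_data = line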
Now I am trying to split the training data into smaller chunks of 5,000 docs using the following code:
def generate_in_chunks(doc_iter,size):
    while True:
        data = [line for line in itertools.islice(doc_iter, size)]
        if not data:
            break
        yield data

for i,line in enumerate(generate_in_chunks(training_data,5000)):
    x = [member.split('^')[2] for member in line]
    y = [member.split('^')[1] for member in line]
    print x[0]
This prints the same documents again and again.
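
A minimal sketch of what may be going on, assuming `training_data` is a plain list (which it is after the split above, whereas `fo` is a file object): `itertools.islice` called on a list builds a fresh iterator on every call, so every slice starts again at the first element. Wrapping the input in `iter()` once should let successive slices advance. The names `chunks_restarting` and `chunks_advancing` and the sample data are hypothetical, used only for illustration.

```python
import itertools

def chunks_restarting(docs, size, max_chunks):
    # Mirrors the loop in the question: islice() receives the list itself,
    # so each call creates a new iterator starting at element 0.
    # max_chunks caps what would otherwise be an endless loop.
    out = []
    for _ in range(max_chunks):
        out.append(list(itertools.islice(docs, size)))
    return out

def chunks_advancing(docs, size):
    # Create one iterator up front; each islice() call then continues
    # where the previous one stopped.
    it = iter(docs)
    while True:
        data = list(itertools.islice(it, size))
        if not data:
            break
        yield data

# Hypothetical stand-in for training_data.
sample = ['doc%d' % i for i in range(7)]

restarting = chunks_restarting(sample, 3, 3)
advancing = list(chunks_advancing(sample, 3))

print(restarting)  # the same first three docs, three times
print(advancing)   # three distinct chunks covering all seven docs
```

If this is the cause, the first function with the file object works only because a file is already its own iterator, while a list is not.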