I have a very large CSV file (11 million lines).
I would like to create batches of data from it. I cannot for the life of me figure out how to read n lines at a time in a generator (where I specify what n is; sometimes I want it to be 50, sometimes 2). I came up with a kludge that works once, but I could not get it to iterate a second time. Generators are quite new to me, so it took me a while even to get the calling convention down. (For the record, this is a clean dataset with 29 values on every line.)
import numpy as np
import csv

def getData(filename):
    # stream the file one row at a time
    with open(filename, "r") as csv1:
        reader1 = csv.reader(csv1)
        for row1 in reader1:
            yield row1

def make_b(size, file):
    gen = getData(file)
    data = np.zeros((size, 29))
    for i in range(size):
        data[i, :] = next(gen)
    yield data[:, 0], data[:, 1:]  # (labels, features)

test = make_b(4, "myfile.csv")
next(test)  # works: returns the first batch of 4
next(test)  # fails: the generator is already exhausted, so this raises StopIteration
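I suspect what I am missing is a loop around the yield, so the generator keeps producing batches until the file runs out. Here is a sketch of what I think I need (using itertools.islice; I have only tried this on toy input, and the names are mine):

from itertools import islice
import numpy as np
import csv

def make_b(size, filename):
    # Keep a single reader open and slice off `size` rows per batch,
    # so successive next() calls walk through the whole file.
    with open(filename, "r") as f:
        reader = csv.reader(f)
        while True:
            chunk = list(islice(reader, size))
            if not chunk:
                break  # end of file
            data = np.asarray(chunk, dtype=float)
            yield data[:, 0], data[:, 1:]  # (labels, features)

batches = make_b(4, "myfile.csv")
y1, X1 = next(batches)  # rows 0-3
y2, X2 = next(batches)  # rows 4-7

Is this the idiomatic way to do it, or is there a cleaner pattern?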
The reason for this is to use it as an example of batching data in Keras. While it is possible to use other methods to get all of the data into memory, I am trying to introduce students to the concept of batching data from a large dataset. Since this is a survey course, I wanted to demonstrate reading batches from a large text file, which has proved frustratingly difficult for such an 'entry-level' task. (It's actually much easier in TensorFlow proper, but I am using Keras to introduce high-level concepts of the MLP.)
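For context, this is roughly how I plan to consume the batches in class. fit_generator expects an endless generator yielding (inputs, targets) tuples; the model below is just a placeholder MLP, not my real one, and it assumes the make_b sketch above with column 0 as the target:

from keras.models import Sequential
from keras.layers import Dense

def batch_forever(size, filename):
    # fit_generator needs an endless stream, so restart the file
    # whenever make_b exhausts it
    while True:
        for y, X in make_b(size, filename):
            yield X, y  # Keras wants (inputs, targets)

model = Sequential([
    Dense(64, activation="relu", input_dim=28),  # 28 features after the label column
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# 11,000,000 rows / 50 per batch = 220,000 steps per epoch
model.fit_generator(batch_forever(50, "myfile.csv"),
                    steps_per_epoch=220000, epochs=1)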