
I have a very large CSV file (11 million lines).

I would like to create batches of data. I cannot for the life of me figure out how to read n lines at a time in a generator (where I specify what n is; sometimes I want it to be 50, sometimes 2). I came up with a kluge that works once, but I could not get it to iterate a second time. Generators are quite new to me, so it took me a while even to get the calling convention down. (For the record, this is a clean dataset with 29 values on every line.)

import numpy as np
import csv

def getData(filename):
    with open(filename, "r") as csv1:
        reader1 = csv.reader(csv1)
        for row1 in reader1:
            yield row1

def make_b(size, file):
    gen = getData(file)
    data = np.zeros((size, 29))
    for i in range(size):
        data[i, :] = next(gen)        # numpy coerces the string values to floats
    # a single yield: the generator produces this one batch and then stops
    yield data[:, 0], data[:, 1:]     # (targets, features)

test = make_b(4, "myfile.csv")
next(test)   # first batch of 4 rows comes out as expected
next(test)   # raises StopIteration: the generator is already exhausted

The reason for this is to use it as an example of batching data in Keras. While it is possible to use other methods to get all of the data into memory, I am trying to introduce students to the concept of batching data from a large dataset. Since this is a survey course, I wanted to demonstrate batching data in from a large text file, which has proved frustratingly difficult for such an 'entry-level' task. (It's actually much easier in TensorFlow proper, but I am using Keras to introduce the high-level concepts of the MLP.)

RDS
    The answer from this question might be of use: https://stackoverflow.com/a/8290508/1844376 – ScottMcC Apr 04 '18 at 04:10
  • @ScottMcC does the file iterable know the length? I suppose I know the actual line count in this particular example, but I was hoping to be able to generalize it to any file without having to know the file length. – RDS Apr 04 '18 at 04:47
  • https://docs.python.org/3/library/itertools.html#itertools.islice – Burhan Khalid Apr 04 '18 at 04:50
  • If your blocks all have the same length, have a look at the `grouper` (not `groupby`) recipe on [this page](https://docs.python.org/3/library/itertools.html). – Paul Panzer Apr 04 '18 at 05:09
  • As a side note, instead of `for row1 in reader1: yield row1`, you could just `yield from csv.reader(csv1)`. – abarnert Apr 04 '18 at 07:28
  • More importantly: If you're trying to read the whole CSV file into an 11 million by 29 array in memory, you almost certainly want to use [`np.loadtxt`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html) (or one of the fancier CSV-handling functions, if that doesn't work). It's going to be a lot faster than anything that loops over rows, or even over batches, and it's one line instead of requiring a (relatively) complicated generator function. – abarnert Apr 04 '18 at 07:33
  • While it is technically possible to read this file into memory, my plan is to use it as an example for students in Keras using fit_generator. It's less about speed and more about the concept of batching data (and most Keras generator examples focus on images rather than a single source, or a binary read in chunks rather than a text file in lines). I will definitely look at the itertools/grouper suggestions to see if they provide a clearer solution (a sketch along those lines follows below). – RDS Apr 04 '18 at 13:41
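
A minimal sketch of the itertools.islice approach suggested in the comments, assuming (as in the question) 29 comma-separated numeric values per line with the first column as the target; the function name batch_gen and the target/feature split are carried over from make_b above as assumptions, not a tested solution:

import csv
from itertools import islice
import numpy as np

def batch_gen(filename, size):
    # islice pulls at most `size` rows from the reader on each pass,
    # so the file is consumed in fixed-size chunks without knowing its length
    with open(filename, "r") as f:
        reader = csv.reader(f)
        while True:
            chunk = list(islice(reader, size))
            if not chunk:
                break                        # end of file
            data = np.asarray(chunk, dtype=float)
            yield data[:, 0], data[:, 1:]    # (targets, features), as in make_b

batches = batch_gen("myfile.csv", 4)
y, X = next(batches)   # first batch of 4 rows
y, X = next(batches)   # next 4 rows, and so on until the file is exhausted

Keras's fit_generator expects a generator that yields (inputs, targets) tuples and never stops, so for the classroom example the batches could be wrapped roughly like this (the MLP below and the steps_per_epoch value are placeholders, not part of the question):

from keras.models import Sequential
from keras.layers import Dense

def keras_batches(filename, size):
    # loop forever, rereading the file once it is exhausted, and flip the
    # tuple into the (inputs, targets) order that fit_generator expects
    while True:
        for y, X in batch_gen(filename, size):
            yield X, y

model = Sequential([Dense(16, activation="relu", input_shape=(28,)),
                    Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit_generator(keras_batches("myfile.csv", 50),
                    steps_per_epoch=1000,    # placeholder: lines / batch size
                    epochs=5)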

0 Answers