10

I've come across a behavior in Python's built-in csv module that I've never noticed before. Typically, when I read in a CSV file, I follow the docs pretty much verbatim: I use 'with' to open the file, then loop over the reader object with a 'for' loop. However, I recently tried iterating over the csv.reader object twice in a row, only to find that the second 'for' loop did nothing.

import csv

with open('smallfriends.csv','rU') as csvfile:
    readit = csv.reader(csvfile,delimiter=',')

    for line in readit:
        print line

    for line in readit:
        print 'foo'

Console Output:

Austins-iMac:Desktop austin$ python -i amy.py 
['Amy', 'James', 'Nathan', 'Sara', 'Kayley', 'Alexis']
['James', 'Nathan', 'Tristan', 'Miles', 'Amy', 'Dave']
['Nathan', 'Amy', 'James', 'Tristan', 'Will', 'Zoey']
['Kayley', 'Amy', 'Alexis', 'Mikey', 'Sara', 'Baxter']
>>>
>>> readit
<_csv.reader object at 0x1023fa3d0>
>>> 

So the second 'for' loop basically does nothing. One thought I had is that the csv.reader object is being released from memory after being read once. This isn't the case, though, since it still retains its memory address. I found a post that mentions a similar problem. The reason they gave is that once the object is read, the pointer stays at the end of the memory address, ready to write data to the object. Is this correct? Could someone go into greater detail as to what is going on here? Is there a way to push the pointer back to the beginning of the memory address to reread it? I know it's bad coding practice to do that, but I'm mainly just curious and want to learn more about what goes on under Python's hood.

Thanks!

Austin A
  • 1
    Once you've consumed the iterator (`readit`) in the first loop, it's basically empty. – monkut Dec 03 '14 at 06:17
  • So it can be thought of as "read-once" then? – Austin A Dec 03 '14 at 06:19
  • 2
    Yes, the reader object is similar to (if not) a generator object, pulling and parsing lines from the file as requested (via `next()`). Once it is consumed (has run through the whole file), you'll need to restart the file at the beginning, or read all the data into memory, if you want to process it again. – monkut Dec 03 '14 at 06:22
  • The post that I included in my question mentions something about calling `reset()` on the object. How would you use that? – Austin A Dec 03 '14 at 06:24
  • You need to think about what you're trying to do. Generally, it's probably best to do any necessary processing in one pass. If you're going to be re-using the data frequently enough, go ahead and read it into memory (as a list for example). – monkut Dec 03 '14 at 06:49

3 Answers

9

I'll try to answer your other questions about what the reader is doing and why reset() or seek(0) might help. In the most basic form, the csv reader might look something like this:

def csv_reader(it):
    for line in it:
        yield line.strip().split(',')

That is, it takes any iterator producing strings and gives you a generator. All it does is take an item from your iterator, process it, and return the result. Once the underlying iterator is consumed, csv_reader stops. The reader has no idea where the iterator came from or how to make a fresh one, so it doesn't even try to reset itself. That is left to the programmer.
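To see the exhaustion directly, here is a small sketch using the real `csv.reader` over an in-memory iterator instead of a file. The sample rows are made up for illustration, and the code is written for Python 3.

```python
import csv

# Made-up sample rows; csv.reader accepts any iterable of strings.
lines = iter(['Amy,James,Nathan', 'James,Nathan,Tristan'])
reader = csv.reader(lines)

first_pass = list(reader)    # pulls every line out of the underlying iterator
second_pass = list(reader)   # nothing is left to pull, so this list is empty

print(first_pass)
print(second_pass)
```

The reader object itself is still alive after the first pass; it simply has no more input to parse.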

We can either modify the iterator in place without the reader knowing or just make a new reader. Here are some examples to demonstrate my point.

import csv

data = open('data.csv', 'r')
reader = csv.reader(data)

print(next(reader))               # Parse the first line
[next(data) for _ in range(5)]    # Skip the next 5 lines on the underlying iterator
print(next(reader))               # This will be the 7th line in data
print(reader.line_num)            # reader thinks this is the 2nd line
data.seek(0)                      # Go back to the beginning of the file
print(next(reader))               # Gives the first line again
data.close()

data = ['1,2,3', '4,5,6', '7,8,9']
reader = csv.reader(data)         # works fine on lists of strings too
print(next(reader))               # ['1', '2', '3']

In general, if you need a second pass, it's best to close and reopen your file and use a new csv reader. It's clean and keeps the bookkeeping straight.
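A minimal sketch of the reopen-per-pass approach, written for Python 3. The helper name `iter_rows` and the throwaway temporary file are assumptions made only so the example runs on its own; they are not part of the csv module.

```python
import csv
import os
import tempfile

def iter_rows(path):
    """Hypothetical helper: opens the file afresh on every call."""
    with open(path, newline='') as csvfile:
        yield from csv.reader(csvfile)

# Write a small throwaway file so the sketch is self-contained.
fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'w', newline='') as f:
    f.write('1,2,3\n4,5,6\n')

first_pass = [row for row in iter_rows(path)]
second_pass = [row for row in iter_rows(path)]   # new file handle, new reader

os.remove(path)
print(first_pass == second_pass)
```

Because each pass gets its own file handle and its own reader, neither pass depends on where the other one left off.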

kalhartt
4

Iterating over a csv.reader simply wraps iterating over the lines of the underlying file object. On each iteration, the reader gets the next line from the file, parses it, and returns the result.

So iterating over a csv.reader follows the same conventions as iterating over files. That is, once the file has reached its end, you have to seek back to the start before iterating a second time.

The below should do, though I haven't tested it:

import csv

with open('smallfriends.csv','rU') as csvfile:
    readit = csv.reader(csvfile,delimiter=',')

    for line in readit:
        print line

    # go back to the start of the file
    csvfile.seek(0)

    for line in readit:
        print 'foo'
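For reference, here is a Python 3 sketch of the same seek-and-reuse idea. It uses `io.StringIO` as a stand-in for the real file (an assumption so the example is self-contained), and the sample rows are made up.

```python
import csv
import io

# io.StringIO stands in for an open text file here.
csvfile = io.StringIO('Amy,James\nNathan,Sara\n')
readit = csv.reader(csvfile)

first_pass = list(readit)
lines_after_first = readit.line_num    # lines the reader has pulled so far

csvfile.seek(0)                        # rewind the underlying file object
second_pass = list(readit)             # the same reader yields the rows again
lines_after_second = readit.line_num   # the counter keeps running, it is not reset

print(first_pass == second_pass)
print(lines_after_first, lines_after_second)
```

Note that rewinding the file does not reset the reader's own `line_num` counter, which is one reason creating a fresh reader for the second pass can be tidier.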
sebastian
3

If it's not too much data, you can always read it into a list:

import csv

with open('smallfriends.csv','rU') as csvfile:
    readit = csv.reader(csvfile,delimiter=',')
    csvdata = list(readit)

    for line in csvdata :
        print line

    for line in csvdata :
        print 'foo'
monkut
  • 1
    Yea, that's always an option, but I'm more interested in what's happening at a lower level and why the object can't be iterated over a second time. – Austin A Dec 03 '14 at 06:21