10

Apparently some CSV output implementation somewhere truncates the field separators from the right on the last row, and only the last row in the file, when the trailing fields are null.

Example input CSV; fields 'c' and 'd' are nullable:

a|b|c|d
1|2||
1|2|3|4
3|4||
2|3

In something like the script below, how can I tell whether I am on the last line so I know how to handle it appropriately?

import csv

reader = csv.reader(open('somefile.csv'), delimiter='|', quotechar=None)

header = reader.next()

for line_num, row in enumerate(reader):
    assert len(row) == len(header)
    ....
ʞɔıu
  • 47,148
  • 35
  • 106
  • 149
  • 3
    Note that what you posted is not CSV. CSV stands for "Comma Separated Values" - the separator MUST be a comma. –  Jun 09 '09 at 16:02
  • 8
    most csv implementations allow you to specify an alternate character as the separator – ʞɔıu Jun 09 '09 at 16:06
  • @Neil: huh? "The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. There is no “CSV standard” [..] Still, while the delimiters and quoting characters vary, the overall format is similar enough that [..]" – SilentGhost Jun 09 '09 at 16:08
  • 3
    The only near CSV standard, RFC4180, does not. –  Jun 09 '09 at 16:09
  • @SilentGhost CSV is a special case of a more general data format called DSV (delimiter-separated values). DSV is the format you are talking about. –  Jun 09 '09 at 16:11
  • 2
    In Python, per http://docs.python.org/library/csv.html, "the delimiters and quoting characters vary" -- in particular the delimiter is not necessarily a comma, so the csv module is also used to read "non-comma" separated values, ridiculous as the use may actually seem (just like, e.g., in Excel, "export to CSV" can use any delimiter, depending in that case on the Windows Regional Settings). – Alex Martelli Jun 09 '09 at 16:13
  • 4
    If it makes anyone feel any better I think it's safe to say that the input I have to deal with here obeys no standard of any decency. – ʞɔıu Jun 09 '09 at 16:14

7 Answers

16

Basically you only know you've run out after you've run out. So you could wrap the reader iterator, e.g. as follows:

def isLast(itr):
  old = itr.next()
  for new in itr:
    yield False, old
    old = new
  yield True, old

and change your code to:

for line_num, (is_last, row) in enumerate(isLast(reader)):
    if not is_last: assert len(row) == len(header)

etc.
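
On Python 3, itr.next() becomes next(itr). A minimal sketch of the same wrapper in Python 3 syntax (the empty-iterator guard is my own addition, since on newer Python 3 versions a bare StopIteration escaping a generator becomes a RuntimeError, per PEP 479):

def is_last(itr):
    # yields (is_last, item) pairs, exactly like isLast above
    try:
        old = next(itr)
    except StopIteration:
        return          # empty iterator: nothing to yield
    for new in itr:
        yield False, old
        old = new
    yield True, old

Usage is the same as in the original snippet: for line_num, (is_last_row, row) in enumerate(is_last(reader)): ...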

Alex Martelli
  • 854,459
  • 170
  • 1,222
  • 1,395
6

I am aware it is an old question, but I came up with a different answer than the ones presented. The reader object already increments its line_num attribute as you iterate through it. So I first get the total number of lines with row_count, then compare that with reader.line_num on each row.

import csv

def row_count(filename):
    with open(filename) as in_file:
        return sum(1 for _ in in_file)

in_filename = 'somefile.csv'
reader = csv.reader(open(in_filename), delimiter='|')

last_line_number = row_count(in_filename)
for row in reader:
    if last_line_number == reader.line_num:
        print "It is the last line: %s" % row
Eduardo
  • 4,282
  • 2
  • 49
  • 63
  • But you first loop through the entire file just to get the total number of lines. Not very efficient. And what happens if the file changes between the first time you open it and the second? – Stefan Apr 12 '21 at 15:12
3

If you have an expectation of a fixed number of columns in each row, then you should be defensive against:

(1) ANY row being shorter -- e.g. a writer (SQL Server / Query Analyzer IIRC) may omit trailing NULLs at random; users may fiddle with the file using a text editor, including leaving blank lines.

(2) ANY row being longer -- e.g. commas not quoted properly.

You don't need any fancy tricks. Just an old-fashioned if-test in your row-reading loop:

for row in csv.reader(...):
    ncols = len(row)
    if ncols != expected_cols:
        appropriate_action()
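
For the file in the question, one possible way to fill in expected_cols and appropriate_action is to pad short rows and reject long ones; the policy below is my own choice, not part of this answer:

import csv

with open('somefile.csv', newline='') as f:      # Python 3
    reader = csv.reader(f, delimiter='|')
    header = next(reader)
    expected_cols = len(header)
    for row in reader:
        ncols = len(row)
        if ncols < expected_cols:
            # treat missing trailing separators as null fields
            row += [''] * (expected_cols - ncols)
        elif ncols > expected_cols:
            raise ValueError('row %d has %d fields, expected %d'
                             % (reader.line_num, ncols, expected_cols))
        # ... process the normalised row ...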
John Machin
  • 81,303
  • 11
  • 141
  • 189
  • I agree but the source of this data refuses/is too incompetent to send me correctly formatted data. I have no choice but to handle its quirks myself. – ʞɔıu Jul 30 '09 at 16:03
  • Yes you have to handle its quirks yourself and I'm just pointing out that more quirks than "missing trailing null fields in last row" should be checked for in generality AND they can be checked simply without fancy code -- I don't understand your "but". – John Machin Jul 30 '09 at 16:50
1

If you want to get exactly the last row, try this code:

with open("\\".join([myPath,files]), 'r') as f:
    print f.readlines()[-1] #or your own manipulations

If you want to continue working with values from the row, do the following:

f.readlines()[-1].split(",")[0] #this would let you get columns by their index
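
If the last line should still go through the csv module (the question's delimiter is '|', not ','), a small sketch assuming the file fits in memory:

import csv

with open('somefile.csv', newline='') as f:      # Python 3
    last_row = list(csv.reader(f, delimiter='|'))[-1]
print(last_row[0])     # columns by index, as above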
demid5111
  • 21
  • 3
0

If you use for row in reader:, the loop just stops after the last row has been read.
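
A consequence of that, if it helps: in Python the loop variable outlives the loop, so once the for finishes, row still holds the last row and can be patched up separately. A sketch (Python 3 syntax, padding policy is my own):

import csv

with open('somefile.csv', newline='') as f:
    reader = csv.reader(f, delimiter='|')
    header = next(reader)
    row = None
    for row in reader:
        pass                                    # normal per-row processing goes here
    if row is not None:
        row += [''] * (len(header) - len(row))  # pad the truncated trailing fields
        print(row)                              # the (now fixed-up) last row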

iMom0
  • 12,493
  • 3
  • 49
  • 61
0

Could you not just catch the error when the csv reader reads the last line, with a

try:
    # ... do your stuff here ...
except StopIteration:
    ...

construct?

See the following Python code on Stack Overflow for an example of how to use try/except: Python CSV DictReader/Writer issues
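
For what it's worth, StopIteration is only raised into your code if you advance the reader yourself with next(); a plain for loop swallows it. A minimal sketch of that manual style (Python 3 syntax, my own illustration rather than code from the linked post):

import csv

with open('somefile.csv', newline='') as f:
    reader = csv.reader(f, delimiter='|')
    header = next(reader)
    while True:
        try:
            row = next(reader)
        except StopIteration:
            break            # reader exhausted; the previously read row was the last one
        # ... do your stuff here ...

As the comment below points out, this still only tells you the file has ended after the last row has already been read.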

Alex Boschmans
  • 515
  • 2
  • 12
  • That won't tell you when you're on the last line, only after you've passed it. – ʞɔıu Jul 30 '09 at 16:04
  • 1
    I reread your question, and you're right, that's not what you are asking - you want a way to deal with the last line. Why can't you use the solution supplied by John Machin below? – Alex Boschmans Jul 30 '09 at 18:45
0

Just extend the row to the length of the header:

for line_num, row in enumerate(reader):
    while len(row) < len(header):
        row.append('')
    ...
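
With the sample data from the question, the padded row then satisfies the original length check. A quick sketch (Python 3 syntax, using an in-memory copy of the sample instead of a file):

import csv, io

sample = "a|b|c|d\n1|2||\n1|2|3|4\n3|4||\n2|3\n"
reader = csv.reader(io.StringIO(sample), delimiter='|')
header = next(reader)

for line_num, row in enumerate(reader):
    while len(row) < len(header):
        row.append('')
    assert len(row) == len(header)   # now holds for the truncated last row too

print(row)                           # ['2', '3', '', '']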
serbaut
  • 5,852
  • 2
  • 28
  • 32