2

I am parsing a large data file using:

reader = csv.DictReader(open('Sourcefile.txt','rt'), delimiter = '\t')
for row in reader:
  etc
  etc

Parsing works great but I am performing calculations on the data, which require me to directly access the line I'm on, the line before, or to skip 10 lines ahead.

I can't figure out how to get the actual line number of the file I am in, and how to move to some other line in the file (ex: "Current_Line" + 10) and start accessing data from that point forward in the file.

Is the solution to read the entire file into an array, rather than trying to move back and forth in the file? I am expecting this file to be upwards of 160MB and assumed moving back and forth in the file would be most memory efficient.

AlG
DyTech
  • Have you tried an `enumerate` on your iteration? I'm not sure how you are identifying your line (regex, etc...), but then you have the line number, and can do a `file.seek(line#)` to go directly there. – flybonzai Jan 26 '16 at 17:26
  • This is a repeat question... http://stackoverflow.com/questions/2081836/reading-specific-lines-only-python Also here http://stackoverflow.com/questions/2444538/go-to-a-specific-line-in-python Which recommends that you use pythons builtin **linecache.getline** – kpie Jan 26 '16 at 17:51
  • Do you only ever want to go just one line back? How far forward do you want to go? Do you need full random access? – Steven Rumbalski Jan 26 '16 at 17:53

2 Answers

4

Use next(csvreader) (csvreader.next() in Python 2) to advance to the next row. To move 10 rows forward, call it 10 times, for example in a for loop over range(10).

Use csvreader.line_num to get the current line number. As Steven Rumbalski pointed out, it counts the physical lines read from the file, so it tracks the record count only if no field contains a newline character (0x0A).

To get the line before the current line, simply cache the last row in a variable.

More information here: https://docs.python.org/2/library/csv.html
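
The three points above can be combined in a short sketch. It assumes Python 3 and, as a swapped-in technique not mentioned in the answer, uses itertools.islice to skip ahead. The in-memory sample data and the names previous_row and visited are hypothetical, chosen only for illustration.

```python
import csv
import io
from itertools import islice

# Hypothetical tab-delimited sample standing in for Sourcefile.txt.
sample = "AA\tBB\n" + "".join("a%d\tb%d\n" % (i, i) for i in range(1, 21))

reader = csv.DictReader(io.StringIO(sample), delimiter='\t')

previous_row = None   # cache of the row before the current one
visited = []          # (line number, current AA value, previous AA value)

for row in reader:
    prev_aa = previous_row['AA'] if previous_row else None
    visited.append((reader.line_num, row['AA'], prev_aa))
    previous_row = row

    if row['AA'] == 'a3':
        # Consume the next 10 rows without processing them;
        # islice simply stops early if the file ends first.
        for skipped in islice(reader, 10):
            previous_row = skipped

print(visited)
```

Only one row is held in memory besides the current one, so this stays cheap even for a large file. It cannot move backwards by more than one row, though.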

Edit

A small example:

import csv

reader = csv.DictReader(open('Sourcefile.txt', 'rt'), delimiter='\t')

last_line = None

for row in reader:
    print("Current row: %s (line %d)" % (row, reader.line_num))

    # do something with the row

    last_line = row
    if reader.line_num % 10 == 0:
        print("Modulo 10! Skipping 5 lines")
        try:
            for i in range(5):
                last_line = next(reader)  # reader.next() in Python 2
        except StopIteration:  # file is finished
            break

This does the same (except that last_line is not updated for the skipped rows), but in my eyes it is better code:

import csv

reader = csv.DictReader(open('Sourcefile.txt', 'rt'), delimiter='\t')

last_line = None

skip = 0  # number of upcoming rows to ignore
for row in reader:
    if skip > 0:
        skip -= 1
        continue

    print("Current row: %s (line %d)" % (row, reader.line_num))

    # do something with the row

    last_line = row
    if reader.line_num % 10 == 0:
        print("Modulo 10! Skipping 5 lines")
        skip += 5
print("File is done!")
Mijago
  • [`csvreader.line_num`](https://docs.python.org/2/library/csv.html#csv.csvreader.line_num): "The number of lines read from the source iterator. This is not the same as the number of records returned, as records can span multiple lines." Translation: `line_num` can only be trusted if your fields do not contain newlines. – Steven Rumbalski Jan 26 '16 at 17:40
  • That is a good point. I edited it into the post. But when do you use a NewLine Character inside your data without escaping it? – Mijago Jan 26 '16 at 17:46
  • Literal unescaped newlines can occur inside quoted fields. – Steven Rumbalski Jan 26 '16 at 17:50
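
The difference discussed in these comments can be seen in a small sketch, assuming Python 3. The sample data is hypothetical: its first record has a quoted field containing a literal newline, so that record spans two physical lines.

```python
import csv
import io

# Hypothetical sample: the quoted field in the first record contains a newline.
sample = 'AA\tBB\n"first\nstill first"\tb1\nsecond\tb2\n'

reader = csv.DictReader(io.StringIO(sample), delimiter='\t')

counts = []  # (record number, reader.line_num) pairs
for record_num, row in enumerate(reader, start=1):
    counts.append((record_num, reader.line_num))

print(counts)
```

Here the first record ends on physical line 3 (one header line plus two lines for the record), so line_num and the record number drift apart. If a record index is what you need, counting with enumerate is the safer choice.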
-1

For maximal flexibility (at the cost of memory use), you can copy all rows from the csv reader into a list, effectively caching the whole table.

import csv

with open('Sourcefile.txt', 'rt') as f:
    reader = csv.DictReader(f, delimiter='\t')
    fn = reader.fieldnames  # column names, taken from the first line of the file
    t = list(reader)        # every remaining row, each as a dictionary

print(fn)
print(t[0])
# t[0] is the first data row (the second line of the file); fn holds the header (the first line).
# fn is a list of keys that can be applied to each row in t.
# t[0][fn[0]] gives the value in the first column of the first data row.
# fn is a list, so the order of the columns is preserved.
# Each element of t is a dictionary, so use fn when column order matters.
kpie
  • Why not `t = list(reader)`? OP asked for something memory efficient. Reading the whole file into memory is the opposite of that. – Steven Rumbalski Jan 26 '16 at 17:49
  • "upwards of 160MB". And that's before being made into dictionaries. The space in memory will be much larger. – Steven Rumbalski Jan 26 '16 at 17:57
  • How would I get at the individual elements defined by fieldnames in any particular row? For instance suppose field names are AA, BB, CC, DD, and I need to extract value of AA in line 1, CC in line 7, etc. – DyTech Jan 26 '16 at 20:40
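
One way to approach the last comment, sketched under the assumption that the rows have been cached in a list as in the answer above: each row is a dictionary keyed by field name, so a value is reached by row index and then by column name. The sample data and variable names are hypothetical, and "line 1" is read here as the first data row (the header is not counted).

```python
import csv
import io

# Hypothetical sample using the field names from the comment above.
sample = "AA\tBB\tCC\tDD\n" + "".join(
    "a%d\tb%d\tc%d\td%d\n" % (i, i, i, i) for i in range(1, 11)
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter='\t'))

# rows[0] is the first data row, so data row n is rows[n - 1].
aa_in_row_1 = rows[0]['AA']
cc_in_row_7 = rows[6]['CC']
print(aa_in_row_1, cc_in_row_7)
```

If "line 7" is meant as the physical line number in the file, including the header, the index would shift by one more.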