How to read head and tail of CSV file with python

Question

I have a CSV file with a timestamp field, where the first line gives the start time and the last line gives the end time of a time frame. How can I get both values using Python?

CSV file:

run,a,b,2015-10-25T18:02:30.798426Z
run,c,d,2015-10-25T18:02:30.807375Z
run,e,f,2015-10-25T18:02:30.809113Z
run,g,h,2015-10-25T18:02:30.825410Z
run,i,j,2015-10-25T18:02:30.843917Z
run,k,l,2015-10-25T18:02:30.850492Z
run,m,n,2015-10-25T18:02:30.858041Z
run,o,p,2015-10-25T18:02:30.859345Z
run,q,r,2015-10-25T18:02:30.862365Z

Thanks.

How about this one? http://stackoverflow.com/questions/3346430/most-efficient-way-to-get-first-and-last-line-of-file-python — Aung, Oct 28 '15 at 23:15

score 1 · Accepted Answer · edited May 23 '17 at 11:44

If you already know the lines are ordered by time, you can just do something like:

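import csv

with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    # The first row holds the start time (the timestamp is the 4th column).
    first = next(reader)[3].strip()
    last = first  # covers a file that has only one row
    # Walk through the remaining rows; the last non-empty one holds the end time.
    for row in reader:
        if row:
            last = row[3].strip()

print('%s - %s' % (first, last))
# OUTPUTS:
# 2015-10-25T18:02:30.798426Z - 2015-10-25T18:02:30.862365Z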

If you then want to convert first and last from ISO 8601 strings into datetime objects, you can use dateutil.parser, e.g.:

import dateutil.parser
first = dateutil.parser.parse(first)
last = dateutil.parser.parse(last)
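
As a rough sketch of the same conversion without the third-party dateutil dependency, the standard library's datetime.strptime can parse this particular layout. It assumes every timestamp has microseconds and a trailing Z, as in the question's sample; the helper name parse_ts is hypothetical.

```python
from datetime import datetime, timezone

def parse_ts(text):
    # Assumes the exact layout from the sample data,
    # e.g. 2015-10-25T18:02:30.798426Z (trailing Z meaning UTC).
    naive = datetime.strptime(text.strip(), '%Y-%m-%dT%H:%M:%S.%fZ')
    return naive.replace(tzinfo=timezone.utc)

# First and last timestamps taken from the CSV in the question.
start = parse_ts('2015-10-25T18:02:30.798426Z')
end = parse_ts('2015-10-25T18:02:30.862365Z')

# Subtracting two datetimes gives a timedelta covering the time frame.
elapsed = end - start
print(elapsed.total_seconds())
```

Once both values are datetime objects, the length of the time frame is just their difference.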

answered Oct 28 '15 at 23:09 by lemonhead · edited May 23 '17 at 11:44 by Community

Don't forget you have to `import csv` first. – McGlothlin Oct 28 '15 at 23:27
@lemonhead thanks. Just a minor change in csv file. what if timestamp row number changes. I've updated the post. – hossein Oct 29 '15 at 09:26

score 1 · Answer 2 · answered Oct 29 '15 at 02:53

The answer above works, but it reads the entire file. If you are on a Unix system, you can let head and tail fetch only the lines you need:

# assume CSV file like
# a,b,1
# a,b,2
# a,b,3
# ...
# a,b,234934

import subprocess

# get the first N lines of the CSV file as a string
how_many_lines_in_head = '1'
head_args = ['head', '-n', how_many_lines_in_head, 'input.csv']
# check_output returns bytes on Python 3, so decode before splitting
head_str = subprocess.check_output(head_args).decode()
first_timestamp = head_str.split(',')[-1].strip()

# do the same for the tail end of the file
how_many_lines_in_tail = '1'
tail_args = ['tail', '-n', how_many_lines_in_tail, 'input.csv']
tail_str = subprocess.check_output(tail_args).decode()
last_timestamp = tail_str.split(',')[-1].strip()

# I'm assuming a Unix system here, so line endings are \n;
# strip() removes the trailing line ending from the last field

answered Oct 29 '15 at 02:53 by chill_turner

Unfortunately, the CSV format can include newlines in quoted fields. So for 100% accuracy, you'd either have to implement a reverse row parser to figure this out, or just give up and do what the other answer suggests, since a `\n` is only a record separator under certain contexts. – ShadowRanger Oct 29 '15 at 02:57
Hi there. You are correct that not EVERY CSV file can be used with my code. Really, the key idea of my solution is that you rely on Unix tools built for this exact purpose instead of reading the entire file into Python. I'd also argue that this answer IS 100% accurate for the original question: there are no '\n' characters inside fields in the CSV snippet posted, so my solution doesn't address that possibility. Also, parsing a single line of a CSV file with a few columns is not really something that should make us just give up! – chill_turner Oct 29 '15 at 14:17
Problem is, you're demonstrating many "Don't"s of csv processing here. This can't handle quoted fields at all (let alone the admittedly uncommon case of fields with newlines). Even if we assume there is a speed benefit, getting the wrong answer fast is wrongheaded. And performing the processing in subprocesses adds significant overhead for process management; unless the `csv` is huge, you probably won't beat an optimized "in-python" solution. – ShadowRanger Oct 29 '15 at 16:04
For example, if we're okay with ignoring embedded newlines (but otherwise want to handle quoted fields properly), you could `mmap` the input file, process the first line normally, then use `rfind(b"\n")` to figure out where the last line begins, read and process it (with proper `csv` module parser) and you've avoided reading any parts of the file you don't need, avoided all subprocesses, and correctly parsed per real CSV rules, not "split on commas and hope". – ShadowRanger Oct 29 '15 at 16:07
Totally agree with everything you've said here =). Thanks! – chill_turner Oct 29 '15 at 19:15
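
The mmap idea from ShadowRanger's comment could be sketched roughly as follows. This is only an illustration, not code from the thread: the function name first_and_last_rows is hypothetical, and it assumes a non-empty, UTF-8 encoded file with no newlines embedded in quoted fields.

```python
import csv
import mmap

def first_and_last_rows(path):
    """Return (first_row, last_row), touching only the two ends of the file.

    Assumes a non-empty file and no newlines inside quoted fields.
    """
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # First line: everything up to the first newline (or the whole file).
            first_end = mm.find(b'\n')
            if first_end == -1:
                first_end = len(mm)
            first_line = mm[:first_end].rstrip(b'\r')

            # Skip trailing line endings so the last line is not empty.
            end = len(mm)
            while end > 0 and mm[end - 1] in b'\r\n':
                end -= 1
            # rfind returns -1 if no earlier newline exists, so start becomes 0.
            start = mm.rfind(b'\n', 0, end) + 1
            last_line = mm[start:end]

    # Parse each line with the csv module so quoted fields are handled properly.
    # Decoding as UTF-8 is an assumption about the file's encoding.
    first_row = next(csv.reader([first_line.decode('utf-8')]))
    last_row = next(csv.reader([last_line.decode('utf-8')]))
    return first_row, last_row
```

For the file in the question, the timestamps would then be the fourth field of each returned row, e.g. first_and_last_rows('file.csv')[0][3].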
