Requirements:
- Fast - works for a CSV of any size/length
- Fast - processing time depends only on the row length and the number of rows read back from EOF
- No additional dependencies allowed
Code:
import io
import pandas as pd

def get_csv_tail(filepath, max_rows=1, encoding="utf-8"):
    with open(filepath, "rb") as f:
        first = f.readline().decode(encoding)  # Read the header line.
        f.seek(-2, 2)  # Jump to the second-to-last byte of the file.
        count = 0
        while count < max_rows:  # Until we've walked max_rows newlines back...
            try:
                while f.read(1) != b"\n":  # ...scan backwards until an EOL is found,
                    f.seek(-2, 1)          # jumping back over the byte just read plus one more.
            except IOError:  # Seeking before byte 0 means we've hit the start of the file.
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
            count += 1
            f.seek(-2, 1)  # Step back past the newline we just found and keep scanning.
        f.seek(1, 1)  # Move forward onto the newline that precedes the tail.
        tail = f.read().decode(encoding)  # We found our spot; read from here through to the end of the file.
    return io.StringIO(first + tail)  # Header plus the last max_rows rows (pandas skips the blank line between them).
df = pd.read_csv(get_csv_tail('long.csv', max_rows=5)) # Get the last five rows as a df
WARNING: this assumes your CSV only contains newline characters at end-of-row positions, which is not true of every CSV file: quoted fields are allowed to contain embedded newlines, and those would be miscounted here.
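If embedded newlines are a real possibility, a slower but robust fallback is to let the standard-library csv module do the parsing and keep only the final rows in a bounded deque. This is a minimal sketch, not part of the approach above, and it gives up the speed requirement: it reads the whole file, so its cost scales with file size rather than tail size. The name get_csv_tail_robust is just illustrative.

import csv
import io
from collections import deque

import pandas as pd

def get_csv_tail_robust(filepath, max_rows=1, encoding="utf-8"):
    # Parse the whole file with csv.reader so quoted, multi-line fields are handled correctly.
    with open(filepath, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader)                       # keep the header row
        last_rows = deque(reader, maxlen=max_rows)  # only the final max_rows rows stay in memory
    # Re-serialise header + tail rows into an in-memory buffer for pandas.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(last_rows)
    buf.seek(0)
    return buf

df = pd.read_csv(get_csv_tail_robust('long.csv', max_rows=5))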
This also pulls the header line so the columns are read correctly into pandas. If you don't need that, remove the first-line read right after the file is opened and change the return to io.StringIO(tail) only.
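For example, a hypothetical call site for that variant, assuming get_csv_tail has been modified as just described to return io.StringIO(tail) without the header (the column names are placeholders):

tail_only = get_csv_tail('long.csv', max_rows=5)  # modified version: tail only, no header line
df = pd.read_csv(tail_only, header=None,
                 names=['col_a', 'col_b', 'col_c'])  # hypothetical column names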
Based on "What is the most efficient way to get first and last line of a text file?"