
A few methods to do this:

  1. Read the entire CSV and then use df.tail
  2. Somehow reverse the file (what's the best way to do this for large files?) and then use the nrows argument to read
  3. Somehow find the number of rows in the CSV, then use skiprows and read the required number of rows.
  4. Maybe do a chunked read, discarding the initial chunks (though I'm not sure how this would work)

Can it be done in some easier way? If not, which among these four should be preferred, and why?

Possibly related:

  1. Efficiently finding the last line in a text file
  2. Reading parts of ~13000 row CSV file with pandas read_csv and nrows

Not directly related:

  1. How to get the last n row of pandas dataframe?
Nipun Batra

7 Answers


I don't think pandas offers a way to do this in read_csv.

Perhaps the neatest (in one pass) is to use collections.deque:

from collections import deque
from StringIO import StringIO

with open(fname, 'r') as f:
    q = deque(f, 2)  # replace 2 with n (the number of lines to keep from the end)

In [12]: q
Out[12]: deque(['7,8,9\n', '10,11,12'], maxlen=2)
         # these are the last two lines of my csv

In [13]: pd.read_csv(StringIO(''.join(q)), header=None)

Another option worth trying is to get the number of lines in a first pass and then read the file again, skip that number of rows (minus n) using read_csv...
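That two-pass option might look like the following sketch (the function name is mine; it assumes the CSV has a header row and more than n data rows):

```python
import pandas as pd

def tail_csv_two_pass(fname, n):
    """First pass counts lines; second pass skips all but the last n data rows."""
    with open(fname) as f:
        num_lines = sum(1 for _ in f)          # pass 1: count every line
    # pass 2: keep row 0 (the header) and skip data rows 1 .. num_lines - n - 1
    return pd.read_csv(fname, skiprows=range(1, num_lines - n))
```

The `range` keeps the header line, so column names survive; the cost is reading the file twice.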

Andy Hayden
    if you are using python 3.x, replace 'from StringIO import StringIO' by 'from io import StringIO'. Check https://stackoverflow.com/questions/11914472/stringio-in-python3/18284900#18284900 – xiaxio Jan 27 '20 at 04:03
  • Thank you. This is really fast in my case. Tested with a csv with shape `(9020057, 4)` and retrieving the last `10000` rows. Compared it with `df.tail(10000)`, `df.iloc[-10000:]` – Rakesh Jun 27 '20 at 19:47
    sometimes it's helpful to have the headers from the first line in the file. I use this bit of code to get the last N lines with the header at the start: `with open(product_filename, 'r') as f: q = [f.readline()]; q.extend(deque(f, ndays)); df = pd.read_csv(StringIO(''.join(q)))` – rocketman Dec 13 '20 at 00:29
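The header-preserving variant rocketman describes in the comments can be written out as a small function (the name is mine; shown with the Python 3 import):

```python
from collections import deque
from io import StringIO  # Python 2: from StringIO import StringIO

import pandas as pd

def tail_csv_with_header(fname, n):
    """Read the last n lines of a CSV, keeping the header row from the top."""
    with open(fname) as f:
        q = [f.readline()]      # keep the header line
        q.extend(deque(f, n))   # deque keeps only the last n lines
    return pd.read_csv(StringIO(''.join(q)))
```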

Here's a handy way to do it; it works well for what I need:

import tailer
import pandas as pd
import io

with open(filename) as file:
    last_lines = tailer.tail(file, 15)

df = pd.read_csv(io.StringIO('\n'.join(last_lines)), header=None)

You need to install tailer for this to work:

pip install --user tailer
Rafael
Parikshit Bhinde
    I tried several ways to read last n lines in a csv file, including the ones posted on this thread and also some on this other question: https://stackoverflow.com/questions/38704949/read-the-last-n-lines-of-a-csv-file-in-python-with-numpy-pandas, and the one by @Parikshit Bhinde was the fastest – xiaxio Jan 30 '20 at 02:03

Files are simply streams of bytes. Lines do not exist as separate entities; they are an artifact of treating certain bytes as newline characters. As such, you must read from the beginning of the file to identify lines in order.

If the file doesn't change (often) and this is an operation you need to perform often (say, with different values of n), you can store the byte offsets of the newline characters in a second file. You can use this much-smaller file and the seek command to quickly jump to a given line in the first file and read from there.
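A minimal sketch of that index idea (the helper names are mine, not from any library; it assumes a header row and n smaller than the number of data rows):

```python
import io

import pandas as pd

def build_line_index(path):
    """One-off pass: record the byte offset at which each line starts."""
    offsets, pos = [], 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets  # persist this list to a second file if the CSV is static

def tail_with_index(path, offsets, n):
    """Seek straight to the nth-from-last line and read to EOF, keeping the header."""
    with open(path, "rb") as f:
        header = f.readline()
        f.seek(offsets[-n])  # jump directly to the line start, no scanning
        data = f.read()
    return pd.read_csv(io.BytesIO(header + data))
```

Building the index costs one full read, but every subsequent tail (for any n) is a seek plus a short read.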

(Some operating systems provide record-oriented files that have more complex internal structure than the common flat file. The above does not apply to them.)

chepner

Since you are considering reversing the file, I assume it's OK to create new files.

  1. Create a new file with the last n lines: `tail -n "$n" original.csv > temp.csv`
  2. Prepend the header line and produce the final file: `head -1 original.csv | cat - temp.csv > newfile.csv && rm -f temp.csv`
Yi Wu
  • As a side remark, use [tempfile](https://docs.python.org/3/library/tempfile.html) to safely create a temporary file. If you already have a file named `temp.csv`, you may be in trouble. – Antoine Apr 24 '19 at 16:16

The third option is what I use:

Somehow find the number of rows in the CSV, then use skiprows and read required number of rows.

This is my suggestion:

import pandas as pd 

# User inputs
fname = 'test_file.csv'
tail_len = 15

# The two steps in the description
with open(fname) as f:
    n_rows = sum(1 for _ in f)
df = pd.read_csv(fname, skiprows=range(1, n_rows - tail_len))

For the "somehow" part (counting the rows), I took the idea from here.


Requirements:

  1. Fast, for a CSV of any size or length
  2. Processing time proportional only to the row length and the number of rows read from EOF
  3. No additional dependencies allowed
Code:

import pandas as pd
import io
import sys

def get_csv_tail(filepath, max_rows=1):
    with open(filepath, "rb") as f:
        first = f.readline().decode(sys.stdout.encoding)  # Read the header line.
        f.seek(-2, 2)                                     # Jump to the second last byte.
        count = 0
        while count < max_rows:                           # Until we've gone max_rows back...
            try:
                while f.read(1) != b"\n":                 # ...scan backwards until EOL is found.
                    f.seek(-2, 1)
            except IOError:                               # Hit the start of the file.
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
            count = count + 1
            f.seek(-2, 1)                                 # Step over the newline we just found.
        f.seek(1, 1)                                      # Move forward one byte.
        tail = f.read().decode(sys.stdout.encoding)       # We found our spot; read from here to the end of the file.

    return io.StringIO(first + tail)

df = pd.read_csv(get_csv_tail('long.csv', max_rows=5))    # Get the last five rows as a df

WARNING: this assumes your csv only contains newline characters at EOL positions, which is not true for all csv files.

This also pulls the header so the columns are read correctly into pandas. If you don't need that, you can remove the first line after the file open and modify the function return to only process the tail.

Based on What is the most efficient way to get first and last line of a text file?

chazzmoney

You can create a metadata.csv file and keep track of the length of the csv. Each time you add rows to your csv, update metadata.csv with the latest row count. Next time you load the csv, just use the following:

file_size = 139405  # row count, stored in your metadata.csv file
n_bottom_rows = 7
df = pd.read_csv('myfile.csv', skiprows=range(1, file_size - n_bottom_rows))  # range keeps the header row
Nicole Douglas