
A few methods to do this:

  1. Read the entire CSV and then use df.tail
  2. Somehow reverse the file (what's the best way to do this for large files?) and then use the nrows argument to read
  3. Somehow find the number of rows in the CSV, then use skiprows and read the required number of rows.
  4. Maybe do a chunked read, discarding the initial chunks (though I'm not sure how this would work)

Can it be done in some easier way? If not, which among these four should be preferred, and why?

Possibly related:

  1. Efficiently finding the last line in a text file
  2. Reading parts of ~13000 row CSV file with pandas read_csv and nrows

Not directly related:

  1. How to get the last n row of pandas dataframe?
Nipun Batra

7 Answers


I don't think pandas offers a way to do this in read_csv.

Perhaps the neatest (in one pass) is to use collections.deque:

from collections import deque
from StringIO import StringIO

with open(fname, 'r') as f:
    q = deque(f, 2)  # replace 2 with n (the number of lines to keep from the end)

In [12]: q
Out[12]: deque(['7,8,9\n', '10,11,12'], maxlen=2)
         # these are the last two lines of my csv

In [13]: pd.read_csv(StringIO(''.join(q)), header=None)

Another option worth trying is to get the number of lines in a first pass and then read the file again, skip that number of rows (minus n) using read_csv...
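That two-pass option might look like the following sketch (the function name is mine; it assumes the CSV has a header row and more than n data rows):

```python
import pandas as pd

def tail_csv_two_pass(fname, n):
    """First pass counts lines; second pass skips all but the last n data rows."""
    with open(fname) as f:
        num_lines = sum(1 for _ in f)          # pass 1: count every line
    # pass 2: keep row 0 (the header) and skip data rows 1 .. num_lines - n - 1
    return pd.read_csv(fname, skiprows=range(1, num_lines - n))
```

The `range` keeps the header line, so column names survive; the cost is reading the file twice.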

Andy Hayden
    if you are using python 3.x, replace 'from StringIO import StringIO' by 'from io import StringIO'. Check https://stackoverflow.com/questions/11914472/stringio-in-python3/18284900#18284900 – xiaxio Jan 27 '20 at 04:03
  • Thank you. This is really fast in my case. Tested with a csv with shape `(9020057, 4)` and retrieving the last `10000` rows. Compared it with `df.tail(10000)`, `df.iloc[-10000:]` – Rakesh Jun 27 '20 at 19:47
    sometimes it's helpful to have the headers from the first line in the file. I use this bit of code to get the last N lines with the header at the start: `with open(product_filename, 'r') as f: q = [f.readline()]; q.extend(deque(f, ndays)); df = pd.read_csv(StringIO(''.join(q)))` – rocketman Dec 13 '20 at 00:29
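The header-preserving variant rocketman describes in the comments can be written out as a small function (the name is mine; shown with the Python 3 import):

```python
from collections import deque
from io import StringIO  # Python 2: from StringIO import StringIO

import pandas as pd

def tail_csv_with_header(fname, n):
    """Read the last n lines of a CSV, keeping the header row from the top."""
    with open(fname) as f:
        q = [f.readline()]      # keep the header line
        q.extend(deque(f, n))   # deque keeps only the last n lines
    return pd.read_csv(StringIO(''.join(q)))
```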

Here's a handy way to do it; it works well for what I need:

import tailer
import pandas as pd
import io

with open(filename) as file:
    last_lines = tailer.tail(file, 15)

df = pd.read_csv(io.StringIO('\n'.join(last_lines)), header=None)

You need to install tailer for this to work:

pip install --user tailer
Rafael
Parikshit Bhinde
    I tried several ways to read last n lines in a csv file, including the ones posted on this thread and also some on this other question: https://stackoverflow.com/questions/38704949/read-the-last-n-lines-of-a-csv-file-in-python-with-numpy-pandas, and the one by @Parikshit Bhinde was the fastest – xiaxio Jan 30 '20 at 02:03

Files are simply streams of bytes. Lines do not exist as separate entities; they are an artifact of treating certain bytes as newline characters. As such, you must read from the beginning of the file to identify lines in order.

If the file doesn't change (often) and this is an operation you need to perform often (say, with different values of n), you can store the byte offsets of the newline characters in a second file. You can use this much-smaller file and the seek command to quickly jump to a given line in the first file and read from there.
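A minimal sketch of that index idea (the helper names are mine, not from any library; it assumes a header row and n smaller than the number of data rows):

```python
import io

import pandas as pd

def build_line_index(path):
    """One-off pass: record the byte offset at which each line starts."""
    offsets, pos = [], 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets  # persist this list to a second file if the CSV is static

def tail_with_index(path, offsets, n):
    """Seek straight to the nth-from-last line and read to EOF, keeping the header."""
    with open(path, "rb") as f:
        header = f.readline()
        f.seek(offsets[-n])  # jump directly to the line start, no scanning
        data = f.read()
    return pd.read_csv(io.BytesIO(header + data))
```

Building the index costs one full read, but every subsequent tail (for any n) is a seek plus a short read.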

(Some operating systems provide record-oriented files that have more complex internal structure than the common flat file. The above does not apply to them.)

chepner

Since you are considering reversing the file, I assume it's OK to create new files.

  1. Create a new file with the last n lines: `tail -n "$n" original.csv > temp.csv`
  2. Prepend the header line and produce the final file: `head -1 original.csv | cat - temp.csv > newfile.csv && rm -f temp.csv`
Yi Wu
  • As a side remark, use [tempfile](https://docs.python.org/3/library/tempfile.html) to safely create a temporary file. If you already have a file named `temp.csv`, you may be in trouble. – Antoine Apr 24 '19 at 16:16

The third option is what I use:

Somehow find the number of rows in the CSV, then use skiprows and read required number of rows.

This is my suggestion:

import pandas as pd 

# User inputs
fname = 'test_file.csv'
tail_len = 15

# The two steps in the description
with open(fname) as f:
    n_rows = sum(1 for _ in f)
df = pd.read_csv(fname, skiprows=range(1, n_rows - tail_len))

For the "somehow" part (counting the rows), I took the idea from here.


Requirements:

  1. Fast, for a CSV of any size or length
  2. Processing time proportional only to the row length and the number of rows read from EOF
  3. No additional dependencies allowed
Code:

import pandas as pd
import io
import sys

def get_csv_tail(filepath, max_rows=1):
    with open(filepath, "rb") as f:
        first = f.readline().decode(sys.stdout.encoding)  # Read the header line.
        f.seek(-2, 2)                                     # Jump to the second last byte.
        count = 0
        while count < max_rows:                           # Until we've gone max_rows back...
            try:
                while f.read(1) != b"\n":                 # ...scan backwards until EOL is found.
                    f.seek(-2, 1)
            except IOError:                               # Hit the start of the file.
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
            count = count + 1
            f.seek(-2, 1)                                 # Step over the newline we just found.
        f.seek(1, 1)                                      # Move forward one byte.
        tail = f.read().decode(sys.stdout.encoding)       # We found our spot; read from here to the end of the file.

    return io.StringIO(first + tail)

df = pd.read_csv(get_csv_tail('long.csv', max_rows=5))    # Get the last five rows as a df

WARNING: this assumes your csv only contains newline characters at EOL positions, which is not true for all csv files.

This also pulls the header so the columns are read correctly into pandas. If you don't need that, you can remove the first line after the file open and modify the function return to only process the tail.

Based on What is the most efficient way to get first and last line of a text file?

chazzmoney

You can create a metadata.csv file and keep track of the length of the csv. Each time you add rows to your csv, update metadata.csv with the latest row count. Next time you load the csv, just use the following:

file_size = 139405  # row count, stored in your metadata.csv file
n_bottom_rows = 7
df = pd.read_csv('myfile.csv', skiprows=range(1, file_size - n_bottom_rows))  # range keeps the header row
Nicole Douglas