Reading last N rows of a large csv in Pandas

Question

I have file with 50 GB data. I know how to use Pandas for my data analysis.
I am only in need of the large 1000 lines or rows and in need of complete 50 GB.
Hence, I thought of using the nrows option in the read_csv().
I have written the code like this:

import pandas as pd
df = pd.read_csv("Analysis_of_50GB.csv",encoding="utf-16",nrows=1000,index_col=0)

But it has taken the top 1000 rows. I am in need of the last 100 rows. So I did this and received error:

df = pd.read_csv("Analysis_of_50GB.csv",encoding="utf-16",nrows=-1000,index_col=0)
ValueError: 'nrows' must be an integer >=0

I have even tried using the chunksize in the read_csv(). But it still loads the complete file. And even the output was not DataFrame but iterables.

Hence, please let me know what I can in this scenario.

Please NOTE THAT I DO NOT WANT TO OPEN THE COMPLETE FILE...

@SukumarRdjf I do not want to load the whole file. Please read the question carefully before marking it for duplicate. — Jaffer Wilson, Mar 26 '19 at 08:48
https://stackoverflow.com/questions/17108250/efficiently-read-last-n-rows-of-csv-into-dataframe — b-fg, Mar 26 '19 at 08:55
Do you know an approximation for the number of rows in the file or the length of a line? — Serge Ballesta, Mar 26 '19 at 09:02
No sir. I need to count it. I guess that is not feasible as I need to open the file and count. — Jaffer Wilson, Mar 26 '19 at 09:09
@EdChum It is close to a duplicate of https://stackoverflow.com/questions/17108250/efficiently-read-last-n-rows-of-csv-into-dataframe, but not of last rows of a dataframe! The whole point is precisely not to load the whole file into a dataframe... — Serge Ballesta, Mar 26 '19 at 09:32
@SergeBallesta I've reopened it, I've posted a pure pandas method, there is also a github method [`wcount`](https://gist.github.com/zed/0ac760859e614cd03652) which seems to be the fastest method apparently — EdChum, Mar 26 '19 at 09:45

EdChum · Answer 1 · 2019-03-26T09:52:42.800

A pure pandas method:

import pandas as pd
line = 0
chksz = 1000
for chunk in pd.read_csv("Analysis_of_50GB.csv",encoding="utf-16",chunksize = chksz,index_col=0, usecols=0):
    line += chunk.shape[0]

So this just counts the the number of rows, we read just the first column for performance reasons.

Once we have the total number of rows we just subtract from this the number of rows we want from the end:

df = pd.read_csv("Analysis_of_50GB.csv",encoding="utf-16", skiprows = line - 1000,index_col=0)

Loochie · Answer 2 · 2019-03-26T10:10:32.613

1

I think you need to use skiprows and nrows together. Assuming that your file has 1000 rows, then,

df =pd.read_csv('"Analysis_of_50GB.csv", encoding="utf16",skiprows = lambda x: 0<x<=900, nrows=1000-900,index_col=0)

reads all the rows from 901 to 1000.

edited Mar 26 '19 at 10:10

answered Mar 26 '19 at 09:02

Loochie

2,414
13
20

score 1 · Answer 3 · answered Mar 26 '19 at 09:46

1

You should consider using dask which does chunking under the hood and allows you to work with very large data frames. It has a very similar workflow as pandas and the most important functions are already implemented.

answered Mar 26 '19 at 09:46

Auss

451
5
9

score 1 · Accepted Answer · answered Mar 26 '19 at 10:37

The normal way would be to read the whole file and keep 1000 lines in a dequeue as suggested in the accepted answer to Efficiently Read last 'n' rows of CSV into DataFrame. But it may be suboptimal for a really huge file of 50GB.

In that case I would try a simple pre-processing:

open the file
read and discard 1000 lines
use ftell to have an approximation of what has been read so far
seek that size from the end of the file and read the end of file in a large buffer (if you have enough memory)
store the positions of the '\n' characters in the buffer in a dequeue of size 1001 (the file has probably a terminal '\n'), let us call it deq
ensure that you have 1001 newlines, else iterate with a larger offset
load the dataframe with the 1000 lines contained in the buffer:
```
df = pd.read_csv(io.StringIO(buffer[d[0]+1:]))
```

Code could be (beware: untested):

with open("Analysis_of_50GB.csv", "r", encoding="utf-16") as fd:
    for i in itertools.islice(fd, 1250):      # read a bit more...
        pass
    offset = fd.tell()
    while(True):
        fd.seek(-offset, os.SEEK_END)
        deq = collection.deque(maxlen = 1001)
        buffer = fd.read()
        for i,c in enumerate(buffer):
            if c == '\n':
                deq.append(i)
        if len(deq) == 1001:
            break
        offset = offset * 1250 // len(deq)

df = pd.read_csv(io.StringIO(buffer[d[0]+1:]))

May be this helps.. But will consume a lot time I guess I will try it... — Jaffer Wilson, Mar 26 '19 at 16:02
@JafferWilson: Unsure because it could save a lot of disk io for a huge file. I would be very interested in knowing the results... — Serge Ballesta, Mar 26 '19 at 16:07

Reading last N rows of a large csv in Pandas

4 Answers4