
Is there a built-in way to use read_csv to read only the first n lines of a file without knowing the number of lines ahead of time? I have a large file that takes a long time to read, and occasionally only want to use the first, say, 20 lines to get a sample of it (and prefer not to load the full thing and take the head of it).

If I knew the total number of lines I could do something like footer_lines = total_lines - n and pass this to the skipfooter keyword arg (a sketch of that idea is included at the end of this question). My current solution is to manually grab the first n lines with Python and StringIO it to pandas:

import pandas as pd
from io import StringIO  # Python 2: from StringIO import StringIO
from itertools import islice

n = 20
with open('big_file.csv', 'r') as f:
    # f.readlines(n) would treat n as a byte-size hint, not a line count,
    # so islice is used to take exactly the first n lines
    head = ''.join(islice(f, n))

df = pd.read_csv(StringIO(head))

It's not that bad, but is there a more concise, 'pandasic' (?) way to do it with keywords or something?
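
For reference, a minimal sketch of the skipfooter idea mentioned above. It assumes the line count comes from a separate full pass over the file (exactly the cost I'd like to avoid), and skipfooter requires the slower Python parsing engine:

import pandas as pd

n = 20
# one full pass just to count lines, which defeats the purpose
total_lines = sum(1 for _ in open('big_file.csv'))

# keep the header plus the first n data rows
df = pd.read_csv('big_file.csv',
                 skipfooter=total_lines - n - 1,  # -1 accounts for the header row
                 engine='python')  # skipfooter is only supported by the python engine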

JJJ
beardc
    To see how to load the last _N_ lines, check out [this SO post](http://stackoverflow.com/questions/17108250/efficiently-read-last-n-rows-of-csv-into-dataframe) – zelusp Sep 27 '16 at 03:09

2 Answers


I think you can use the nrows parameter. From the docs:

nrows : int, default None

    Number of rows of file to read. Useful for reading pieces of large files

which seems to work. Using one of the standard large test files (988504479 bytes, 5344499 lines):

In [1]: import pandas as pd

In [2]: time z = pd.read_csv("P00000001-ALL.csv", nrows=20)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s

In [3]: len(z)
Out[3]: 20

In [4]: time z = pd.read_csv("P00000001-ALL.csv")
CPU times: user 27.63 s, sys: 1.92 s, total: 29.55 s
Wall time: 30.23 s
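
If the goal is to sample pieces of the file rather than just its head, the related chunksize parameter is worth a mention; a minimal sketch (not part of the original timing run) against the same file:

import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames rather than
# loading the whole file; the first chunk holds the first 20 data rows
reader = pd.read_csv("P00000001-ALL.csv", chunksize=20)
df = next(iter(reader))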
zelusp
DSM

I would use the skiprows argument in read_csv, e.g.:

# row 0 (the header) and row 1 are kept, rows 2-19999 are skipped,
# and nrows then caps the result at 10,000 of the remaining rows
df = pd.read_csv(filename, skiprows=range(2, 20000), nrows=10000)
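
Newer pandas versions also accept a callable for skiprows, evaluated against each row index; a sketch of the equivalent call (row 0, the header, is still kept):

df = pd.read_csv(filename, skiprows=lambda i: 2 <= i < 20000, nrows=10000)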
Wissam