60

I have a very big CSV file, so I cannot read it all into memory. I only want to read and process a few lines from it, so I am looking for a function in Pandas that can handle this task, which basic Python handles well:

with open('abc.csv') as f:
    # read and discard lines until a particular line number is reached
    for _ in range(1000):
        f.readline()
    line = f.readline()  # the line I actually want

However, if I do this in pandas, I always end up reading the same first line:

datainput1 = pd.read_csv('matrix.txt', sep=',', header=None, nrows=1)  # reads the first row
datainput2 = pd.read_csv('matrix.txt', sep=',', header=None, nrows=1)  # reads the same first row again

I am looking for an easier way to handle this task in pandas. For example, how can I quickly read rows 1000 to 2000?

I want to use pandas because I want to read the data into a DataFrame.

lserlohn
petezurich's answer should be accepted. The definition of `nrows` specifically says it's "Useful for reading pieces of large files." – endolith May 02 '22 at 20:39

2 Answers

86

Use chunksize:

for df in pd.read_csv('matrix.txt', sep=',', header=None, chunksize=1):
    # do something with each one-row chunk (df is a DataFrame)
    ...

To answer the second part of your question, do this:

df = pd.read_csv('matrix.txt', sep=',', header=None, skiprows=1000, chunksize=1000)

This will skip the first 1000 rows and then read in chunks of 1000 rows, giving you rows 1000-2000 in the first chunk. It is unclear whether you need the end points included or not, but you can fiddle the numbers to get exactly what you want.
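
Note that with chunksize, read_csv returns an iterator rather than a DataFrame, so (a minimal sketch, reusing the same placeholder file name) you would pull out that first chunk explicitly:

import pandas as pd

# skiprows=1000 drops rows 0-999; chunksize=1000 yields 1000-row chunks,
# so the first chunk holds rows 1000-1999 of the original file
reader = pd.read_csv('matrix.txt', sep=',', header=None, skiprows=1000, chunksize=1000)
df = next(reader)  # a DataFrame containing just that first chunk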

EdChum
32

In addition to EdChum's answer, I find the nrows argument useful: it simply defines the number of rows you want to import with pandas' read_csv().

This way you don't get an iterator; instead you can just import a part of the whole file, of size nrows. It works with skiprows too.

# skip the first 1000 rows, then read the next 1000 rows into a DataFrame
df = pd.read_csv('matrix.txt', sep=',', header=None, skiprows=1000, nrows=1000)
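
If your file has a header row you want to keep, skiprows also accepts a list-like of row numbers rather than a single count (a minimal sketch, again with a placeholder file name):

import pandas as pd

# keep row 0 (the header), skip data rows 1-999, then read the next 1000 rows
df = pd.read_csv('matrix.txt', sep=',', skiprows=range(1, 1000), nrows=1000)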
petezurich