1

I want to build a system that loads and analyzes large amounts of data in pandas; later I will also use it to write the results back to .parquet files.

When I test this with a simple example, I see what looks like some kind of built-in limit on the number of rows:

import pandas as pd

# Create file with 100 000 000 rows
contents = """
Tommy;19
Karen;20
"""*50000000

open("person.csv","w").write(
f"""
Name;Age
{contents}
"""
)
print("Test generated")

df = pd.read_csv("person.csv",delimiter=";")
len(df)

This returns 10 000 000, not 100 000 000.
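
A quick way to narrow this down is to count the lines actually written to disk, which shows whether the shortfall comes from pandas or from the file itself; a minimal check along these lines:

# Count the raw lines in the generated file; if this number is also
# short, the problem is in how the file was written, not in read_csv.
with open("person.csv") as fp:
    print(sum(1 for _ in fp))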

Øyvind Rogne

3 Answers

3

Change the way you create the file: I think you have too many blank rows, and you don't close the file properly (no context manager or explicit close() call), so the last buffered data may never be flushed to disk before you read it back:

# Create file with 100 000 000 rows
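# (the trailing backslash after the opening triple quote prevents a leading blank line in the data)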
contents = """\
Tommy;19
Karen;20
"""*50000000

with open('person.csv', 'w') as fp:
    fp.write('Name;Age\n')
    fp.write(contents)

Read the file:

df = pd.read_csv('person.csv', delimiter=';')
print(df)

# Output
           Name  Age
0         Tommy   19
1         Karen   20
2         Tommy   19
3         Karen   20
4         Tommy   19
...         ...  ...
99999995  Karen   20
99999996  Tommy   19
99999997  Karen   20
99999998  Tommy   19
99999999  Karen   20

[100000000 rows x 2 columns]
Corralien
0

I don't think there is a hard row limit, but there is a limit to how much your machine can process at a time; you can work around it by making your code more efficient.

Currently I am working with around 1–2 million rows without any issues.
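
One common way to keep memory under control is pandas' chunksize option, which reads the file a piece at a time instead of all at once; a minimal sketch, assuming the person.csv file from the question:

import pandas as pd

# Each chunk is an ordinary DataFrame of up to 1,000,000 rows, so you
# can filter or aggregate it and keep only the result in memory.
total = 0
for chunk in pd.read_csv("person.csv", delimiter=";", chunksize=1_000_000):
    total += len(chunk)
print(total)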

nobcoders
  • OK. I will probably be working with 2 million rows on average, but the example above indicates that more than 10 million does not work. – Øyvind Rogne Feb 12 '22 at 07:24
  • That is a memory limit, not a DataFrame limitation. Also, when working with large data the CSV format becomes a bit troublesome to use, so personally I use pickle: easy and fast (see the sketch after these comments). – nobcoders Feb 12 '22 at 07:28
  • If you try the above code with a DataFrame instead of a list, you will get the desired result. – nobcoders Feb 12 '22 at 07:41
  • There is a limitation to pandas. This is why there is PySpark for big data. – Nguai al Jun 12 '22 at 05:48
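
As an illustration of the pickle suggestion above, a minimal round-trip sketch (file names are only examples; pickle preserves dtypes and is fast, but it is Python-specific and should not be loaded from untrusted sources):

import pandas as pd

# Write the DataFrame to a pickle file and read it back.
df = pd.read_csv("person.csv", delimiter=";")
df.to_pickle("person.pkl")

df2 = pd.read_pickle("person.pkl")
print(len(df2))
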
0

The main bottleneck is your memory: pandas uses NumPy under the hood, so the whole DataFrame has to fit in RAM. You can load 10M rows, or many more, as long as that is not an issue for your computer.
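
A quick way to see how close you are to that limit is to check the frame's real memory footprint; a minimal sketch, assuming the two-column person.csv from the question (converting the string column to category is just one example of shrinking it):

import pandas as pd

df = pd.read_csv("person.csv", delimiter=";")

# deep=True includes the actual size of the Python strings in the
# object column, which dominate the footprint.
print(df.memory_usage(deep=True).sum() / 1e9, "GB")

# A low-cardinality string column stored as 'category' is far smaller.
df["Name"] = df["Name"].astype("category")
print(df.memory_usage(deep=True).sum() / 1e9, "GB")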

peerpressure