1

I want to build a system that loads and analyzes large amounts of data in pandas; later I will also use it to write the results back to .parquet files.

When I test this with a simple example, I see what looks like some kind of built-in limit on the number of rows:

import pandas as pd

# Create file with 100 000 000 rows
contents = """
Tommy;19
Karen;20
"""*50000000

open("person.csv","w").write(
f"""
Name;Age
{contents}
"""
)
print("Test generated")

df = pd.read_csv("person.csv",delimiter=";")
len(df)

This returns 10 000 000, not 100 000 000.
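
A quick way to narrow this down is to count the lines actually written to disk, which shows whether the shortfall comes from pandas or from the file itself; a minimal check along these lines:

# Count the raw lines in the generated file; if this number is also
# short, the problem is in how the file was written, not in read_csv.
with open("person.csv") as fp:
    print(sum(1 for _ in fp))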

Øyvind Rogne

3 Answers

3

Change the way you create the file: I think you have too many blank rows, and you don't close the file properly (no context manager or explicit close() call), so the last buffered data may never be flushed to disk before you read it back:

# Create file with 100 000 000 rows
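# (the trailing backslash after the opening triple quote prevents a leading blank line in the data)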
contents = """\
Tommy;19
Karen;20
"""*50000000

with open('person.csv', 'w') as fp:
    fp.write('Name;Age\n')
    fp.write(contents)

Read the file:

df = pd.read_csv('person.csv', delimiter=';')
print(df)

# Output
           Name  Age
0         Tommy   19
1         Karen   20
2         Tommy   19
3         Karen   20
4         Tommy   19
...         ...  ...
99999995  Karen   20
99999996  Tommy   19
99999997  Karen   20
99999998  Tommy   19
99999999  Karen   20

[100000000 rows x 2 columns]
Corralien
0

I don't think there is a hard row limit, but there is a limit to how much your machine can process at a time; you can work around it by making your code more efficient.

Currently I am working with around 1–2 million rows without any issues.
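
One common way to keep memory under control is pandas' chunksize option, which reads the file a piece at a time instead of all at once; a minimal sketch, assuming the person.csv file from the question:

import pandas as pd

# Each chunk is an ordinary DataFrame of up to 1,000,000 rows, so you
# can filter or aggregate it and keep only the result in memory.
total = 0
for chunk in pd.read_csv("person.csv", delimiter=";", chunksize=1_000_000):
    total += len(chunk)
print(total)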

nobcoders
  • OK. I will probably be working with 2 million rows on average, but the example above indicates that more than 10 million does not work. – Øyvind Rogne Feb 12 '22 at 07:24
  • That is a memory limit, not a DataFrame limitation. Also, when working with large data the CSV format becomes a bit troublesome to use, so personally I use pickle: easy and fast (see the sketch after these comments). – nobcoders Feb 12 '22 at 07:28
  • If you try the above code with a DataFrame instead of a list, you will get the desired result. – nobcoders Feb 12 '22 at 07:41
  • There is a limitation to pandas. This is why there is PySpark for big data. – Nguai al Jun 12 '22 at 05:48
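
As an illustration of the pickle suggestion above, a minimal round-trip sketch (file names are only examples; pickle preserves dtypes and is fast, but it is Python-specific and should not be loaded from untrusted sources):

import pandas as pd

# Write the DataFrame to a pickle file and read it back.
df = pd.read_csv("person.csv", delimiter=";")
df.to_pickle("person.pkl")

df2 = pd.read_pickle("person.pkl")
print(len(df2))
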
0

The main bottleneck is your memory: pandas uses NumPy under the hood, so the whole DataFrame has to fit in RAM. You can load 10M rows, or many more, as long as that is not an issue for your computer.
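
A quick way to see how close you are to that limit is to check the frame's real memory footprint; a minimal sketch, assuming the two-column person.csv from the question (converting the string column to category is just one example of shrinking it):

import pandas as pd

df = pd.read_csv("person.csv", delimiter=";")

# deep=True includes the actual size of the Python strings in the
# object column, which dominate the footprint.
print(df.memory_usage(deep=True).sum() / 1e9, "GB")

# A low-cardinality string column stored as 'category' is far smaller.
df["Name"] = df["Name"].astype("category")
print(df.memory_usage(deep=True).sum() / 1e9, "GB")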

peerpressure