It depends on your system's capabilities and how you are processing the data. How are you processing each row? What intermediate values are stored? How much history needs to be retained? And so on.
You can read the file into a DataFrame and then use iterrows, but it is not terribly efficient because:
- a new pandas Series object has to be created for each row, and
- it does not preserve dtypes across rows (dtypes are only preserved across columns for DataFrames), as the sketch after this list shows.
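A minimal sketch of both costs, using a small made-up mixed-dtype frame:

import pandas as pd

# Small, hypothetical frame used only to illustrate the point.
small = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5], "c": ["x", "y"]})
print(small.dtypes)              # a: int64, b: float64, c: object

for _, row in small.iterrows():
    # Each row is a freshly built Series; the mixed columns are upcast to object.
    print(type(row), row.dtype)  # <class 'pandas.core.series.Series'> object
    break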
In general, if your hardware is not a constraint, it is best to just read the entire table into memory and then process it.
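If the whole frame fits, the per-row work can usually be expressed as vectorized column operations instead of a Python-level loop; a hedged sketch (the aggregations here are only placeholders for whatever per-row computation you actually need):

import numpy as np
import pandas as pd

demo = pd.DataFrame(np.random.randn(1000, 100))
row_sums = demo.sum(axis=1)          # one vectorized call instead of iterating rows
row_peaks = demo.abs().max(axis=1)   # same idea for other reductions

To put some numbers behind this, generate a large test frame and write it to disk: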
filename = "random_data.csv"  # placeholder path for the test file
df = pd.DataFrame(np.random.randn(6000, 50000))
>>> df.shape
(6000, 50000)
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6000 entries, 0 to 5999
Columns: 50000 entries, 0 to 49999
dtypes: float64(50000)
memory usage: 2.2 GB
df.to_csv(filename)
The docs describe how to iterate through large files in chunks via the chunksize argument of read_csv. Here is a rough timing comparison of several chunk sizes against reading the whole file at once:
reader1 = pd.read_csv(filename, chunksize=1)
reader2 = pd.read_csv(filename, chunksize=10)
reader3 = pd.read_csv(filename, chunksize=100)
reader4 = pd.read_csv(filename, chunksize=1000)
# Chunksize = 1
%time for row in reader1: temp = row
CPU times: user 2h 11min 27s, sys: 1min 22s, total: 2h 12min 49s
Wall time: 2h 12min 39s
# Chunksize = 10
%time for row in reader2: temp = row
CPU times: user 14min 38s, sys: 11.9 s, total: 14min 50s
Wall time: 14min 50s
# Chunksize = 100
%time for row in reader3: temp = row
CPU times: user 5min 17s, sys: 6.97 s, total: 5min 24s
Wall time: 5min 24s
# Chunksize = 1000
%time for row in reader4: temp = row
CPU times: user 4min 13s, sys: 6.8 s, total: 4min 20s
Wall time: 4min 20s
# Reading the whole file.
%time df2 = pd.read_csv(filename)
CPU times: user 4min 11s, sys: 8.4 s, total: 4min 19s
Wall time: 4min 19s
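The timings show that very small chunks are dominated by per-chunk overhead, while a reasonably large chunksize gets close to the cost of reading the whole file at once. If the file genuinely does not fit in memory, chunked reading is still the way to go; do the per-chunk work vectorized and keep only a small running result. A minimal sketch, assuming a per-column sum is the processing you need (the aggregation is made up for illustration):

import pandas as pd

filename = "random_data.csv"  # placeholder path, same file as above
totals = None

for chunk in pd.read_csv(filename, chunksize=1000):
    # Each chunk is an ordinary DataFrame, so vectorized operations work as usual.
    partial = chunk.sum(numeric_only=True)
    totals = partial if totals is None else totals.add(partial, fill_value=0)

print(totals)

Only the running totals stay in memory; everything else is discarded with each chunk.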