
In Python pandas, does the chunksize matter when reading in a large file?

e.g.

import pandas as pd

df = pd.DataFrame()
for chunk in pd.read_csv('example.csv', chunksize=1000):
    df = pd.concat([df, chunk], ignore_index=True)

Does setting chunksize to a larger or smaller number help the file load faster overall?

  • Possibly, but this is something you should probably experiment with; that will give you your answer. – juanpa.arrivillaga Jul 26 '22 at 22:59
  • I was wondering if there was a theoretical best answer/range for different cases, possibly dependent on data types, total rows/columns, memory usage, etc. – Nicholas Hansen-Feruch Jul 26 '22 at 23:01
  • It depends. You're trading off processing speed against memory. It takes slightly longer to process the file in chunks, but if that means each chunk fits in memory when the whole file does not, it's a big win. It's only going to be a problem when the file is many hundreds of megabytes. – Tim Roberts Jul 26 '22 at 23:01
  • Never call pd.concat inside a for loop; it leads to quadratic copying in memory. It's best to append each chunk to a list and pd.concat the list once afterwards (see the sketch after these comments). https://stackoverflow.com/a/36489724/6361531 – Scott Boston Jul 27 '22 at 03:02
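To illustrate the pattern recommended in the last comment, here is a minimal sketch of collecting chunks in a list and concatenating once at the end; the file name example.csv and the chunksize of 1000 are simply carried over from the question's example, not values with any special meaning.

import pandas as pd

# Append each chunk to a plain Python list; appending is cheap, whereas
# calling pd.concat inside the loop re-copies all previously read rows
# on every iteration (the quadratic-copying problem).
chunks = []
for chunk in pd.read_csv('example.csv', chunksize=1000):
    chunks.append(chunk)

# Concatenate once at the end, so the data is copied only a single time.
df = pd.concat(chunks, ignore_index=True)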

0 Answers