
I have a directory with many large CSV files, all formatted the same way. Each file is too large to fit in memory.

My problem is that pandas.read_csv() only lets me read one file at a time. I want pandas.read_csv() to treat all the files in the directory as one big file, i.e. as if the files were joined end-to-end, so that I can read chunks from them seamlessly. How can I do this most efficiently? Performance is very important since the files are so large.

EDIT: I want the files to be treated as a single file because each chunk must have the same size, and the chunk size must evenly divide the total size of all the files (not the size of each individual file).
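To make it concrete, the per-file loop below is roughly what I can do today (the directory name `data_dir` and the chunk size of 100,000 rows are just placeholders). The problem is that the chunking restarts with every file, so the last chunk of each file is usually shorter than the rest:

```python
import glob

import pandas as pd

# Per-file chunking: chunk boundaries are tied to the individual files,
# so every file ends with a short "leftover" chunk instead of the chunks
# running seamlessly across file boundaries.
for path in sorted(glob.glob("data_dir/*.csv")):
    for chunk in pd.read_csv(path, chunksize=100_000):
        print(path, len(chunk))  # the last chunk of each file is usually short
```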

Mikkel Rev
    chunksize is a pandas.read_csv() parameter. Please go through the documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html – Bikash Ranjan Bhoi Aug 14 '18 at 11:39
  • Why do you want to treat them as one big file? I would rather use `pd.read_csv(filename, chunksize=some_size)` to process them. – quest Aug 14 '18 at 11:39
  • Because each chunk must be the same size, and the chunk size must evenly divide the total size of all the files (not the individual file size) – Mikkel Rev Aug 14 '18 at 14:31
  • I think [this](https://stackoverflow.com/a/46310416/3944322) is what you're looking for (making a file-like object from your csv files to be used with read_csv). – Stef Aug 14 '18 at 16:27
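Building on Stef's suggestion above, here is a rough sketch of such a file-like wrapper. The class name `ConcatCsvReader`, the directory `data_dir`, and the chunk size of 100,000 rows are all placeholders, and it assumes every file starts with one identical header row. pandas accepts any object with a `read()` method, so the wrapper only needs to splice the files together while dropping the repeated headers:

```python
import glob
import io
import os

import pandas as pd


class ConcatCsvReader(io.RawIOBase):
    """Present several CSV files as one continuous binary stream.

    The header row is kept from the first file and skipped in every
    later file, so pandas sees one coherent CSV.
    """

    def __init__(self, paths):
        self._paths = iter(paths)
        self._current = None
        self._first = True

    def readable(self):
        return True

    def _advance(self):
        """Open the next file, skipping its header row; return False when done."""
        if self._current is not None:
            self._current.close()
        path = next(self._paths, None)
        if path is None:
            self._current = None
            return False
        self._current = open(path, "rb")
        if not self._first:
            self._current.readline()  # drop the repeated header row
        self._first = False
        return True

    def readinto(self, buffer):
        # io.RawIOBase builds read() on top of readinto(), which is the
        # only method pandas' parser needs from a file-like object.
        while True:
            if self._current is None and not self._advance():
                return 0
            n = self._current.readinto(buffer)
            if n:
                return n
            if not self._advance():
                return 0


paths = sorted(glob.glob(os.path.join("data_dir", "*.csv")))  # placeholder directory
for chunk in pd.read_csv(ConcatCsvReader(paths), chunksize=100_000):
    print(len(chunk))  # every chunk has 100,000 rows except possibly the last
```

Because the concatenated stream is handed to a single `pd.read_csv()` call, the chunk boundaries are determined by the total number of rows across all files rather than by any individual file.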

0 Answers