I have 3000 CSV files stored on my hard drive, each containing thousands of rows and 10 columns. Rows correspond to dates, and the number of rows as well as the exact dates is different across spreadsheets. The columns for all the spreadsheets are the same in number (10) and label. For each date from the earliest date across all spreadsheets to the latest date across all spreadsheets, I need to (i) access the columns in each spreadsheet for which data for that date exists, (ii) run some calculations, and (iii) store the results (a set of 3 or 4 scalar values) for that date. To clarify, results
should be a variable in my workspace that stores the results for each date for all CSVs.
Is there a way to load this data using Python that is both time and memory efficient? I tried creating a Pandas data frame for each CSV, but loading all the data into RAM takes almost ten minutes and almost completely fills up my RAM. Is it possible to check if the date exists in a given CSV, and if so, load the columns corresponding to that CSV into a single data frame? This way, I could load just the rows that I need from each CSV to do my calculations.