As per your comment, to read four different sheets of that size into a single df you would first have to:
1. Make sure all sheets have the same columns (count and names). Every column name that doesn't match becomes a brand-new column in the combined df, so the result grows much wider and is padded with NaNs in many cells (a quick header-only check is sketched right after this list).
2. Load and process one sheet at a time, so you can dispose of the raw in-memory data before reading the next sheet.
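If you want to verify point 1 up front, here is a minimal sketch using your pth, filename and list_with_sheetnames variables (nrows=0 only parses the header row, so this is cheap):

import pandas as pd

columns_seen = None
for sht in list_with_sheetnames:
    cols = set(pd.read_excel(pth + filename, sheet_name=sht, nrows=0).columns)
    if columns_seen is None:
        columns_seen = cols
    elif cols != columns_seen:
        # symmetric difference = columns that exist in one sheet but not the other
        print(f"Sheet {sht} has mismatched columns: {cols ^ columns_seen}")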
You shouldn't run out of memory: an 8k × 600 sheet of float64 values is roughly 8,000 × 600 × 8 bytes ≈ 38 MB once loaded. But in case you come across larger datasets:
If you still run out of memory, try casting to np.float32; you may give up some precision (and possibly some performance), but you halve the memory usage.
Or you can process one sheet at a time and save each result to a binary format such as Parquet or HDF5 before moving on to the next one (both fallbacks are sketched below). Ultimately, if you need to do computations on the full dataset and the set is too large for memory, you might need to look into out-of-core algorithms, but that is another story.
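A rough sketch of both fallbacks combined, assuming the sheets are purely numeric (otherwise drop the dtype argument) and that pyarrow or fastparquet is installed; the output file names are made up:

import numpy as np
import pandas as pd

for sht in list_with_sheetnames:
    # float32 halves the per-cell memory compared to the default float64
    buffer = pd.read_excel(pth + filename, sheet_name=sht, dtype=np.float32)
    # same two-row aggregation as in the snippet below
    buffer.groupby(buffer.index // 2).sum().to_parquet(f"{sht}.parquet")
    del buffer  # free the raw sheet before touching the next one

# later, once the aggregated pieces are small enough to fit in memory together:
df = pd.concat([pd.read_parquet(f"{sht}.parquet") for sht in list_with_sheetnames],
               ignore_index=True)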
If it all fits in memory, then you can read everything with:
import pandas as pd

dfs = []  # one aggregated dataframe per sheet
for sht in list_with_sheetnames:
    buffer = pd.read_excel(pth + filename, sheet_name=sht)
    # sum every pair of consecutive rows before keeping the sheet
    dfs.append(buffer.groupby(buffer.index // 2).sum())
    del buffer  # drop the raw sheet before reading the next one
df = pd.concat(dfs, ignore_index=True)  # single df from the list of pieces
The aggregation works by making a grouper that increases every two rows and then summing the rows within each group. After you read the data from Excel:
df = df.groupby(df.index // 2).sum()
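On a small made-up frame you can see what that grouper produces (just an illustration, not part of your data):

import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6]})
print(demo.index // 2)                      # [0, 0, 1, 1, 2, 2]: every two rows share a label
print(demo.groupby(demo.index // 2).sum())  # pairwise sums: 3, 7, 11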