
I am having trouble reading and writing moderately sized Excel files in pandas. I have 5 files, each around 300 MB. I need to combine these files into one, do some processing and then save the result (preferably as Excel):

import pandas as pd

f1 = pd.read_excel('File_1.xlsx')
f2 = pd.read_excel('File_2.xlsx')
f3 = pd.read_excel('File_3.xlsx')
f4 = pd.read_excel('File_4.xlsx')
f5 = pd.read_excel('File_5.xlsx')

FULL = pd.concat([f1,f2,f3,f4,f5], axis=0, ignore_index=True, sort=False)
FULL.to_excel('filename.xlsx', index=False)

Unfortunately, reading takes way too much time (around 15 minutes or so), and writing used up 100% of memory (on my 16 GB RAM PC) and was taking so long that I was forced to interrupt the program. Is there any way I could accelerate both reading and writing?

Ach113
  • The problem is the attempt to do everything in memory. You've loaded 5*300MB, that's 1.5GB if not 3GB and more. `xlsx` is a ZIP package containing XML files, so the actual data size can be a lot bigger. Then you create a concatenated frame, that's another 1.5GB (or 3GB) in RAM. Then you try to export this in one go, which means generating the XML content in memory before saving it to a new ZIP package – Panagiotis Kanavos Jan 23 '20 at 11:11
  • @PanagiotisKanavos I tried using the `del` keyword to delete variables before attempting `to_excel()`, but memory% remained the same – Ach113 Jan 23 '20 at 11:13
  • That's answered [here](https://stackoverflow.com/questions/32247643/how-to-delete-multiple-pandas-python-dataframes-from-memory-to-save-ram). Remove all references to the intermediate dataframes. Loading the files in a loop, using the same variable for the dataframe and appending it to a "master" dataframe should be enough (see the sketch after these comments). You'll still be using 2x-3x more RAM than necessary though – Panagiotis Kanavos Jan 23 '20 at 11:15
  • Check this question - https://stackoverflow.com/questions/37756991/best-way-to-join-two-large-datasets-in-pandas – Dishin H Goyani Jan 23 '20 at 11:15
  • @DishinHGoyani that's what the OP is already doing and got into trouble. – Panagiotis Kanavos Jan 23 '20 at 11:16
  • @Ach113 if all you need is to concatenate files, not process the data, you can use a SAX-like approach to just read data from each file, each row and write it out to the target file. That's how file copying works by the way. Check [this answer](https://stackoverflow.com/a/15793650/134204) for an example – Panagiotis Kanavos Jan 23 '20 at 11:18
  • @PanagiotisKanavos I need to process data as well. – Ach113 Jan 23 '20 at 11:20
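
A minimal sketch of the loop-and-append approach suggested in the comments above, with placeholder file names; note that folding each file into the master frame still keeps roughly twice the data in RAM at the peak:

import pandas as pd

files = ['File_1.xlsx', 'File_2.xlsx', 'File_3.xlsx', 'File_4.xlsx', 'File_5.xlsx']

# Reuse a single variable for each file and fold it into one "master" frame.
FULL = pd.DataFrame()
for file in files:
    df = pd.read_excel(file)
    FULL = pd.concat([FULL, df], ignore_index=True, sort=False)
    del df  # drop the reference to the per-file frame before the next read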

2 Answers


This post defines a handy `append_df_to_excel()` function.

You can use that function to read the files one by one and append their content to the final excel file. This will save you RAM since you are not going to keep all the files in memory at once.

import pandas as pd

files = ['File_1.xlsx', 'File_2.xlsx', ...]  # list the remaining files here
for file in files:
    df = pd.read_excel(file)                 # only one file's data in memory at a time
    append_df_to_excel('filename.xlsx', df)  # helper from the linked post

Depending on your input files, you may need to pass some extra arguments to the function. Check the linked post for extra info.

Note that you could use `df.to_csv()` with `mode='a'` to append to a CSV file. Most of the time you can swap Excel files for CSV easily. If that is also your case, I would suggest this method instead of the custom function.
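
For instance, a minimal sketch of that CSV-append approach, assuming all five files share the same columns (file names are placeholders and the header is written only for the first file):

import pandas as pd

files = ['File_1.xlsx', 'File_2.xlsx', 'File_3.xlsx', 'File_4.xlsx', 'File_5.xlsx']

for i, file in enumerate(files):
    df = pd.read_excel(file)
    # ...do any per-file processing here...
    df.to_csv('full.csv', mode='a', index=False, header=(i == 0))  # append rows, write header once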

alec_djinn
  • So I just need to create a file (Excel or CSV) called "full.csv" for instance, and then append the read files to it via `to_csv(mode='a')`? – Ach113 Jan 23 '20 at 11:31
  • If your Excel is a simple table, yes: `df.to_csv("file.csv", mode="a")`. Check on Google what a CSV is; if it is enough for you, go for it. I personally prefer CSV over Excel to store tabular data. – alec_djinn Jan 23 '20 at 11:34
  • Ok, I will try that. I have worked with csv before, did not know that writing to it was faster though. – Ach113 Jan 23 '20 at 11:36
  • @Ach113 `csv` is nothing more than a text file with specific delimiters. That means it's a lot larger too, since there's no compression. Concatenating CSVs is essentially the same as concatenating text files. *Stream processing* is a lot faster too as you're just reading single lines. Writing is easy, even in multithreading scenarios, because you're just appending lines to a file, something typically supported at the OS level – Panagiotis Kanavos Jan 23 '20 at 12:01
  • @Ach113 that's why cloud event processing and analytics systems use CSV or JSON-per-line. You can even partition a file simply by seeking to the closest newline, and have each thread (or process) work on a different file block – Panagiotis Kanavos Jan 23 '20 at 12:02

Not ideal (and dependent on the use case), but I've always found it much quicker to open the XLSX in Excel and save it as a CSV file. I tend to do multiple reads on the data, so in the long run the time spent waiting for the XLSX to load outweighs the time it takes to convert the file.
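
A programmatic variant of the same idea, as a rough sketch: pay the slow XLSX read once, cache the result as a CSV, and read the CSV on later runs (the `load_cached` helper and the path handling here are just an illustration):

import os
import pandas as pd

def load_cached(xlsx_path):
    # Convert the workbook to CSV the first time, then reuse the cached CSV.
    csv_path = os.path.splitext(xlsx_path)[0] + '.csv'
    if not os.path.exists(csv_path):
        pd.read_excel(xlsx_path).to_csv(csv_path, index=False)
    return pd.read_csv(csv_path)

df = load_cached('File_1.xlsx')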

morganics