I'm extracting data into an Excel table from 24 text files (ranging in size from 12 KB to 3.5 MB) using Pandas. The total size of the files is 50 MB.
The format of the text files is such that the columns aren't standardized. I read each file line by line, run each line through various logic checks in for loops, and write the result to a dictionary. I then convert each dictionary to a one-row DataFrame and append it to a master DataFrame that grows as I go (roughly the pattern sketched below).
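Roughly, the pattern I'm describing is something like this (the file paths and the `parse_line` helper are simplified placeholders, not my real code):

```python
import glob

import pandas as pd


def parse_line(line):
    """Placeholder for my actual logic checks: returns a dict whose keys
    vary by file and line (the unstandardized columns), or None to skip."""
    parts = line.split()
    return {f"field_{i}": value for i, value in enumerate(parts)} or None


master_df = pd.DataFrame()

for path in glob.glob("data/*.txt"):  # the 24 text files
    with open(path) as f:
        for line in f:
            row = parse_line(line)
            if not row:
                continue
            # one-row DataFrame appended to the growing master frame
            master_df = pd.concat(
                [master_df, pd.DataFrame([row])], ignore_index=True
            )
```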
I have been running the script (just to compile the dataframe without saving to Excel) for the past 3 hours (with 2 files left to go). Separately, I manually processed the 24 files in chunks of 5-10, and each of those runs only took about 10 minutes. I'm a Python memory management noob and would like to understand what is going on and how I can improve.
My guess is that it's taking so long because of the large DataFrame Python has to hold in memory as records are appended to it. If that's the case, I may need to figure out how to write each row straight to Excel instead (something like the sketch below). What complicates this, of course, is the unstandardized columns mentioned above.
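To show what I mean by writing each row straight to Excel, this is the kind of openpyxl approach I have in mind (the fixed `columns` list is exactly the part I don't know how to build up front, since the columns differ from file to file):

```python
from openpyxl import Workbook

# Example rows as they come out of my parsing loop: different keys per row.
rows_as_dicts = [
    {"col_a": 1, "col_b": 2},
    {"col_b": 5, "col_c": 9},
]

wb = Workbook()
ws = wb.active

# The sticking point: I would need the full set of columns up front.
columns = ["col_a", "col_b", "col_c"]
ws.append(columns)  # header row

for row in rows_as_dicts:
    # missing keys become blank cells
    ws.append([row.get(col, "") for col in columns])

wb.save("output.xlsx")
```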
Even if I were to batch-process the files in groups of 4, stitching the resulting DataFrames together in Python (roughly the concat step below) would likely again take a long time.
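For reference, by "stitching them together" I mean roughly this kind of `pd.concat` on per-batch DataFrames (toy data standing in for the real batch results):

```python
import pandas as pd

# Each batch of files produces its own DataFrame, built the same way as
# above; here two toy batch results stand in for the real ones.
batch_frames = [
    pd.DataFrame({"col_a": [1, 2]}),
    pd.DataFrame({"col_a": [3], "col_b": [4]}),
]

# Columns missing from one batch come through as NaN in the combined frame.
combined = pd.concat(batch_frames, ignore_index=True, sort=False)
```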