1

I have 638 Excel files in a directory that are about 3000 KB large, each. I want to concatenate all of them together, hopefully only using Python or command line (no other programming software or languages).

Essentially, this is part of a larger process that involves some simple data manipulation, and I want it all to be doable by just running a single python file (or double clicking batch file).

I've tried variations of the code below - Pandas, openpyxl, and xlrd and they seem to have about the same speed. Converting to csv seems to require VBA which I do not want to get into.

temp_list=[]
for filename in os.listdir(filepath):
    temp = pd.read_excel(filepath + filename,
                        sheet_name=X, usecols=fields)
    temp_list.append(temp)

Are there simpler command line solutions to convert these into csv files or merge into one excel document? Or is this pretty much it, just using the basic libraries to read individual files?

ksoangxhc
  • 21
  • 1
  • 5

1 Answers1

4

.xls(x) is a very (over)complicated format with lots of features and quirks accumulated over the years and is thus rather hard to parse. And it was never designed for speed or for large amounts of data but rather for ease of use for business people.

So with your number of files, your best bet is to convert those to .csv or another easy-to-parse format (or use such a format for data exchange in the first place) -- and preferrably, do this before you get to process them -- e.g. upon a file's arrival.

E.g. this is how you can save the first sheet of a .xls(x) to .csv with pywin32 using Excel's COM interface:

import win32com.client
# Need the typelib metadata to have Excel-specific constants
x = win32com.client.gencache.EnsureDispatch("Excel.Application")
# Need to pass full paths, see https://stackoverflow.com/questions/16394842/excel-can-only-open-file-if-using-absolute-path-why
w = x.Workbooks.Open("<full path to file>")
s = w.Worksheets(1)
s.SaveAs("<full path to file without extension>",win32com.client.constants.xlCSV)
w.Close(False)

Running this in parallel would normally have no effect because the same server process would be reused. You can force creating a different process for each batch as per How can I force python(using win32com) to create a new instance of excel?.

ivan_pozdeev
  • 33,874
  • 19
  • 107
  • 152
  • To add to this answer, you could try to find a non-python command-line tool that converts xls(x) files to csv or tsv and does so quickly (because it's written in something faster than Python), then concatenate the converted files using a simple `cat` (make sure you omit the headers!). Ideally you'd want something that accepts multiple filenames or a wildcard pattern on a single invocation. Check out some of these options: https://linoxide.com/linux-how-to/methods-convert-xlsx-format-files-csv-linux-cli/ – Avish Apr 08 '19 at 20:29
  • @Avish Python's speed is irrelevant here because all the heavy lifting is done by Excel. – ivan_pozdeev Apr 08 '19 at 20:30
  • It's done in Excel if you're using Excel's COM interface to read the XLSX. The OP uses pandas `read_excel`, which (I hope?) isn't using Excel to read the files but rather some other implementation. I'm trying to suggest reading it using some other command-line tool which is faster at reading Excel files than Excel or pandas are. – Avish Apr 08 '19 at 21:04
  • It seems like the file path in the in `workbooks.Add()` defaults to Users/user/Documents for some reason, even though all of my files and code are in a path completely split from Documents. This leads to a com_error which asks me if the files have been moved or replaced. Does this have to do with any of the steps leading up to this point, like `win32com.client.gencache.EnsureDispatch("Excel.Application")` ? – ksoangxhc Apr 08 '19 at 21:29
  • @ksoangxhc Yes, you need to pass a full path, see https://stackoverflow.com/questions/16394842/excel-can-only-open-file-if-using-absolute-path-why why. I also changed the code to use `Open`. (`Add` with parameter is for creating from a template which is what I did in my last project so it kinda stuck in my head; there's no difference in your specific use case though). – ivan_pozdeev Apr 08 '19 at 21:52
  • @ksoangxhc If the answer resolved the issue, please consider [accepting](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) it. – ivan_pozdeev May 10 '19 at 06:17