I'm trying to figure out which is generally faster for the same task: VBA or openpyxl.

I know it probably depends on the task you want to achieve, but let's say I have a table that is 50 columns wide and 150,000 rows tall and I want to copy it from workbook A to workbook B.

Any thoughts on whether Python will do better, or whether Excel is better at dealing with itself?

My gut tells me that Python should be noticeably faster, for a few reasons:

  • For a VBA sub to copy from one workbook to another, both have to be open and running, whereas with Python I can simply load both;
  • VBA has to deal with a lot of clutter with most tasks and it takes A LOT of system resources

Besides that, I'd like to know if I can make some further improvements to an openpyxl script, like multithreading or perhaps using NumPy along with it.

Thanks for the help!

Vitu Tomaz
  • The simplest way to answer your question is to time both approaches. No need to guess. "VBA has to deal with a lot of clutter with most tasks and it takes A LOT of system resources" - what is this based on? – Tim Williams Feb 13 '16 at 19:39
  • I'm not the most experienced VBA coder, but from my experience and from what I know, making VBA efficient is a very hard task, and very often it gets really slow when managing lots of data. Also, I started writing this routine with both tools, and in the case of VBA I need both spreadsheets open, I have to activate them every time I need to handle the other one, it refreshes instantly, etc., etc. – Vitu Tomaz Feb 13 '16 at 21:08
  • I would say it's definitely possible to make VBA *inefficient* if you're not experienced with it, but likely the same can be said for any language. One big plus VBA has going for it in performance terms is that it runs in the same process as Excel, so there is no cross-process overhead associated with automating Excel from VBA: this can become quite significant if you make a lot of calls to Excel from another process. Working with arrays instead of cell-by-cell whenever possible can help minimize this but it's definitely a factor. – Tim Williams Feb 15 '16 at 00:29
  • I think that's my case. I'm much more comfortable with Python than I am with VBA. Besides, I think I would only have to make a single call to each file – Vitu Tomaz Feb 16 '16 at 00:53

3 Answers


TBH the fastest approach would probably be remote controlling Excel using xlwings, because this can take advantage of Excel's optimisation. VBA might be able to hook into that as well but I've never found VBA to be fast.
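As a minimal sketch of what that remote-control approach could look like with xlwings (the filenames and sheet index are placeholders; the point is one bulk read and one bulk write instead of millions of individual cell accesses):

```python
import xlwings as xw

# Drive a real (hidden) Excel instance via COM
app = xw.App(visible=False)
try:
    wb_a = app.books.open("workbook_a.xlsx")  # placeholder filenames
    wb_b = app.books.open("workbook_b.xlsx")

    # One bulk read and one bulk write, rather than touching cells one at a time
    data = wb_a.sheets[0].range("A1").expand().value
    wb_b.sheets[0].range("A1").value = data

    wb_b.save()
finally:
    app.quit()
```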

Python will have to convert from XML to Python and back to XML. You've got around 7,500,000 cells (50 × 150,000), so I'd expect this to take about a minute on my machine. I'd suggest combining read-only and write-only modes to do this to keep memory use low.
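A minimal sketch of that combination (the filenames are placeholders):

```python
from openpyxl import load_workbook, Workbook

# Read-only mode streams rows from the source instead of building the full cell tree
src = load_workbook("workbook_a.xlsx", read_only=True)  # placeholder filename
ws_in = src.active

# Write-only mode streams rows straight out, so memory use stays flat
dst = Workbook(write_only=True)
ws_out = dst.create_sheet()

for row in ws_in.iter_rows():
    ws_out.append([cell.value for cell in row])

dst.save("workbook_b.xlsx")
src.close()
```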

If you only have numerical data (no dates) then you might be able to find a shortcut and "transplant" the relevant worksheet XML file from one Excel file to another and just alter the relevant metadata.

Charlie Clark
  • I didn't know xlwings, but I'll surely take a look! I didn't think about the downside of converting between XML and Python. What do you mean by "_combining read-only and write-only modes_"? And the spreadsheet has a lot of text and dates.... Anyway, thanks a lot for the help! – Vitu Tomaz Feb 14 '16 at 14:03
  • Some pseudo-code: `wb1 = load_workbook("file.xlsx", read_only=True); wb2 = Workbook(write_only=True); ws1 = wb1.active; ws2 = wb2.create_sheet(); for row in ws1.iter_rows(): ws2.append([c.value for c in row])`. Having dates will slow things down a bit because Excel requires them to be formatted. – Charlie Clark Feb 14 '16 at 15:19
  • Have you ever tried [making a direct data connection to Excel](http://stackoverflow.com/a/40332696/111794)? – Zev Spitz Oct 31 '16 at 20:07

TL;DR Consider making a direct data connection to the Excel file (ADO in VBA or Python+PyWin32, pyodbc in Python, or the .NET OleDbConnection class, among others). The language in which you make such a connection is much less relevant.

Long version

If all you want is to work with the data itself, you might want to consider a direct connection to Excel using ADO, pyodbc, or the .NET OleDbConnection class.

Automating the Excel application (with the Microsoft Excel object model, or (presumably) with xlwings) incurs a lot of overhead, which is understandable, because you might not only be reading the data in the Excel file, but also manipulating all the objects in the Excel UI — windows, menus — as well as objects beyond the data, such as formatting on individual cells or ranges.

It's true that openpyxl doesn't have all this overhead of UI elements, because it's reading the file directly, but I'm presuming there is still some overhead incurred because openpyxl has to make available all the information in the file, which is more than just the data — cell formatting, for example.

Making a data connection also allows you to treat the Excel file as a database, to which you can issue SQL statements, with all the power of SQL -- joins, sorting, grouping, aggregates.
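For the Python side, a sketch using pyodbc, assuming the Microsoft Excel ODBC driver is installed on the machine (the driver name, file path, and sheet name below are placeholders; check pyodbc.drivers() for what's actually available):

```python
import pyodbc

# Connection string for the Excel ODBC driver; the exact driver name
# depends on what is installed (see pyodbc.drivers())
conn_str = (
    r"Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};"
    r"DBQ=C:\data\workbook_a.xlsx;"
)
conn = pyodbc.connect(conn_str, autocommit=True)
cursor = conn.cursor()

# A worksheet is addressed like a table: the sheet name plus a trailing $
cursor.execute("SELECT * FROM [Sheet1$]")
rows = cursor.fetchall()
conn.close()
```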

See here for an example using ADO and VBA.

Zev Spitz

With openpyxl ...

This link was really helpful for me:

https://blog.dchidell.com/2019/06/24/openpyxl-poor-performance-optimisation/

  1. Use read_only when opening the file if all you're doing is reading.

  2. Use the built-in iterators! I cannot stress this enough - the iterators are fast, crazy fast. (There's a short sketch of points 1 and 2 after this list.)

  3. Call functions as infrequently as possible and store intermediate data in variables. It may bulk the code up a bit, but it tends to be more efficient and also allows your code to be more readable (but this is icing on the cake compared to points 1 and 2). Python can also be ambiguous as to what is a variable and what is a function, but as a general rule, intermediate variables are good in place of multiple function calls.
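A short sketch of points 1 and 2 together (the filename is a placeholder):

```python
from openpyxl import load_workbook

# read_only streams the sheet; data_only returns cached values instead of formulas
wb = load_workbook("big_file.xlsx", read_only=True, data_only=True)
ws = wb.active

rows_read = 0
for row in ws.iter_rows():  # the built-in iterator: a fast, forward-only stream
    values = [cell.value for cell in row]  # intermediate variable, per point 3
    rows_read += 1

wb.close()
print(rows_read, "rows read")
```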

I was doing some reading of values in a particular workbook, and I did this initially:

wb = load_workbook(filename)

And that would take nearly 80 seconds. Caching the workbook between actions was helpful, but it was still painful every time I reloaded my script.

I switched to read-only mode.

wb = load_workbook(filename, data_only=True, read_only=True)

Now it only takes < 0.1 seconds.

phyatt