0

I wrote a simple example. this script reads an excel file and copy from it only the rows where, in the first column, there is one of the strings specified in the list "ls". the copied rows will be paste in another excel file. below my code:

from datetime import datetime
import openpyxl

print("Current Time =", datetime.now().strftime("%H:%M:%S"))
src_wb = openpyxl.load_workbook("C:\\Users\\Admin\\Desktop\\database.xlsx")
print("Current Time =", datetime.now().strftime("%H:%M:%S"))
print("\n\n")

src_ws = src_wb[src_wb.sheetnames[0]]

dst_wb = openpyxl.Workbook()
dst_ws = dst_wb.active
dst_ws.title = "TEST"

dst_row_id, src_row_id = 0, 0 
ls = ["ciao", "hola", "hello"]
my_column_id = 1

for row in src_ws.rows:
    src_row_id += 1
    #print("Current Time =", datetime.now().strftime("%H:%M:%S"))

    for element in ls:
            if src_ws.cell(src_row_id, my_column_id).value == element:
                dst_row_id += 1
                dst_column_id = 0
                for cell in row:

                    dst_column_id += 1
                    dst_cell = dst_ws.cell(dst_row_id,column=dst_column_id)

                    dst_cell.value = cell.value
                break

dst_wb.save("test.xlsx")
dst_wb.close()

below the output:

λ python stack.py
Current Time = 19:36:38
Current Time = 19:37:20 <- after this point the script is quite fast

this code works good, but when my source excel files are heavy, the instruction src_wb = openpyxl.load_workbook("C:\\Users\\Admin\\Desktop\\database.xlsx") takes many times. in my example, the "database.xlsx" file has 177137 rows, and 14 columns. to open it, openpyxl takes 42 secons, but when it has to read the rows, it seems quite fast.

to improve the situation I tried to use the "read only" mode (openpyxl.load_workbook("C:\\Users\\Admin\\Desktop\\database.xlsx", read_only = True) but in this way the script has become too slow.. it can open the "database.xlsx" file quickly, but in the "copy and paste" is very slow (it's so slow that to complete the task it needs hours!) see the output below:

λ python stack.py
Current Time = 19:37:47
Current Time = 19:37:52 <-- after this point the script is very slow

How can I improve the script performances in read only mode? or maybe in another way, I don't know.. from your experience, what is the best solution to improve my code?

@Daniel Ocando: my excel file is composed by simple strings. below an example:

enter image description here

TurboC
  • 777
  • 1
  • 5
  • 19
  • What are the contexts of the cells in the file? Would you mind showing a sample of how the Excel file looks like? – Daniel Ocando Jun 07 '20 at 18:52
  • # Daniel Ocando, my "database.xlsx" file is composed by simple strings, nothing of special. see my attached screenshot (it's just an example). – TurboC Jun 07 '20 at 19:24
  • Your problems with read-only mode are because you are not following the advice for using it. – Charlie Clark Jun 08 '20 at 15:09

1 Answers1

3

The code you offered has three nested for loops (which in general will mean that the code has an O(n3) time complexity). I managed to remove one of the for loops and therefore reduce the time complexity to O(n2).

You can take advantage that sets are generally faster than lists regarding membership tests and you can change the the ls = ["ciao", "hola", "hello"] list into a set.

Finally, you could take advantage of the write_only mode to try to make the writing much faster.

Taking into consideration that the cells only contain a string with simple words you could try the following code and check if the performance improves:

from datetime import datetime
import openpyxl

def get_list(row):
    list_for_row_values = []
    for cell in row:
        list_for_row_values.append(cell.value)
    return list_for_row_values

print("Current Time =", datetime.now().strftime("%H:%M:%S"))
src_wb = openpyxl.load_workbook("C:\\Users\\Admin\\Desktop\\database.xlsx", read_only=True)
print("Current Time =", datetime.now().strftime("%H:%M:%S"))
print("\n\n")

src_ws = src_wb[src_wb.sheetnames[0]]

dst_wb = openpyxl.Workbook(write_only=True)
dst_ws = dst_wb.create_sheet()

ls = ("ciao", "hola", "hello")

for row in src_ws.rows:
    if(row[0].value) in ls:
        list_for_row_values = get_list(row)
        dst_ws.append(list_for_row_values)

dst_wb.save("test.xlsx")
Daniel Ocando
  • 3,554
  • 2
  • 11
  • 19
  • @ Daniel Ocando: whaoo Daniel, it works! I just adapted my real code with yours and it seems working good. I also added a cache to keep from adding duplicates rows (it's just a simple dictionary). I don't know so much about openpyxl, but it seems that the main difference between my code and yours is the choise to write through the "append" method than the "cell" one. probably in "read only" mode the "cell" method has some performance issues. – TurboC Jun 08 '20 at 00:16
  • Daniel, just to know, in your code, is it possible to copy the cells style too? if it doesn't affect the performances it would be nice to add this feature too. – TurboC Jun 08 '20 at 00:17
  • You'll need to use the copy() module. Find all the information in this other [post](https://stackoverflow.com/questions/23332259/copy-cell-style-openpyxl). – Daniel Ocando Jun 08 '20 at 00:36
  • The point about sets being faster than lists for containment checks is only theoretically valid and almost certainly not relevant here – it's good practice but also premature optimisation. – Charlie Clark Jun 08 '20 at 15:12