I'm attempting to extract data from each row of an Excel sheet whenever the number I test in that row falls within one of the predefined ranges I've created. The code works, but it moves at a crawl. The sheet I'm iterating over contains 200,000 rows of data, and I've noticed that when I break it down into smaller segments the program executes faster, though still not as fast as I would like. I can't quite figure out why that is either.

Any recommendations on how I might get this to speed up?

import xlsxwriter
import openpyxl
import os
from tqdm import tqdm

os.chdir(r'C:\Users\JTNeumay\Desktop\test folder')
xlfile = 'test.xlsx'
wb = openpyxl.load_workbook(xlfile, read_only=True)
sheet = wb['Sheet1']

newbook = xlsxwriter.Workbook('Commercial.xlsx')
newbook2 = xlsxwriter.Workbook('Industrial.xlsx')
newsheet = newbook.add_worksheet()
newsheet2 = newbook2.add_worksheet()
i = 1  # next output row in Commercial.xlsx
j = 1  # next output row in Industrial.xlsx

for row in tqdm(range(1, sheet.max_row + 1)):
    check = sheet.cell(row=row, column=11).value
    if check is not None:
        if (220000 <= check <= 221310 or 221320 <= check <= 221330 or 237200 <= check <= 237210 or 334611 <= check
                <= 334612 or 420000 <= check <= 493190 or 511210 <= check <= 512210 or 512240 <= check <= 541310 or
                541330 <= check <= 541340 or 541370 <= check <= 541700 or 541720 <= check <= 562900 or 562920 <= check
                <= 811430 or 812000 <= check <= 983000):
            for column in range(1, sheet.max_column + 1):
                newsheet.write(i, column, sheet.cell(row=row, column=column).value)
            i += 1
        elif (210000 <= check <= 213115 or 230000 <= check <= 237130 or 237300 <= check <= 334610 or 334613 <=
                check <= 339999 or 510000 <= check <= 511200 or 512220 <= check <= 512230 or check == 541320 or
                check == 541360 or check == 541710 or check == 562910 or check == 811490):
            for column in range(1, sheet.max_column + 1):
                newsheet2.write(j, column, sheet.cell(row=row, column=column).value)
            j += 1
        else:
            pass
    else:
        pass
newbook.close()
newbook2.close()

Note that I also tried a method in which I created two lists containing the ranges from the if statements, and found that that approach took much longer.
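
Roughly, the list-based version looked something like the sketch below (reconstructed here from the ranges in the if statements above; the classify helper and the two list names are only for illustration):

# Reconstructed range lists (hypothetical names); each entry is a (low, high) pair
commercial_ranges = [(220000, 221310), (221320, 221330), (237200, 237210),
                     (334611, 334612), (420000, 493190), (511210, 512210),
                     (512240, 541310), (541330, 541340), (541370, 541700),
                     (541720, 562900), (562920, 811430), (812000, 983000)]
industrial_ranges = [(210000, 213115), (230000, 237130), (237300, 334610),
                     (334613, 339999), (510000, 511200), (512220, 512230),
                     (541320, 541320), (541360, 541360), (541710, 541710),
                     (562910, 562910), (811490, 811490)]

def classify(check):
    # Returns 'commercial', 'industrial', or None; the test stays linear in the
    # number of ranges, not in the number of values they contain.
    if any(lo <= check <= hi for lo, hi in commercial_ranges):
        return 'commercial'
    if any(lo <= check <= hi for lo, hi in industrial_ranges):
        return 'industrial'
    return None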

STRAT0CAST3R
  • Can you provide a sample of the sheet? Thanks. – Deepak Saini Nov 21 '18 at 04:05
  • I unfortunately cannot, it's for work so the information is proprietary. Unless you mean a sample of the way the sheet is formatted, which is that there is a 6-digit number in the 11th column and the rest of the columns have info that is not important to the program. – STRAT0CAST3R Nov 21 '18 at 18:56
  • Have you tried doing the reading with xlrd or sxl? If you are going to need to load all of the data, xlrd is likely to be fastest, though I would be surprised if it's multiple times faster than openpyxl. Also, if your code is indeed running exponentially slower as the size gets big, that sounds to me like you're running into memory problems (i.e. getting to the point where the operating system is swapping in and out of disk). – John Y Jan 25 '19 at 00:07
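
For reference, reading the same sheet with xlrd, as suggested in the comment above, might look roughly like this minimal sketch (assuming an xlrd version before 2.0, which still reads .xlsx files; file and sheet names are taken from the question):

import xlrd  # xlrd < 2.0 assumed; later releases dropped .xlsx support

book = xlrd.open_workbook('test.xlsx')
sheet = book.sheet_by_name('Sheet1')

for rowx in range(sheet.nrows):
    check = sheet.cell_value(rowx, 10)   # column 11 is index 10 in xlrd (0-based)
    if check:
        row_values = sheet.row_values(rowx)
        # ... classify the row and write it out, as in the original code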

1 Answer


Question: Speed Up Loop Time Iterating

Try this openpyxl-only solution:

Come back and report your speed experience.

from openpyxl import Workbook, load_workbook

# Source workbook (file and sheet names taken from the question)
wb = load_workbook('test.xlsx')
ws = wb['Sheet1']

# Create new Workbooks
wb1 = Workbook()
ws1 = wb1.active
wb2 = Workbook()
ws2 = wb2.active

# Predefine Ranges and Target Worksheet
# You can gain additional speed, by sorting the range tuples 
# of the most expected Ranges, to the beginning.
ranges = [(2, 3, ws1), (221320, 221330, ws2), (237200, 237210, ws1), (812000, 983000, ws2)]

# The Row to start
min_row = 2

# Iterate all Cells in Column 'min/max_col'
for row, cell in enumerate(next(ws.iter_cols(min_row=min_row, min_col=2, max_col=2)), min_row):

    if cell.value:
        # Condition: If cell.value in Range(...)
        for r in ranges:
            if (r[0] <= cell.value <= r[1]):
                # Match: Append this 'row' to the given Worksheet
                r[2].append([c.value for c in next(ws.iter_rows(min_row=row, max_row=row))])
                break

# Save Worksheets
wb1.save('../test/test1.xlsx')
wb2.save('../test/test2.xlsx')

Tested with Python: 3.4.2 - openpyxl: 2.4.1 - LibreOffice: 4.3.3.2

stovfl
    Gives me just about the same speed that I was getting previously – STRAT0CAST3R Nov 19 '18 at 22:37
  • Since it's possible to split your input `xlsx` into chunks, `multiprocessing` is an option. But you have to pay for it, meaning 2 processes do not result in **half** the execution time. – stovfl Nov 20 '18 at 16:15
  • I ended up using a concurrent.futures.ProcessPoolExecutor map and broke up the source data into multiple sheets to get it to multiprocess, and they did run concurrently (see the sketch after these comments). The thing I can't figure out is why analyzing a larger number of cells causes the program to run exponentially longer. It's as if it's loading the whole sheet with each iteration – STRAT0CAST3R Nov 21 '18 at 22:01
  • *"why analyzing a larger number of cells ... run exponentially longer"*: Read [openpyxl-read-excel-too-slow](https://stackoverflow.com/questions/42756266/python-openpyxl-read-excel-too-slow/42828508?s=2|26.0084#42828508) and [openpyxl-optimizing-cells-search-speed](https://stackoverflow.com/questions/34054272/openpyxl-optimizing-cells-search-speed/34060751?s=3|22.2088#34060751) – stovfl Nov 22 '18 at 08:09