9

I'm using openpyxl in Python, and I'm trying to run through 50k lines, grab data from each row, and place it into a file. However, what I'm finding is that it runs incredibly slowly the farther I get into it. The first 1k lines go super fast, less than a minute, but after that it takes longer and longer to do the next 1k lines.

I was opening a .xlsx file. I wonder if it would be faster to open the data as a .txt file, as a CSV, or as JSON, or to convert it somehow into something that reads faster?

I have 20 unique names in a given column, and each row pairs one of those names with a value. For each name I'm trying to build a single comma-separated string of all its values, something like:

Value1: 1243,345,34,124, Value2: 1243,345,34,124, etc, etc

I'm running through the value list and checking whether a file for that name already exists. If it does, I append the new value to it; if it doesn't, I create the file and then open it for appending. I keep a dictionary of all the open append-mode file handles, so any time I want to write something I look the file up by name in the dict and write to it, instead of opening a new file handle every time.

The first 1k took less than a minute, but now I'm on the 4k to 5k records and it has already been running 5 minutes. It seems to take longer as it goes up in records, and I wonder how to speed it up. It's not printing to the console at all.

writeFile = 1
theDict = {}

for row in ws.iter_rows(rowRange):
    for cell in row:
        #grabbing the value
        theStringValueLocation = "B" + str(counter)
        theValue = ws[theStringValueLocation].value
        theName = cell.value
        textfilename = theName + ".txt"

        if os.path.isfile(textfilename):
            listToAddTo = theDict[theName]
            listToAddTo.write("," + theValue)
            if counter % 1000 == 0:
                # progress marker and timestamp every 1,000 rows
                print counter
                st = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                print st

        else:
            writeFileName = open(textfilename, 'w')
            writeFileName.write(theValue)
            writeFileName = open(textfilename, 'a')
            theDict[theName] = writeFileName
        counter = counter + 1

I added some timestamps to the code above (the check that prints a marker every 1,000 rows) and you can see the output below. The problem I'm seeing is that each 1k block takes longer than the one before: 2 minutes the first time, then 3 minutes, then 5 minutes, then 7 minutes. By the time it hits 50k I'm worried it's going to take an hour or more, which is far too long.

1000
2016-02-25 15:15:08
2000
2016-02-25 15:17:07
3000
2016-02-25 15:20:52
4000
2016-02-25 15:25:28
5000
2016-02-25 15:32:00
6000
2016-02-25 15:40:02
7000
2016-02-25 15:51:34
8000
2016-02-25 16:03:29
9000
2016-02-25 16:18:52
10000
2016-02-25 16:35:30

Some things I should make clear: I don't know the names of the values ahead of time. Maybe I should run through and grab those in a separate Python script first to make this go faster?

Second, I need a string of all the values separated by commas; that's why I put them into a text file to grab later. I was thinking of doing it with a list, as was suggested to me, but I'm wondering if that would have the same problem. I think the problem has to do with reading from Excel. If there is another way to get a comma-separated string out of it, I'm happy to do it another way.
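
This is roughly what I mean by doing it with a list per name instead of a file per name (an untested sketch, just to show the shape I'm after; the example data is made up):

values_by_name = {}

# inside the row loop, for each (name, value) pair read from the sheet:
name, value = "Value1", "1243"   # example data, not from the real sheet
values_by_name.setdefault(name, []).append(value)

# at the end, one comma-separated string per name, e.g.:
# ",".join(values_by_name["Value1"])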

Or maybe I could do try/except instead of checking whether the file exists every time, and if there is a KeyError I can assume I need to create a new file? Maybe the existence check on every row is what is making it go really slow?
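
Something like this is what I have in mind (an untested sketch, reusing the theDict, theName, and theValue names from my code above):

try:
    # fast path: a handle for this name is already in the dict
    theDict[theName].write("," + theValue)
except KeyError:
    # first time this name shows up: create the file and remember the handle
    f = open(theName + ".txt", "a")
    f.write(theValue)
    theDict[theName] = f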

This question is a continuation of my original question here, where I took some suggestions from the answers: What is the fastest performance tuple for large data sets in python?

king
  • The indentation of your code is wrong, which means we cannot tell how your loops are structured. Why are you iterating over cells and then anyway accessing a cell in column B of some row? I'm going to guess that `ws["B" + str(counter)].value` runs in linear time wrt. `counter`, not in constant time. – roeland Feb 26 '16 at 01:39
  • no the indentation is not wrong. you got it.. i'm grabbing two values. the B one is where the "value" is. the "cell.value" is the name. i'm using the counter to grab the one on the same row as each other. – king Feb 26 '16 at 04:18
  • oh yeah you are right.. there should be an indent after for cell in row.. i will fix it – king Feb 26 '16 at 04:27
  • Can't tell what you are trying to do. It might help to provide a few rows/columns of the input spreadsheet and what one or two output files should look like. Also, how many files are you opening? – RootTwo Feb 26 '16 at 07:42
  • The performance should be linear. If this isn't the case then you probably have a nested loop somewhere. – Charlie Clark Feb 26 '16 at 09:09
  • @RootTwo it's around 20 files. the beginning is just opening the spreadsheet and all that. if the file doesn't exist with that "name", it will create a new file, then create an instance of a write file inside a dictionary so i can access it. then every time that name hits in the excel sheet, it knows to grab the IP, then it will add that ip to the instance of write file that has been opened for it – king Feb 26 '16 at 14:44
  • You aren't closing your file handles, and in fact, you're storing each file handle in the dict; this is most likely your problem. Close your file handles and reopen them when needed vs leaving them open – pyInTheSky Mar 05 '16 at 03:14
  • @king, first of all i would definitely switch to CSV files as it will work much-much faster. Beside that i would use Pandas for this task - it will simplify the whole thing dramatically. I did a test with a CSV file (185MiB, 12 columns, 2.1 millions rows, url: http://sdm.lbl.gov/fastbit/data/star2000.csv.gz) - i read one column and saved it to another CSV file, using Pandas - it took approx. 20 seconds on my slow home notebook. – MaxU - stand with Ukraine Mar 06 '16 at 11:20
  • @king, i could write a working prototype if you would provide a sample CSV file (or a generator, which would generate such a file) and a sample of expected output file(s)... – MaxU - stand with Ukraine Mar 06 '16 at 11:21

3 Answers

5

I think what you're trying to do is get a key out of column B of the row, and use that for the filename to append to. Let's speed it up a lot:

from collections import defaultdict
Value_entries = defaultdict(list) # dict of lists of row data

for row in ws.iter_rows(rowRange):
    key = row[1].value

    Value_entries[key].extend([cell.value for cell in row])

# All done. Now write files:
for key in Value_entries.keys():
    with open(key + '.txt', 'w') as f:
        f.write(','.join(Value_entries[key]))

aghast
  • You don't need to calculate A1 style coordinates. `ws.cell(…)` accepts numeric (1-based) values for rows and columns. But you don't need either within iter_rows as you can rely on enumerate or simple indexing: cells from 'B' would be row[1], assuming the range starts at 'A'. – Charlie Clark Feb 28 '16 at 14:49
  • key = row[1].value IndexError: tuple index out of range – king Feb 29 '16 at 14:37
  • you are right.. i'm trying to pull a value out of B.. i see you are doing that with row[1].value.. although i'm not sure why it's bringing an error – king Feb 29 '16 at 15:10
  • okay.. i see that it is only able to pull row[0], and that pulls only the value that is in the row range. for example, my row range is rowRange = "K2:K" + str(4000).. that is for the first 2k records, so i'm pulling column K out.... that is why it is bringing an error.. that's why i did the hacky way to grab B.. do you think it's better to open the range up? – king Feb 29 '16 at 15:28
  • 1
    It's better to do whatever gets the job done quick. Try timing it both ways, see which one runs faster. – aghast Feb 29 '16 at 22:44
  • i don't understand exactly, is this creating a new list every time, or how is this working? on this line, you put another for loop inside of it? is this creating a bunch of different text files? it looks like it is using just one. there are 2 values being grabbed here. – king Mar 01 '16 at 18:54
  • Value_entries[key].extend([cell.value for cell in row]) – king Mar 01 '16 at 18:54
  • That line creates a temporary list, [cell.value for cell in row], and then passes that temporary to Value_entries[key].extend() (the dict lookup returns a list), which appends the contents of the temporary list to that value-entries list. – aghast Mar 01 '16 at 22:26
2

It looks like you only want cells from the B-column. In this case you can use ws.get_squared_range() to restrict the number of cells to look at.

for row in ws.get_squared_range(min_col=2, max_col=2, min_row=1, max_row=ws.max_row):
    for cell in row:  # each row is always a sequence
        filename = cell.value
        if os.path.isfile(filename):
            …

It's not clear what's happening with the else branch of your code, but you should probably be closing any files you open as soon as you have finished with them.
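
For what it's worth, a minimal sketch of keeping the handles in a dict and making sure they all get closed; the get_handle helper and the handles name are illustrative, not from the question's code:

handles = {}

def get_handle(name):
    # open each file at most once, in append mode
    # ('a' creates the file if it does not exist yet)
    if name not in handles:
        handles[name] = open(name + ".txt", "a")
    return handles[name]

# inside the row loop the write would become:
#     get_handle(theName).write("," + theValue)

# and once the loop has finished, close every handle that was opened:
for handle in handles.values():
    handle.close()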

Charlie Clark
  • the else branch is if the file doesn't exist: it creates the file and then leaves it open so that i can keep adding "," and the new value to the end of each file. it's so it doesn't have to keep reopening the file every time, it just leaves them open. there are around 20 files that will be open – king Feb 26 '16 at 14:42
  • 1
    Instead of providing an incomplete script here with timestamps, I suggest you use the profile module from the standard library to find out where your code is spending most of its time (a minimal invocation is sketched after this comment thread). – Charlie Clark Feb 26 '16 at 15:23
  • dude that's tight.. i didn't know they had that – king Feb 26 '16 at 15:49
  • is there a super fast way to set that up? everything i'm looking at seems complicated – king Feb 26 '16 at 15:53
  • Well, to be honest your code is too horrible to look at. So you can probably profit a lot from a significant refactoring. openpyxl should take only a few seconds at most to do what it looks like you might want to do. – Charlie Clark Feb 26 '16 at 18:02
  • ok i removed the extra prints from there. the reason it is complicated is because i have to separate each name and create a list of each value that corresponds with that name, with a "," between each value... i don't know how else you would do it... that's why i asked – king Feb 26 '16 at 18:47
  • i executed this.. it goes fast until it starts getting up higher in the 1000s, then it goes slow again. i tried writing to file every 1k iterations, deleting the list, then recreating the list again, but that did not work either.. still goes slow.. i thought programming would be faster than this lol – king Mar 01 '16 at 21:18
  • If you have non-linear performance then you are creating nested loops somewhere. Time to break your code down into parts that you can test reliably and then put them together. And only you can do this. At the moment you're trying to do too much that you don't understand at once. – Charlie Clark Mar 02 '16 at 08:23
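
As a footnote to the profiling suggestion above, here is a minimal way to run the standard-library profiler over a script; main and myscript.py are placeholder names, not from the question:

import cProfile

def main():
    # placeholder for the actual spreadsheet-processing code
    pass

if __name__ == "__main__":
    # sort='cumulative' shows where the script spends most of its total time;
    # the command-line equivalent is: python -m cProfile -s cumulative myscript.py
    cProfile.run('main()', sort='cumulative')
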
2

Based on the other question you linked to, and the code above, it appears you have a spreadsheet of name-value pairs. The name is in column A and the value is in column B. A name can appear multiple times in column A, and there can be a different value in column B each time. The goal is to create a list of all the values that show up for each name.

First, a few observations on the code above:

  1. counter is never initialized in the code shown. Presumably it starts at 1.

  2. open(textfilename, ...) is called twice without closing the file in between. Each call to open allocates memory (and an operating-system file handle) for working with the file, and the resources from the first call may not be released until much later, possibly not until the program ends. It is better practice to close files when you are done with them (see using open as a context manager).

  3. The looping logic isn't correct. Consider:

First iteration of inner loop:

for cell in row:                        # cell refers to A1
    valueLocation = "B" + str(counter)  # valueLocation is "B1"
    value = ws[valueLocation].value     # value gets contents of cell B1
    name = cell.value                   # name gets contents of cell A1
    textfilename = name + ".txt"
    ...
    opens file with name based on contents of cell A1, and
    writes value from cell B1 to the file
    ...
    counter = counter + 1                        # counter = 2

But each row has at least two cells, so on the second iteration of the inner loop:

for cell in row:                          # cell now refers to cell B1
    valueLocation = "B" + str(counter)    # valueLocation is "B2"
    value = ws[valueLocation].value       # value gets contents of cell B2
    name = cell.value                     # name gets contents of cell B1
    textfilename = name + ".txt"
    ...
    opens file with name based on contents of cell "B1"  <<<< wrong file
    writes the value of cell "B2" to the file            <<<< wrong value
    ...
    counter = counter + 1        # counter = 3 when cell B1 is processed

Repeat that for each of the 50K rows. Depending on how many unique values are in column B, the program could end up trying to hold hundreds or thousands of open files (named after the contents of cells A1, B1, A2, B2, ...), which means it gets very slow or the program crashes.

  4. iter_rows() returns a tuple of the cells in the row.

  5. As people suggested in the other question, use a dictionary and lists to store the values and write them all out at the end, like so (I'm using Python 3.5, so you may have to adjust this if you are using 2.7).

Here is a straightforward solution:

from collections import defaultdict

data = defaultdict(list)

# gather the values into lists associated with each name
# data will look like { 'name1':['value1', 'value42', ...],
#                       'name2':['value7', 'value23', ...],
#                       ...}
for row in ws.iter_rows():
    name = row[0].value
    value = row[1].value
    data[name].append(value)

for key,valuelist in data.items():
    # turn list of strings in to a long comma-separated string
    # e.g., ['value1', 'value42', ...] => 'value1,value42, ...'
    value = ",".join(valuelist)

    with open(key + ".txt", "w") as f:
        f.write(value)

RootTwo
  • hi sorry for missing the code on my end, nice write-up, i will try some of it now. but one observation as i am working my way down: for the cell value, i set a range (for all the values above, which i didn't post) of D1:D50000 at the bottom, so the cell value constantly hits each cell value, and then the B value is auto-generated each time based on the counter – king Mar 03 '16 at 16:15
  • why do you use append instead of extend? – king Mar 03 '16 at 18:51
  • `append` adds a single item to the end of a list. `extend` is used when you are adding multiple items (such as a list of items). Here we are adding the single value, so `append` is appropriate. – RootTwo Mar 03 '16 at 19:30
  • what if the value we are adding is an IP, so it would be like 123.123.123 – king Mar 03 '16 at 19:30
  • The IP is still a single "thing". IIRC the values in your excel file are strings. Using extend would append each character individually instead of the whole string at once: [...'1', '2', '3', '.', '1', ...] instead of [ ..., '123.123.123', ...] – RootTwo Mar 03 '16 at 19:50
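
For illustration, a quick sketch of the difference (made-up data, not from the spreadsheet):

values = []
values.append("123.123.123")
# values is now ['123.123.123']

values = []
values.extend("123.123.123")
# values is now ['1', '2', '3', '.', '1', '2', '3', '.', '1', '2', '3']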