
I have a loop that iterates over two lists.

`excel_data` is a list built from an Excel spreadsheet (the new data).

`exists` is a list built from a SQL query (the existing data).

I need to compare the two and run a SQL UPDATE whenever the 2nd column of a row in the 2D list `excel_data` has a changed value for the matching record in `exists`.

Currently, the loop that is taking up the time is

excel_data = [[0 for x in range(2)] for y in range(20000)]
exists = [[0 for x in range(2)] for y in range(20000)]


for i in range(len(excel_data)):
    for j in range(len(exists)):
        if exists[j][0] in excel_data[i][0] and exists[j][1] != excel_data[i][1]:
            print("")

Since both lists are slightly under 20,000 entries and will only grow over time, this amounts to roughly 400 million iterations. It completes after a few minutes, but I was wondering whether there is any way to substantially reduce the number of iterations, or some shorter way to get the index of a row whose 2nd column differs.

Input data is like this:

Sam Adams   **********@gmail.com
Sammy Adams **********@gmail.com
Samuel Adams    **********@gmail.com
Samantha Adams  **********@gmail.com
Sam Adams   **********@gmail.com

I am checking whether the 2nd column, in this case an email address, has changed in any way for the given name in the same row.
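For reference, the quadratic loop above can be cut to one pass over each list by building a lookup first. This is a minimal sketch, assuming the first column is an exact, unique key (note the original code uses `in`, i.e. a substring test, which this sketch deliberately replaces with exact matching); the sample rows are invented for the example:

```python
exists = [["Sam Adams", "old@gmail.com"],
          ["Sammy Adams", "same@gmail.com"]]
excel_data = [["Sam Adams", "new@gmail.com"],
              ["Sammy Adams", "same@gmail.com"]]

# One dict build plus one scan, instead of a nested loop:
existing = {name: email for name, email in exists}  # name -> stored email

changed = [(name, email)
           for name, email in excel_data
           if name in existing and existing[name] != email]

print(changed)  # [('Sam Adams', 'new@gmail.com')]
```

Dict lookups are constant time on average, so this is O(n) overall rather than O(n²), and the `changed` list gives exactly the rows to feed into the SQL UPDATE.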

  • 2
    You can make use of appropriate datastructures like sets, or make use of some framework like pandas. – Hielke Walinga Oct 08 '19 at 16:47
  • 2
    please provide a [mcve]. The easiest thing to do is to not iterate over the entire `exists`, and instead, create some data-structure (like a `dict` perhaps) that can map what you need to check in constant time. – juanpa.arrivillaga Oct 08 '19 at 16:48
  • What exactly don't you have to reproduce? –  Oct 08 '19 at 16:49
  • 1
  • @donavin197300 we have essentially **nothing**. We should be able to copy and paste your [mcve] into a text editor, run it locally, and reproduce the behavior you are describing. Doing that would simply give various `NameError`s, because your code is **not complete** and therefore **not reproducible**. That is *your job* to provide when asking a question. In any case, though, your answer is not to use a `list` but a better-suited data structure, likely a `dict` in this case. – juanpa.arrivillaga Oct 08 '19 at 16:52
  • Use/research PostgreSQL FDW ([Foreign data wrappers](https://www.postgresql.org/docs/current/fdwhandler.html))? Pretty sure [(third-party) Excel/CSV wrappers](https://wiki.postgresql.org/wiki/Foreign_data_wrappers) already exist and might be mature enough to also use indexes on that external data. – Raymond Nijland Oct 08 '19 at 17:26
  • ... [file_fdw](https://www.postgresql.org/docs/current/file-fdw.html) also exists, by the way, which can be used to read CSV files on the PostgreSQL server. – Raymond Nijland Oct 08 '19 at 17:34
  • @RaymondNijland Yes, but I would prefer not having to go into the Excel file and export it as CSV. I'll check out the FDW and Excel wrappers. –  Oct 08 '19 at 17:45

1 Answer

-1

With dictionaries:

excel_data = [[x for x in range(y, y + 2)] for y in range(20001)]
exists = [[x for x in range(y, y + 2)] for y in range(20000)]

exists[10][1] = 0  # change at least one element

d1 = dict(excel_data)  # first column -> second column
d2 = dict(exists)


d = {}
for k in d2:  # faster than a comprehension
    if d2[k] != d1[k]:
        d[k] = d1[k], d2[k]

print(len(d))

As has been said in a comment, you can alternatively transform the data into sets of tuples (pairs), or frozensets, and compute the set difference.

d = frozenset(d2.items()) - frozenset(d1.items())

The performance seems slightly lower, yet within the same order of magnitude (and not constant complexity, of course), while dictionaries are somewhat more intuitive and closer to the original version.

As said in the first comments, you can use a library such as pandas to find the diff; see "Outputting difference in two Pandas dataframes side by side - highlighting the difference".
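For illustration, a minimal sketch of the pandas route using `DataFrame.merge` (the column names and sample rows here are made up for the example):

```python
import pandas as pd

new = pd.DataFrame([["Sam Adams", "new@gmail.com"],
                    ["Sammy Adams", "same@gmail.com"]],
                   columns=["name", "email"])
old = pd.DataFrame([["Sam Adams", "old@gmail.com"],
                    ["Sammy Adams", "same@gmail.com"]],
                   columns=["name", "email"])

# Inner join on the key column, then keep rows where the email changed.
merged = new.merge(old, on="name", suffixes=("_new", "_old"))
changed = merged[merged["email_new"] != merged["email_old"]]

print(changed)
```

The `changed` frame then holds one row per record needing a SQL UPDATE, with both the new and old values side by side.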

There are other frameworks for table slicing and dicing, e.g. petl https://petl.readthedocs.io/en/v0.24/case_study_1.html

P.S. Fun fact: the set version is slower. Moreover, after executing the set version, the following commands run slower until the garbage is collected.

Serge
  • How would you get the value of both columns, right now it only gives output for column 1 –  Oct 08 '19 at 17:44
  • I included the columns values – Serge Oct 08 '19 at 17:50
  • No clue why the downvote when this method is exponentially faster... anyways I will look for pandas, I had simply manually parsed the xlsx with openpyxl, probably not the best option and I can now see why. –  Oct 08 '19 at 18:00
  • Thanks, yet it is only polynomially faster, so to say, than your code. – Serge Oct 08 '19 at 18:23