
I have a loop that iterates over two lists.

`excel_data` is a list built from an Excel spreadsheet (the new data).

`exists` is a list built from a SQL query (the existing data).

I need to compare the two and run a SQL UPDATE whenever the 2nd column of a row in the 2D list `excel_data` has a changed value for the matching record in `exists`.

Currently, the loop that is taking up the time is

excel_data = [[0 for x in range(2)] for y in range(20000)]
exists = [[0 for x in range(2)] for y in range(20000)]


for i in range(len(excel_data)):
    for j in range(len(exists)):
        if exists[j][0] in excel_data[i][0] and exists[j][1] != excel_data[i][1]:
            print("")

Since both lists are slightly under 20,000 entries and will only grow over time, this amounts to roughly 400 million iterations. It completes after a few minutes, but I was wondering whether there is any way to substantially reduce the number of iterations, or some shorter way to get the index of a row whose 2nd column differs.

Input data is like this:

Sam Adams   **********@gmail.com
Sammy Adams **********@gmail.com
Samuel Adams    **********@gmail.com
Samantha Adams  **********@gmail.com
Sam Adams   **********@gmail.com

I am checking whether the 2nd column, in this case an email address, has changed in any way for the given name in the same row.
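For reference, the quadratic loop above can be cut to one pass over each list by building a lookup first. This is a minimal sketch, assuming the first column is an exact, unique key (note the original code uses `in`, i.e. a substring test, which this sketch deliberately replaces with exact matching); the sample rows are invented for the example:

```python
exists = [["Sam Adams", "old@gmail.com"],
          ["Sammy Adams", "same@gmail.com"]]
excel_data = [["Sam Adams", "new@gmail.com"],
              ["Sammy Adams", "same@gmail.com"]]

# One dict build plus one scan, instead of a nested loop:
existing = {name: email for name, email in exists}  # name -> stored email

changed = [(name, email)
           for name, email in excel_data
           if name in existing and existing[name] != email]

print(changed)  # [('Sam Adams', 'new@gmail.com')]
```

Dict lookups are constant time on average, so this is O(n) overall rather than O(n²), and the `changed` list gives exactly the rows to feed into the SQL UPDATE.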

  • 2
    You can make use of appropriate datastructures like sets, or make use of some framework like pandas. – Hielke Walinga Oct 08 '19 at 16:47
  • 2
    please provide a [mcve]. The easiest thing to do is to not iterate over the entire `exists`, and instead, create some data-structure (like a `dict` perhaps) that can map what you need to check in constant time. – juanpa.arrivillaga Oct 08 '19 at 16:48
  • What exactly don't you have to reproduce? –  Oct 08 '19 at 16:49
  • 1
  • @donavin197300 we have essentially **nothing**. We should be able to copy and paste your [mcve] into a text editor, run it locally, and reproduce the behavior you are describing. Doing that would simply give various `NameError`s, because your code is **not complete** and therefore **not reproducible**. That is *your job* to provide when asking a question. In any case, though, your answer is not to use a `list` but a better-suited data structure, likely a `dict` in this case. – juanpa.arrivillaga Oct 08 '19 at 16:52
  • Use/research PostgreSQL FDW ([Foreign data wrappers](https://www.postgresql.org/docs/current/fdwhandler.html))? Pretty sure [(third-party) Excel/CSV wrappers](https://wiki.postgresql.org/wiki/Foreign_data_wrappers) already exist and might be mature enough to also use indexes on that external data. – Raymond Nijland Oct 08 '19 at 17:26
  • ... [file_fdw](https://www.postgresql.org/docs/current/file-fdw.html) also exists, by the way, which can be used to read CSV files on the PostgreSQL server. – Raymond Nijland Oct 08 '19 at 17:34
  • @RaymondNijland Yes, but I would prefer not having to go into the Excel file and export it as CSV. I'll check out the FDW and Excel wrappers. –  Oct 08 '19 at 17:45

1 Answer

-1

With dictionaries:

excel_data = [[x for x in range(y, y + 2)] for y in range(20001)]
exists = [[x for x in range(y, y + 2)] for y in range(20000)]

exists[10][1] = 0  # change at least one element

d1 = dict(excel_data)  # first column -> second column
d2 = dict(exists)


d = {}
for k in d2:  # faster than a comprehension
    if d2[k] != d1[k]:
        d[k] = d1[k], d2[k]

print(len(d))

As has been said in a comment, you can alternatively transform the data into sets of tuples (pairs), or frozensets, and compute the set difference.

d = frozenset(d2.items()) - frozenset(d1.items())

The performance seems slightly lower, yet within the same order of magnitude (and not constant complexity, of course), while dictionaries are somewhat more intuitive and closer to the original version.

As said in the first comments, you can use a library such as pandas to find the diff; see "Outputting difference in two Pandas dataframes side by side - highlighting the difference".
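For illustration, a minimal sketch of the pandas route using `DataFrame.merge` (the column names and sample rows here are made up for the example):

```python
import pandas as pd

new = pd.DataFrame([["Sam Adams", "new@gmail.com"],
                    ["Sammy Adams", "same@gmail.com"]],
                   columns=["name", "email"])
old = pd.DataFrame([["Sam Adams", "old@gmail.com"],
                    ["Sammy Adams", "same@gmail.com"]],
                   columns=["name", "email"])

# Inner join on the key column, then keep rows where the email changed.
merged = new.merge(old, on="name", suffixes=("_new", "_old"))
changed = merged[merged["email_new"] != merged["email_old"]]

print(changed)
```

The `changed` frame then holds one row per record needing a SQL UPDATE, with both the new and old values side by side.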

There are other frameworks for table slicing and dicing, e.g. petl https://petl.readthedocs.io/en/v0.24/case_study_1.html

P.S. Fun fact: the set version is slower. Moreover, after executing the set version, the following commands run slower until the garbage is collected.

Serge
  • How would you get the value of both columns, right now it only gives output for column 1 –  Oct 08 '19 at 17:44
  • I included the columns values – Serge Oct 08 '19 at 17:50
  • No clue why the downvote when this method is exponentially faster... anyways I will look for pandas, I had simply manually parsed the xlsx with openpyxl, probably not the best option and I can now see why. –  Oct 08 '19 at 18:00
  • Thanks, yet it is only polynomially faster, so to say, than your code. – Serge Oct 08 '19 at 18:23