1

I have two lists:

  • old_data has a length of 180,000.
  • updated_data has a length of 184,000.
  • Both lists contain IDs which are string values.
  • There is a lot of overlap in IDs stored within the lists but I need to know which ones are no longer stored in updated_data so that I can say those IDs are not longer active in an SQL server.

Therefore, I need to check for any items in old_data that are not in updated_data and save them to a separate list, let's call it inactive_data.

I have the following code currently which is highly time-consuming and inefficient:

# Initialize list for no-longer active IDs to be saved into.
inactive_data = []

# Iteratively check if all IDs in old data are still in updated data.
# If they are not, add them to the list.
for Id in old_data:
    
    if Id not in updated_data:
        
        inactive_data.append(Id)

To run this code, it took approx. 45 minutes. I would to know how I could drastically reduce the time. If you have any suggestions please let me know!

Joey O'Neill
  • 124
  • 1
  • 6
  • 4
    If the order is not important you can convert the lists to sets and find the difference: `set(old_data) - set(updated_data)` – Mark Jul 21 '22 at 03:09
  • [Just to provide an example of using sets](https://tio.run/##K6gsycjPM/7/PzO3IL@oRKEoMS8lP5eLKz8nJT4lsSRRwVYhGiKmB6Iy80o0DHQMDaBAUyEtv0ghXiEzD6QxPVXD0AIsHMtVWgDUnUqeGQZQM7gKikBqc1LzNIpTSzRgTtLUBfGQLdDU1Pz/HwA) – 0x263A Jul 21 '22 at 03:13
  • 2
    @Mark that just reduced it to 0.5 seconds. That blows my mind. Thank you so much. – Joey O'Neill Jul 21 '22 at 03:31

2 Answers2

0

The usual and easy way is to turn one or both of the list into a set.

An alternate way is to sort both lists and compare them element by element. Once they're ordered, the comparison is quick because it's O(n).

Mark Ransom
  • 299,747
  • 42
  • 398
  • 622
0

The trick is to use a set(), sortedcontainers and then use set operations to do the diff. If your dataset is more complex, you might need to index id using a dict, then use a set of the keys.

# from sortedcontainers import SortedSet, SortedDict
old_data      = set(map(str,range(180000)))
updated_data  = set(map(str,range(184000)))
inactive_data = updated_data - old_data
active_data   = updated_data & old_data
print("inactive_data: ", len(inactive_data), set(list(inactive_data)[0:10]))
print("active_data:   ", len(active_data),   set(list(active_data)[0:10]))
inactive_data:    4000 {'181439', '181023', '183834', '180575', '182003', '183226', '180134', '183697', '181968', '181804'}
active_data:    180000 {'54276', '128822', '91802', '8678', '118826', '97510', '22786', '30341', '88711', '137764'}

time 0.609s
  • NOTE: sortedcontainers.SortedSet is half the speed of unordered set()
James McGuigan
  • 7,542
  • 4
  • 26
  • 29