
I have a class in Python that contains a list which can be updated by calling a method. This list has several associated lists, all of which are rebuilt whenever the main list is updated. However, one of the lists has roughly 40,000 items, so rebuilding everything on every update takes a very long time. Instead of remaking all the associated lists each time, I want to determine what has been added to and removed from the main list, and update only those specific parts.

Below is some code that gets what items have been removed, where self.buildings is the list that contains the old list, and globalvars[self.name] gets the current list.

def updatelist(self):
    globalvars = globals()
    removed = []
    for building in self.buildings:
        if building not in globalvars[self.name]:
            removed.append(building)
    return removed

However, while this code functions, it takes a very long time for the list with 40,000 items. I am certain the removed items could be found without using `in` on a list, since its O(n) cost per lookup makes the whole loop O(n²) and therefore unusable at this size.

Finally, I also want some code that can identify what has been appended to the list. Items are never inserted at arbitrary positions, so new items will only ever appear at the end, which should make this easier to code.

A sample input would be:

self.buildings = [[3, 2, 3, 5], [0, 0, 1, 1], [1, 1, 5, 5], [6, 2, 3, 3]]
globalvars[self.name] = [[0, 0, 1, 1], [1, 1, 5, 5], [8, 2, 6, 3], [6, 2, 7, 0]]

An ideal output would be:

removed = [[3, 2, 3, 5], [6, 2, 3, 3]]
appended = [[8, 2, 6, 3], [6, 2, 7, 0]]
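For reference, a minimal sketch of the kind of set-based diff discussed in the comments, which reproduces this output while preserving the original order (assuming the inner lists hold hashable values such as ints):

```python
def diff_lists(old, new):
    """Return (removed, appended), preserving order, using set lookups.

    Assumes each row is a short list of hashable values (here: ints),
    so it can be converted to a tuple for O(1) membership tests.
    """
    old_set = {tuple(row) for row in old}
    new_set = {tuple(row) for row in new}
    removed = [row for row in old if tuple(row) not in new_set]
    appended = [row for row in new if tuple(row) not in old_set]
    return removed, appended

old = [[3, 2, 3, 5], [0, 0, 1, 1], [1, 1, 5, 5], [6, 2, 3, 3]]
new = [[0, 0, 1, 1], [1, 1, 5, 5], [8, 2, 6, 3], [6, 2, 7, 0]]
removed, appended = diff_lists(old, new)
# removed  == [[3, 2, 3, 5], [6, 2, 3, 3]]
# appended == [[8, 2, 6, 3], [6, 2, 7, 0]]
```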

How would I achieve this in an optimised way?

Thanks in advance.

  • Thanks for the *detailed* problem description, but it would be helpful to see your sample *input* data (minimal, of course) and desired outputs. – Daniel Hao Jul 11 '22 at 13:28
  • I think you need to use the [`set`](https://docs.python.org/3/library/stdtypes.html#set) class, as it has a difference operation between two sets, which is exactly what you want, and it does the loops at the C level, which is much faster and optimized. But the result comes back sorted ascendingly, not in the order of appending (equivalent to calling `sorted` on the list). – Ahmed AEK Jul 11 '22 at 13:37
  • @AhmedAEK Sets generally don't sort. – Kelly Bundy Jul 11 '22 at 13:46
  • The `globals` and `self` stuff aren't really needed for the question, are they? – Kelly Bundy Jul 11 '22 at 13:47
  • To be sure: Is the data really lists of ints? – Kelly Bundy Jul 11 '22 at 13:49
  • I think the main problem with using sets is that his elements are lists, which aren't hashable, so maybe change them to tuples instead of lists to make them hashable, making the whole process scale with `nlogn` instead of `n**2`. – Ahmed AEK Jul 11 '22 at 13:49
  • @AhmedAEK What do sets have to do with nlogn? – Kelly Bundy Jul 11 '22 at 13:50
  • I was under the impression that a set has `logn` complexity for searching; apparently it's better than `logn`, so the code will be faster than `nlogn`, but constructing the set would take some time. – Ahmed AEK Jul 11 '22 at 14:00
  • @AhmedAEK Yes, unless you're under attack, you can consider building it O(n) and membership tests O(1). – Kelly Bundy Jul 11 '22 at 14:03
  • @DanielHao I mean if you let me enter ints that you put in a set, I can (and likely will :-) give you ints that make your set creation take quadratic time. For strings, such attacks were deemed enough of a danger that Python now randomizes string hashing to prevent that (I think it's still possible, but it turned from easy to extremely hard). – Kelly Bundy Jul 11 '22 at 14:38
  • @DanielHao [Answer](https://stackoverflow.com/a/27522708/12671057) about attacks. – Kelly Bundy Jul 11 '22 at 14:47
  • Appreciate the great reference. ^^^ Have good reading now... – Daniel Hao Jul 11 '22 at 14:51

1 Answer


As I was discussing in the comments section, using sets will speed up your code, but your entries have to be hashable. If the input is in fact a list of lists, then converting it to a set of tuples will do the trick; if the entries are instances of a more complicated class, you must define a `__hash__` method for it.
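As an illustration of that last point, here is a minimal sketch of a hashable class (the `Building` name and its fields are just an assumption for the example, not from the question):

```python
class Building:
    """Hypothetical class standing in for a list entry."""
    def __init__(self, x, y, w, h):
        self.x, self.y, self.w, self.h = x, y, w, h

    def __eq__(self, other):
        # two buildings are equal when all their fields match
        return (self.x, self.y, self.w, self.h) == (other.x, other.y, other.w, other.h)

    def __hash__(self):
        # hash the same fields used for equality, so equal objects hash equally
        return hash((self.x, self.y, self.w, self.h))

# equal instances collapse to one element, so set difference works as expected
buildings = {Building(0, 0, 1, 1), Building(0, 0, 1, 1)}
```

Defining `__eq__` and `__hash__` together matters: objects that compare equal must have equal hashes, or set membership will misbehave.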

The code below demonstrates the use of sets for the example in the question.

from random import randint

# construct two large random lists of unique rows
large_list_size = 10000
small_list_size = 4

def make_unique_rows(count):
    rows = [[1, 1, 1, 1]]  # seed value forces the while loop on the first pass
    for _ in range(count):
        candidate = [1, 1, 1, 1]
        while candidate in rows:  # regenerate until the row is unique
            candidate = [randint(0, 10) for _ in range(small_list_size)]
        rows.append(candidate)
    return rows

first_list = make_unique_rows(large_list_size)
second_list = make_unique_rows(large_list_size)


# list implementation from the question
def updatelist(first, second):
    removed = []
    for building in first:
        if building not in second:  # O(n) scan of `second` per element
            removed.append(building)
    return removed

# set implementation
def updatelist_sets(first, second):
    first_set = {tuple(x) for x in first}
    second_set = {tuple(x) for x in second}
    removed = first_set.difference(second_set)
    return removed

# measure lists time
import time
retries = 10
t1 = time.time()
for i in range(retries):
    a = updatelist(first_list,second_list)
t2 = time.time()
time_lists = (t2-t1)/retries
print('using lists = ',time_lists)

# measure sets time
t1 = time.time()
for i in range(retries):
    b = updatelist_sets(first_list,second_list)
t2 = time.time()
time_sets = (t2-t1)/retries
print('using sets = ',time_sets)
print('speedup = ',time_lists/time_sets)

# convert b to be like a in structure
b = [list(x) for x in b]

# compare equality of both outputs
if sorted(a) == sorted(b):
    print('both are equal')

Running this, you get:

using lists =  1.0185745239257813
using sets =  0.004686903953552246
speedup =  217.32353255367963
both are equal

Needless to say, most of the time is spent constructing the sets of tuples, so if you change your original data structure to store hashable objects (e.g. tuples) in the first place, the code will be even faster.
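One possible sketch of that idea (the class name and layout are illustrative, not from the question) is to keep the ordered list alongside a mirror set, so membership tests never scan the list:

```python
class BuildingList:
    """Sketch: an ordered list plus a mirror set for O(1) membership tests.

    Assumes the entries are hashable (e.g. tuples instead of lists).
    """
    def __init__(self):
        self.items = []     # preserves append order
        self._seen = set()  # mirrors items for fast lookups

    def append(self, item):
        self.items.append(item)
        self._seen.add(item)

    def __contains__(self, item):
        # hits the set, not the list, so this is O(1) on average
        return item in self._seen

buildings = BuildingList()
buildings.append((3, 2, 3, 5))
buildings.append((0, 0, 1, 1))
```

Since the question says items are only ever appended, keeping the set in sync is a single `add` per update, with no per-update tuple conversion needed.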

Edit: fixed last comparison as @Kelly Bundy suggested.

Ahmed AEK