0

I need to compare 2 lists and remove elements that appear in the first array from the second one efficiently. My solutions below:

Homebrew algorithm:

#function accepts 2 arrays. Compares elements from 1st array to 2nd to see if they exist in 2nd, 
#and removes them from 2nd array. 
def compareRemove(self, array1 = [], array2 = []):

    #starting iteration values
    i = 0
    j = 0

    array1_length = len(array1)
    array2_length = len(array2)

    while i < array1_length:
        while j < array2_length:

            #compare element from first array to second array, remove element if same, 
            #and update second array length
            if array1[i] == array2[j]:

                del array2[j]
                array2_length = len(array2)

            #otherwise move j forward
            else:
                j = j + 1

        #when 2nd while loop is done move i forward and reset j to 0
        i = i + 1
        j = 0

    return array2

suggested from research on SO:

updated_file_strings = [x for x in array2 if not x in array1]

Both are similarly fast, however as my array lenghts increase both will slow down. Are there any methods I can utilise that will perform well even if array lengths increase? Order doesnt matter.

EXAMPLE:

array1 = [1, 2, 3] array2 = [1, 2, 3, 5, 5, 5]

result_array = [5, 5, 5]

duplicates within the second array should NOT be removed.

asleniovas
  • 193
  • 3
  • 21
  • Please create a minimal, verifiable example and your expected output so we can better help you – Juan C Jan 29 '20 at 19:54

1 Answers1

4

Convert to set, remove elements, then convert back to list.

s1 = set(array1)
s2 = set(array2)
array2 = list(s2.difference(s1))

Edit: To keep track of duplicates, you can use collections.Counter and reconstruct the list.

from collections import Counter

s1 = set(array1)
array2 = [x for x in array2 if x not in s1]
# d2 = Counter(array2)
# array2 = [z for k, v in d2.items() if k not in s1 for z in [k] * v]

EDIT2: I thought using Counter would be faster, but the secondary list construction in the comprehension seems to nullify any gains. You are better off just making the first set, then using that for existence checks.

Tests: Counter and double comprehension

%%timeit
array1 = [random.randint(0, 10000) for _ in range(200000)]
array2 = [random.randint(0, 20000) for _ in range(200000)]
s1 = set(array1)
d2 = Counter(array2)
[z for k, v in d2.items() if k not in s1 for z in [k]*v]

# returns:
525 ms ± 19.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Test: single comprehension with existence check

%%timeit
array1 = [random.randint(0, 10000) for _ in range(200000)]
array2 = [random.randint(0, 20000) for _ in range(200000)]
s1 = set(array1)
#d2 = Counter(array2)
[x for x in array1 if x not in s1]

# returns:
510 ms ± 17.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
James
  • 32,991
  • 4
  • 47
  • 70
  • I can't because set will remove any duplicates from array2. I need them to count frequencies. Which is why I compare to remove only same elements that exist in both arrays. The dupes that exist within array2 should not be removed. – asleniovas Jan 29 '20 at 19:51
  • See the update. – James Jan 29 '20 at 20:06
  • Thank you James, that helps a lot. Not sure why my question was closed its nowhere near a duplicate suggested by Daniel. Probably didn't even read the question... – asleniovas Jan 29 '20 at 20:10