
Can you think of a faster algorithm for this problem? Or improve the code?

Problem:

I have two customer IDs:

  • ID1 (e.g. phone number)
  • ID2 (e.g. email address)

A user sometimes changes their ID1 and sometimes their ID2. How can I find unique users?

Example:

ID1 = [7, 7, 8, 9]

ID2 = [a, b, b, c]

Desired result:

ID3 = [Anna, Anna, Anna, Paul]

(Rows 1 and 2 share phone 7, rows 2 and 3 share email b, so the first three rows all belong to the same user; row 4 shares neither ID with any other row and is therefore a second user.)

The real-world scenario has about 600,000 items per list.

There is already an SQL idea here: How can I match Employee IDs from 2 columns and group them into an array?

And I got help from a friend who had this idea in TypeScript: https://stackblitz.com/edit/typescript-leet-rewkmh?file=index.ts

A second friend of mine helped me with some pseudo-code, and I was able to create this:

Fastest code so far (no longer giving the correct result):

ID1 = [7, 7, 8, 9]
ID2 = ["a", "b", "b", "c"]

def timeit_function(ID1, ID2):
    
    def find_user_addresses():
        # "snake" through ID1/ID2 starting from ID1[0]: collect every phone and email
        # that is directly or transitively connected to it, deleting matched rows as we go
        phone_i = []
        email_i = []
        
        tmp1 = [ID1[0]]
        tmp2 = []
        tmp_index = []

        while len(tmp1) != 0 or len(tmp2) != 0:
            while len(tmp1) != 0:
                # collect every row whose phone equals the phone currently being processed
                tmp_index = []
                for index, value in enumerate(ID1):
                    if value == tmp1[0]:
                        tmp2.append(ID2[index])
                        tmp_index.insert(0, index)  # prepend so indices end up in descending order

                # delete matched rows; descending order keeps the remaining indices valid
                for i in tmp_index:
                    del ID1[i]
                    del ID2[i]
                tmp1 = list(dict.fromkeys(tmp1))  # drop duplicate phones
                phone_i.append(tmp1.pop(0))       # this phone is done

            while len(tmp2) != 0:
                # collect every row whose email equals the email currently being processed
                tmp_index = []
                for index, value in enumerate(ID2):
                    if value == tmp2[0]:
                        tmp1.append(ID1[index])
                        tmp_index.insert(0, index)  # descending order for safe deletion

                for i in tmp_index:
                    del ID1[i]
                    del ID2[i]
                tmp2 = list(dict.fromkeys(tmp2))  # drop duplicate emails
                email_i.append(tmp2.pop(0))       # this email is done

        return phone_i, email_i
    
    users = {}
    i = 0
    while len(ID1) != 0:
        # peel off one user (one connected group of rows) at a time until no rows remain
        phone_i, email_i = find_user_addresses()
        users[i] = [phone_i, email_i]
        i += 1
    return users

Output:

{0: [[7, 8], ['a', 'b']], 1: [[9], ['c']]}

Meaning: {User_0: [[phone1, phone2], [email1, email2]], User_1: [[phone3], [email3]]}
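
For reference, a small sketch (not part of the code above) of how the users dict can be mapped back to per-row labels, i.e. the desired ID3 shape. Note that timeit_function() empties ID1 and ID2 while it runs, so untouched copies of the original lists are needed:

ID1 = [7, 7, 8, 9]
ID2 = ["a", "b", "b", "c"]
users = {0: [[7, 8], ["a", "b"]], 1: [[9], ["c"]]}

# every phone belongs to exactly one user group, so a phone -> user lookup is enough
phone_to_user = {phone: u for u, (phones, emails) in users.items() for phone in phones}
ID3 = [phone_to_user[phone] for phone in ID1]
print(ID3)  # [0, 0, 0, 1], i.e. user 0 three times ("Anna") and user 1 once ("Paul")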

Ranking

| Rank | Username | %timeit | Unique users | Correct output? |
|------|----------|---------|--------------|-----------------|
| 1. | Zachary Vance | 32 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) | 1408 | yes |
| 2. | igrinis | 5.54 s ± 81.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) | 1408 | yes |
| (3.) | dkapitan | 8 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) | 3606 | no |
| (4.) | thenarfer | 2.34 µs ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 1 loop each) | 1494 | no |

The code was run with the two lists here (takes some time to load).

thenarfer
  • I added the rankings as they look on my computer. If there are no objections, I will award the first bounty to @Zachary Vance. Thanks for your time and contribution! I might come back with another bounty on this in the future! – thenarfer Aug 04 '21 at 22:56

3 Answers


The idea is very simple:

  • for each entry
    • scan all existing users
      • if either ID of the entry matches an existing user, extend that user's attribute set and add the user to the merge list
    • if the entry did not match any existing user, create a new user
    • otherwise, if it matched more than one distinct user, merge those users into one

This should be pretty fast: it scans the entry list only once (each entry is checked against the current set of users) and does not require recursion.

ID1 = [7, 7, 8, 9]
ID2 = ["a", "b", "b", "c"]
user = {}
new_user_idx = 0
for i in range(len(ID1)):
    merge = []   # this is a list of users that should be merged
    for k in user:
        # find if any feature is already found in previous user  
        if ID1[i] in user[k][0]:
            user[k][1].add(ID2[i])
            merge.append(k)
        if ID2[i] in user[k][1]:
            user[k][0].add(ID1[i])
            merge.append(k)
    if not merge:
        # we have to create a new user
        user[new_user_idx] = (set([ID1[i]]), set([ID2[i]]))
        new_user_idx += 1
    elif len(merge)>1:
        # merging existing users
        for el in set(merge[1:]):
            if el==merge[0]: continue  # skip unnecessary merge
            user[merge[0]][0].update(user[el][0]) # copy attributes
            user[merge[0]][1].update(user[el][1])
            user.pop(el) # delete merged user
print(user)   


{0: ({7, 8}, {'a', 'b'}), 1: ({9}, {'c'})}
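
The number of unique users (the "Unique users" column in the question's ranking) is then simply the number of entries left in the dictionary:

print(len(user))  # 2 for the toy example above; 1408 on the gist data, according to the ranking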
 
igrinis
  • Thanks a lot for your code @igrinis! If it's the fastest submitted answer, I will reward you the bounty. I got `6.28 µs ± 81 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)` – thenarfer Jul 29 '21 at 22:16
  • I tried both our codes with `ID1 = ['ccedf', 'dabdd', ..., n=1000] ID2 = [53412, 52266, ..., n=1000]`. Where your code finds 1408 users, mine finds 1494. https://gist.github.com/thenarfer/e4488e03965f61fc5bb139f8b7995186 (takes some time to load) – thenarfer Jul 30 '21 at 00:04

Trying it with sets:

%%timeit
ID1 = [7, 7, 8, 9]
ID2 = ["a", "b", "b", "c"]

users = {0: {'phone': set([ID1[0]]), 'email': set([ID2[0]])}}

for index, id1 in enumerate(ID1):
    id2 = ID2[index]
    
    # iterate over found list of users
    i = 0
    n_users = max(users.keys()) 
    while i <= n_users:
        if any([(id1 in users[i]['phone']), (id2 in users[i]['email'])]):
            users[i]['phone'].add(id1)
            users[i]['email'].add(id2)
            break
        i += 1

    # add new user if not found
    if i > n_users:
        users[i] = {'phone': set([id1]), 'email': set([id2])}
dkapitan
  • `4.68 µs ± 43.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)`, which is a bit disappointing. I thought sets would be quicker than iterating through lists. – dkapitan Aug 02 '21 at 03:41
  • Iteration in Python should be O(m), where m is the length of the iterator. There is a small amount of memory overhead in instantiating a set vs. a list, but it looks like you're iterating through the whole set (with an intersection op). – Yaakov Bressler Aug 02 '21 at 03:46
  • Just updated with comparing membership in sets @YaakovBressler. Still getting similar performance. This post compares sets vs lists and also states checking for membership should be faster with sets: https://stackoverflow.com/questions/2831212/python-sets-vs-lists – dkapitan Aug 02 '21 at 03:59
  • Thanks a lot for your answer! I ran the code with the lists in gist.github.com/thenarfer/e4488e03965f61fc5bb139f8b7995186 (takes some time to load), and 3606 unique users came up (my code finds 1494, and @igrinis finds 1408). I have no idea who is closer to the answer. – thenarfer Aug 02 '21 at 11:34
  • I imagine performance between data structure & operation will vary depending on the use case. But I agree with you – sets should be slightly quicker for this type of operation. Clever solution btw. – Yaakov Bressler Aug 02 '21 at 12:13
  • @thenarfer will have a look and see if I can fix my code to give the right answer (assuming it's 1408). – dkapitan Aug 03 '21 at 05:13
  • @thenarfer: can't figure out what's wrong in my code. I have run it on the longer list and checked for duplicates using `set([frozenset(user['email']) for user in users.values()])`. There are no duplicates, so my result indeed is 3606 unique combinations. – dkapitan Aug 03 '21 at 07:03
  • Ran `%%timeit` with the full list, got `4.9 s ± 56.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` – dkapitan Aug 03 '21 at 07:07
  • @dkapitan I highly appreciate your time and efforts and I hope we can figure it out. I will let you know if I figure out either one of our codes. I assume still, that 1408 is correct, because I have tested my own code a bit more, and can "manually" observe that my code does not "snake" enough through the lists to find all ways to group (this is for my code). – thenarfer Aug 03 '21 at 11:58
  • @thenarfer I think I have found why there is a difference. It has to do with what you do when a new record's ID1 and ID2 both already exist in different users. For example, at index 8909 we have ("abdeb", 44361). At this point, I have already found the users `(1835, {'email': {'abdeb'}, 'phone': {53135, 55234, 64522}})` and `(2839, {'email': {'fafac'}, 'phone': {44361}})`. I hadn't merged these, like @igrinis has done. – dkapitan Aug 03 '21 at 21:42
  • My solution essentially becomes @igrinis' solution (also using sets) if I add this logic, so I expect there isn't any significant performance difference between our solutions. – dkapitan Aug 03 '21 at 21:58
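
For illustration, a minimal sketch of that missing merge step, reusing dkapitan's {'phone': set, 'email': set} structure (this essentially reproduces igrinis' answer, as noted in the comment above, and is not dkapitan's original code):

ID1 = [7, 7, 8, 9]
ID2 = ["a", "b", "b", "c"]

users = {}
next_idx = 0
for id1, id2 in zip(ID1, ID2):
    # every existing user that shares either the phone or the email of this record
    matches = [k for k, u in users.items() if id1 in u['phone'] or id2 in u['email']]
    if not matches:
        users[next_idx] = {'phone': {id1}, 'email': {id2}}
        next_idx += 1
        continue
    keep = matches[0]
    users[keep]['phone'].add(id1)
    users[keep]['email'].add(id2)
    # if the record bridged several users, merge them all into the first one
    for other in matches[1:]:
        users[keep]['phone'].update(users[other]['phone'])
        users[keep]['email'].update(users[other]['email'])
        del users[other]

print(users)  # {0: {'phone': {7, 8}, 'email': {'a', 'b'}}, 1: {'phone': {9}, 'email': {'c'}}}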

Don't run timeit with a 5-element list. This is not a valid method to evaluate the contenders. Use bigger lists (1000+) to test performance, or you will not actually end up with the fast program you want.
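
For example, a quick way to build a larger synthetic test case, similar in spirit to the gist linked in the comments (the list size, string length and numeric ranges below are made up for illustration, not the asker's real data):

import random
import string

N = 10_000
ID1 = [''.join(random.choices(string.ascii_lowercase, k=5)) for _ in range(N)]
ID2 = [random.randint(10_000, 99_999) for _ in range(N)]

# then benchmark a candidate, e.g. in IPython/Jupyter:
#   %timeit find_users(list(ID1), list(ID2))   # pass copies if the candidate mutates its inputs

(find_users is just a placeholder for whichever solution is being timed.)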

I'll throw my hat in the ring, with an O(N) worst-case algorithm (doesn't depend on the number of users).

I'll point out that the performance of the other algorithms depends a lot on the number of IDs a typical user has. This one is specialized to work well for a large number of IDs per user.

ID1 = [7, 7, 8, 9]
ID2 = ["a", "b", "b", "c"]

from collections import defaultdict

id1_to_id2 = defaultdict(set)
id2_to_id1 = defaultdict(set)
for id1, id2 in zip(ID1, ID2):
    id1_to_id2[id1].add(id2)
    id2_to_id1[id2].add(id1)

id1_to_user = {}
users = {}
for id1 in id1_to_id2:
    if id1 in id1_to_user:
        continue # Already processed

    # Find all id1 and id2 for this user, using 'floodfill'
    id1s = {id1}
    id2s = set()
    id1_queue = [id1]
    while len(id1_queue) > 0:
        old_id2s = id2s.copy()
        for id1 in id1_queue:
            id2s.update(id1_to_id2[id1])
        # possibly faster alternative:
        #   id2_queue = set().union(*(id1_to_id2[id1] for id1 in id1_queue)) - id2s
        #   id2s |= id2_queue
        id2_queue = id2s - old_id2s
        id1_queue = []
        old_id1s = id1s.copy()
        for id2 in id2_queue:
            id1s.update(id2_to_id1[id2])
        id2_queue = []
        id1_queue = id1s - old_id1s
    user = len(users)
    users[user] = [id1s, id2s]
    for id1 in id1s:
        id1_to_user[id1] = user

print(users)
Zachary Vance
  • Running the gist in gist.github.com/thenarfer/e4488e03965f61fc5bb139f8b7995186, I get 1408 users (same as igrinis) and this runs in about 0.1 s clock time for me (50x faster than the run given, which I think is igrinis, and I think 500x slower than dkapitan?). – Zachary Vance Aug 02 '21 at 21:38
  • That's great! Thanks a lot for the time and contribution! – thenarfer Aug 02 '21 at 21:39
  • It seems that 1408 users is the correct answer. Running this on the GitHub lists (len=10 000) I get a time of `32.6 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)`. – thenarfer Aug 02 '21 at 21:54