
I have a database table with two columns, "a" and "b".

Each row contains two identifiers, and a row means that those two identifiers are linked together. The goal is to find an algorithm that assigns a number to each group of linked identifiers.

For example, starting from this table:

a b
1 2
0 3
4 5
2 4
6 7
3 8
9 10
5 1
8 0
7 6
10 9

We want to have:

a b Group
1 2 1
0 3 2
4 5 1
2 4 1
6 7 3
3 8 2
9 10 4
5 1 1
8 0 2
7 6 3
10 9 4

Because 1 is linked to 2, 2 is linked to 4, 4 is linked to 5, and 5 is linked to 1, the identifiers 1, 2, 4 and 5 all belong to the same group.

Any idea for an algorithm in Spark (ideally PySpark) or SQL? The table contains more than 180 million rows and the execution time must be as short as possible.
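To make this easier to reproduce, here is the sample data built as a PySpark DataFrame (just a sketch; in my real job a SparkSession already exists, it is created below only for completeness, and the app name is arbitrary):

from pyspark.sql import SparkSession

# Assumption: no SparkSession is available yet; normally one already exists
spark = SparkSession.builder.appName("group-linked-ids").getOrCreate()

# The sample pairs from the table above; each row links identifier a to identifier b
edges = spark.createDataFrame(
    [(1, 2), (0, 3), (4, 5), (2, 4), (6, 7), (3, 8),
     (9, 10), (5, 1), (8, 0), (7, 6), (10, 9)],
    ["a", "b"],
)
edges.show()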

I have already tried some algorithms in Python, but the execution time is far too long...

Update:

@MrSmith42 Thank you for your help. Here is my Python code:

# Dict with the result: maps each identifier (from either column) to a group id
result_dict = {}
group = 1

# Loop over the rows of the table
for i in range(len(data[0])):
    if data[0][i] in result_dict:
        # Important case: both ids are already known but were assigned to
        # different groups in previous iterations, so the two groups must be merged.
        # In the example, this handles the pair (4, 5) that appears before (2, 4).
        if (data[1][i] in result_dict) and (result_dict[data[0][i]] != result_dict[data[1][i]]):
            old_group = result_dict[data[1][i]]
            new_group = result_dict[data[0][i]]
            # Relabel every id that belongs to the old group
            for key, value in result_dict.items():
                if value == old_group:
                    result_dict[key] = new_group
        else:
            result_dict[data[1][i]] = result_dict[data[0][i]]
    elif data[1][i] in result_dict:
        result_dict[data[0][i]] = result_dict[data[1][i]]
    else:
        # Neither id has been seen yet: start a new group
        result_dict[data[0][i]] = group
        result_dict[data[1][i]] = group
        group += 1

# Fill the group column in the array
for i in range(len(data[0])):
    data[2][i] = result_dict[data[1][i]]

As you can see, there are some loops (which may be unnecessary), and with more than 180 million rows (stored on HDFS), the execution time is too long :(
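To get rid of the dictionary-relabeling loop above, I am also experimenting with a classic union-find (disjoint-set) structure with path compression and union by size. This is only a single-machine sketch in plain Python (the helper names `find` and `union` are mine), so it does not yet answer the Spark part of the question:

# Union-find (disjoint-set) sketch with path compression and union by size.
# parent maps each identifier to its representative; size tracks tree sizes.
parent = {}
size = {}

def find(x):
    # Create a new singleton set the first time we see x
    if x not in parent:
        parent[x] = x
        size[x] = 1
    # Find the root of x
    root = x
    while parent[root] != root:
        root = parent[root]
    # Path compression: point every node on the path directly at the root
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

def union(x, y):
    root_x, root_y = find(x), find(y)
    if root_x == root_y:
        return
    # Union by size: attach the smaller tree under the larger one
    if size[root_x] < size[root_y]:
        root_x, root_y = root_y, root_x
    parent[root_y] = root_x
    size[root_x] += size[root_y]

# Link every pair (a, b) from the two columns
for a, b in zip(data[0], data[1]):
    union(a, b)

# Assign a dense group number (1, 2, 3, ...) per connected component,
# in order of first appearance, and fill the group column
group_of_root = {}
for i in range(len(data[0])):
    root = find(data[1][i])
    if root not in group_of_root:
        group_of_root[root] = len(group_of_root) + 1
    data[2][i] = group_of_root[root]

On the sample table this reproduces the groups 1 to 4 shown above and avoids rescanning the whole dictionary on every merge, but it still runs on a single machine, so for 180+ million rows I would prefer something distributed in Spark or SQL.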

  • Please share what you have tried ('I have already tried some algorithms in Python'); maybe we can point out some improvements. – MrSmith42 Oct 05 '21 at 07:56
  • This is finding **connected components** of a graph, i.e. subgraphs where each element is connected to the others through a path. Check out the answer to [Disjoint sets on apache spark](https://stackoverflow.com/questions/37297149/disjoint-sets-on-apache-spark) – DarrylG Oct 05 '21 at 08:13
  • Thank you very much @DarrylG, I'll look into it – Jed Oct 05 '21 at 09:23

0 Answers