I have a database with two columns "a" and "b".
Each row contains two identifiers, which means those two identifiers are linked together. The goal is to find an algorithm that assigns a group number to each group of linked identifiers.
For example, starting from this table:
a | b |
---|---|
1 | 2 |
0 | 3 |
4 | 5 |
2 | 4 |
6 | 7 |
3 | 8 |
9 | 10 |
5 | 1 |
8 | 0 |
7 | 6 |
10 | 9 |
We want to have:
a | b | Group |
---|---|---|
1 | 2 | 1 |
0 | 3 | 2 |
4 | 5 | 1 |
2 | 4 | 1 |
6 | 7 | 3 |
3 | 8 | 2 |
9 | 10 | 4 |
5 | 1 | 1 |
8 | 0 | 2 |
7 | 6 | 3 |
10 | 9 | 4 |
Because 1 is linked to 2, 2 is linked to 4, 4 is linked to 5, and 5 is linked to 1, the identifiers 1, 2, 4 and 5 are all in the same group.
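What I am describing is, I think, the connected components problem, where every row of the table is an edge between two identifiers. To make the grouping concrete, here is a minimal union-find sketch in plain Python (the rows are hard-coded from the example above; it is only an illustration of the expected result, not a solution at my scale):

    # Rows taken from the example table above; each row links two identifiers.
    edges = [(1, 2), (0, 3), (4, 5), (2, 4), (6, 7),
             (3, 8), (9, 10), (5, 1), (8, 0), (7, 6), (10, 9)]

    parent = {}

    def find(x):
        # Return the representative of x's set, compressing the path on the way up.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        # Merge the sets containing x and y.
        parent[find(x)] = find(y)

    for a, b in edges:
        union(a, b)

    # One group number per representative, in order of first appearance.
    group_of_root = {}
    for a, b in edges:
        root = find(a)
        group_of_root.setdefault(root, len(group_of_root) + 1)
        print(a, b, group_of_root[root])

The exact group labels do not matter to me, only which identifiers end up grouped together.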
Any idea for an algorithm in Spark (ideally PySpark) or SQL? The database contains more than 180 million rows and the execution time must be as short as possible...
I have already tried some algorithms in Python, but the execution time is much too long...
Update:
@MrSmith42 Thank you for your help. My code in Python:

    # Dict with the result: maps each id -> a group id
    result_dict = {}
    group = 1
    # Loop over the rows of the table
    for i in range(len(data[0])):
        if data[0][i] in result_dict:
            # Important condition to change the group id of all ids already linked in a previous iteration
            # In our example, it handles the couple (4,5) that appears before the couple (2,4)
            if (data[1][i] in result_dict) and (result_dict.get(data[0][i]) != result_dict.get(data[1][i])):
                temp = result_dict.get(data[1][i])
                # Relabel every id currently in the group of data[1][i]
                for a, b in result_dict.items():
                    if b == temp:
                        result_dict[a] = result_dict.get(data[0][i])
            else:
                result_dict[data[1][i]] = result_dict.get(data[0][i])
        elif data[0][i] not in result_dict and data[1][i] in result_dict:
            result_dict[data[0][i]] = result_dict.get(data[1][i])
        else:
            result_dict[data[0][i]] = group
            result_dict[data[1][i]] = group
            group = group + 1

    # Fill the group column of the array
    for i in range(len(data[0])):
        data[2][i] = result_dict.get(data[1][i])

As you can see, there are some loops (maybe unnecessary ones), and with more than 180 million rows (hosted on HDFS) the execution time is far too long :(
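For completeness, this is how I run the code above on the small example. Here data is simply a list of three Python lists (column a, column b, and a placeholder for the group column), which matches how the snippet indexes it; the real data lives in HDFS, so this layout is only for local testing:

    # Column 'a', column 'b', and a placeholder for the group column
    data = [
        [1, 0, 4, 2, 6, 3, 9, 5, 8, 7, 10],
        [2, 3, 5, 4, 7, 8, 10, 1, 0, 6, 9],
        [None] * 11,
    ]

    # ... run the snippet above on data, then print the result:
    for a, b, g in zip(data[0], data[1], data[2]):
        print(a, b, g)

Note that the group numbers produced this way are not necessarily the same as in the expected table (on this example group 3 disappears because it gets merged into group 1), but the grouping of the identifiers is the same.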