
The following table shows a brief snapshot of the data I wish to manipulate. I am looking for an awk script that will group rows that share common elements into one group. For example, looking at the table below:

  1. Numbers 1, 2, 3, 4 and 6 are linked through shared rows, so row1, row2, row4 and row8 should all belong to one group, say group "1".
  2. Number 9 is unique and shares no element with any other row, so it resides alone in a separate group, say group 2.
  3. Similarly, numbers 5 and 7 reside together in one group, say group 3, and so on...

The file:

heading1        heading2         numberlist     group
name1           text             1,2,3          1
name2           text             2              1
name3           text             9              2
name4           text             1,4            1
name5           text             5,7            3
name6           text             7              3
name7           text             8              4
name8           text             6,2            1

While searching for questions similar to mine I found this link: Grouping lists by common elements. But the solution there is in C++, not awk, which is my primary requirement.

Incidentally, I also found an awk solution that is somewhat related to my query, but it does not handle comma-separated values: awk script grouping with array.

The numberlist, i.e. $3, is the only field considered for grouping.


1 Answer


This problem seemed almost the same as one of my own problems, and I had used one column of your example to solve mine :) So...

[[bash_prompt$]]$ cat log ; echo "########"; \
> cat test.sh ;echo "########";  awk -f test.sh log
heading1        heading2         numberlist     group
name1           text             1,2,3
name2           text             2
name3           text             9
name4           text             1,4
name5           text             5,7
name6           text             7
name7           text             8
name8           text             6,2
########
/^name/{
  i=0; j=0;                      # i: flag, set when an existing group matches
  split($3,a,",");               # break the numberlist into array a
  for(var in a) {                # for every number in this row...
    for(var1 in q) {             # ...scan every group collected so far
      split(q[var1],r,",");      # break that group's list into array r
      for(var2 in r) {
        if(r[var2] == a[var]) {  # the number is already in this group
          i=1;
          j=((var1+1));          # q is indexed from 0, groups from 1
        }
      }
    }
  }
  if(i == 0) {                   # no existing group matched:
    q[length(q)] = $3;           # store the numberlist as a new group
    j=length(q);                 # its new length is the group id (length(arr) is a gawk extension)
  }
  print $1 "\t\t" $2 "   \t\t" $3 "\t\t" j;
}
########
name1           text            1,2,3           1
name2           text            2               1
name3           text            9               2
name4           text            1,4             1
name5           text            5,7             3
name6           text            7               3
name7           text            8               4
name8           text            6,2             1
[[bash_prompt$]]$

Update:

split breaks its first argument at the delimiter passed as the third argument and stores the pieces in the array named by the second argument.

The main array here is q, which holds the members of each group. It is essentially an array of comma-separated lists, where the index of an element is the group id and the element itself is the collection of all members of that group; so q[0]="1,2,3" means the 0th group contains the members 1, 2 and 3.

Now, awk reads each line that starts with name (/^name/). The 3rd field (e.g. 1,2,3) is broken down into an array a. For each element of a, we walk through every group stored in q (for(var1 in q)), and split each group into a temporary array r (split(q[var1],r,",")), i.e. "1,2,3" becomes the array r. Each element of r is then compared with the element of a. If a match is found, that group's index determines the row's group id; q is indexed from 0 but groups are numbered from 1, hence the ((var1+1)). If no match is found at all, the numberlist is simply added to q as a new group, and the last index + 1, i.e. the length of the array, becomes the group id for the row.
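
As a quick aside on split (a standalone illustration, not part of the script above): it returns the number of pieces and fills the target array starting at index 1:

[[bash_prompt$]]$ awk 'BEGIN { n = split("1,2,3", a, ","); for (i = 1; i <= n; i++) print i, a[i] }'
1 1
2 2
3 3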

Update:

/^name/{
  j=0;
  split($3,a,",");            # break the numberlist into array a
  for(var in a) {
    if(q[a[var]] != 0) {      # this element already has a group index
      j=q[a[var]]; i=1;       # take it as the row's group and stop
      break;
    }
  }
  j = (j == 0) ? ++k : j;     # no match: hand out the next group index
  for(var1 in a) {
    if(q[a[var1]] == 0) {     # tag every still-ungrouped element
      q[a[var1]] = j;
    }
  }
  print $1 "\t\t" $2 "   \t\t" $3 "\t\t" j;
}
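
Saved as test.sh and invoked exactly like the first version, this variant produces the same grouping on the sample file:

[[bash_prompt$]]$ awk -f test.sh log
name1           text            1,2,3           1
name2           text            2               1
name3           text            9               2
name4           text            1,4             1
name5           text            5,7             3
name6           text            7               3
name7           text            8               4
name8           text            6,2             1
[[bash_prompt$]]$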

Update:

The basis is that awk has associative arrays, where each element is accessed by a string key. The earlier approach stored each group in an array whose key was the group's index, so while reading a column we had to read every group, split it into individual elements, and compare each of those with each element of the column.

Instead of storing whole groups, we now store the elements themselves in an array where the key is the element and the value at that key is the index of the group the element belongs to. When we read a column, we split it into individual elements (split($3,a,",");) and check whether the array already holds a group index under that element's key with if(q[a[var]] != 0) (in awk, merely referencing a missing key creates an element whose value compares equal to 0, hence the q[a[var]] != 0 test). If any element is found, we take its group index as the index of the column and break; otherwise j remains 0, and ++k then hands out the next unused group index.

Having found the group index for the column, we need to carry it to those elements that are not yet part of any group. (There will be cases where elements of the same column belong to different groups; here we take a first-come, first-served approach and do not overwrite the group index of elements already belonging to another group.) So for each element of the column (for(var1 in a)), if it does not yet belong to a group (if(q[a[var1]] == 0)), we give it this group index with q[a[var1]] = j;.

All accesses are now direct, because each element is itself the key; we no longer break a group apart again and again for every element, hence the much shorter runtime. My first approach was based on one of my own problems (the one I mentioned in the first line), which involved more complex processing on a shorter data set; this one needed a simpler, more straightforward logic.
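
To see the direct-key lookup in isolation, here is a small self-contained example; the pre-seeded assignments mirror the sample data (elements 1, 2, 3 already in group 1) with the row 6,2 arriving afterwards:

[[bash_prompt$]]$ awk 'BEGIN {
>   q["1"] = 1; q["2"] = 1; q["3"] = 1;   # elements of the first row, already in group 1
>   split("6,2", a, ",");                 # the numberlist of a later row
>   for (var in a)
>     if (q[a[var]] != 0) { print a[var] " -> group " q[a[var]]; break }
> }'
2 -> group 1

Element 6 has no entry yet, so the lookup falls through to 2, whose stored index 1 becomes the row's group.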

  • Thanks, abasu. The solution worked like a charm. This was only a small snapshot of the data I am going to process; the actual data will be close to 20 million entries. I'm hoping it will withstand the pressure. I will do the testing and let you know. Meanwhile, can you document your code? I found the inner loop a little tricky to understand. – user676500 May 06 '13 at 06:59
  • Updated; if it works, please mark it as accepted, and if not, let us know. If you are running a big data set, use `break` once you have found a match to reduce the runtime. – abasu May 06 '13 at 12:39
  • 21 million entries of test data got processed in under 7 mins flat!! Huge achievement. – user676500 May 07 '13 at 06:57
  • Well well... my grouping parameters are not single-digit numbers but have a minimum of 7-8 digits, almost like phone numbers. This is slowing it down: 40 mins now and only 60k of 3 million entries processed! – user676500 May 07 '13 at 10:33
  • Well, you are trying to cook too much :) Have you tried "if a big data set is running, use `break`"?? Or, if you have understood your data pattern, you can figure out how it can be exploited; just switching the outer and inner loops can make a huge difference if the data pattern supports it. – abasu May 07 '13 at 11:04
  • Yes, I used the break. That was a natural thing to do. You are right, further enhancements have to be based on the pattern, and I am already working on that. I'm also hoping for a better logic if there is any. Anyway, you have provided me a start and I will take it from there. Grateful for that. – user676500 May 07 '13 at 12:12
  • I was thinking about your problem: here, if any element in the 3rd column has a group index, that is the row's index. So it boils down to `random element -> does it have an index? yes -> row index, no -> assign the next index`, and to verify the `random element` factor we are breaking down a linear array again and again. Perhaps moving to Perl and using a hash is a better idea. – abasu May 07 '13 at 12:59
  • I updated the answer with a slight modification to reduce the number of iterations. Can you try it on your data and let me know if it is faster? – abasu May 08 '13 at 10:59
  • Basu, this is lightning fast. It completed in under a minute. Bloody awesome, I can take this solution any day. Now, last but not least, could you explain what you did here? Thanks a ton for the help. – user676500 May 09 '13 at 10:18
  • I think I jumped the gun. It has grouped unusually fast but has ended up grouping dissimilar elements. At least I have found one group with around 200,000 elements, which is impossible with this data. I am scanning further and will update. – user676500 May 09 '13 at 10:57
  • Yes, I think there is a fundamental difference between the two scripts. Suppose a group is `1,2,3`; this gets a group index of 1. Now `3,4` comes, so this also gets an index of 1, because 3 is present in group 1. But is 4 also a part of group 1 or not? I mean, if `4,5` comes next, is it 1 or is it 2? – abasu May 09 '13 at 11:32
  • Sorry, I forgot to mention: you will see `i=1` in the inner loop, which is actually not used anywhere. So if you want the first script's output, just put the for loop after `j = (j == 0) ? ++k : j;` inside an `if(i != 1)`. Then if `1,2,3` is group 1, `3,4,5` will also be group 1, but `4,5,6` will be group 2, i.e. the group indexes will not be carried forward by the group members. – abasu May 09 '13 at 12:44
  • 4,5,6 should also be part of group 1 and not group 2 (a fully transitive variant is sketched below, after this thread). It was my mistake: my test file was using the $3 column and my main file was using $2; that was the confusion. I got it corrected and it's working fine now. I think I am good to go with this solution, thanks for the effort. If you can explain the steps, that will be great, though I have already started putting in the effort to understand the logic. – user676500 May 09 '13 at 14:47
  • I updated with a huge explanation :) I'm happy it started working :) – abasu May 09 '13 at 15:47
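
The carry-forward behaviour debated above means the element-keyed script never merges two existing groups: if 5,7 is already group 3 and 7,2 arrives later, the row gets one of the two indexes, but groups 1 and 3 stay separate. Where fully transitive grouping is required, as in the question's first example, the same element-keyed idea can be combined with union-find. The following is only a sketch of that combination, not part of the accepted answer; the function find, the arrays parent, elem, line and gid, and the file name group.awk are all invented here, and rows are buffered until END because group ids are only final once every merge has been seen:

# group.awk -- fully transitive grouping via union-find (illustrative sketch)

# find(x): follow parent links to the root of x's set, compressing the path
function find(x,   r, t) {
  r = x
  while (parent[r] != r) r = parent[r]
  while (parent[x] != r) { t = parent[x]; parent[x] = r; x = t }
  return r
}

/^name/ {
  n = split($3, a, ",")
  for (i = 1; i <= n; i++)
    if (!(a[i] in parent)) parent[a[i]] = a[i]  # unseen element: its own root
  for (i = 2; i <= n; i++)
    parent[find(a[i])] = find(a[1])             # merge all elements of this row
  m++; elem[m] = a[1]                           # remember one element per row
  line[m] = $1 "\t\t" $2 "   \t\t" $3
}

END {                                  # ids are final only after all merges
  for (i = 1; i <= m; i++) {
    root = find(elem[i])
    if (!(root in gid)) gid[root] = ++k
    print line[i] "\t\t" gid[root]
  }
}

Run as awk -f group.awk log, this prints groups 1, 1, 2, 1, 3, 3, 4 and 1 for the sample file, i.e. exactly the grouping the question asks for, at the cost of holding all rows in memory until END.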