Uniq but only on part of the string

Question

I have strings such as:

import a.b.c.d.f.Class1
import a.b.g.d.f.Class2
import a.b.h.d.f.Class3
import z.y.x.d.f.Class4
import z.y.x.d.f.Class5
import z.y.x.d.f.Class6

I want to get all unique occurrences of the first part of each string, more specifically everything up to the third period. So I do:

grep "import curam" -hr --include \*.java | sort | gawk -F "." '{print $1"."$2"."$3}' | uniq

which gives me:

  import a.b.c
  import a.b.g
  import a.b.h
  import z.y.x

However, I'd like to get the full string for the first occurrence of each unique prefix (the part up to the third period). So, I want to get:

import a.b.c.d.f.Class1
import a.b.g.d.f.Class2
import a.b.h.d.f.Class3
import z.y.x.d.f.Class4

Any ideas?

score 3 · Accepted Answer · edited May 23 '17 at 11:58

Just keep track of the unique 2nd field:

awk -F '[ .]' '!uniq[$2]++' file

That is, start by setting the field separators to either a space or a dot. This way, the second field is always the first word in the dot-separated name:

$ awk -F '[ .]' '{print $2}' file
a
a
a
z
z
z

Then, just check when they appear for the first time:

$ awk -F '[ .]' '!uniq[$2]++' file
import a.b.c.d.f.Class1
import z.y.x.d.f.Class4
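
For anyone unfamiliar with the `!uniq[$2]++` idiom, here is a rough long-hand sketch of what it does. The temporary file and the `first_seen` variable are only there to make the example self-contained; they are not part of the original command.

```shell
# Write the sample input from the question to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
import a.b.c.d.f.Class1
import a.b.g.d.f.Class2
import a.b.h.d.f.Class3
import z.y.x.d.f.Class4
import z.y.x.d.f.Class5
import z.y.x.d.f.Class6
EOF

# Long-hand equivalent of: awk -F '[ .]' '!uniq[$2]++'
# An array element that was never set compares equal to 0, so the
# line is printed only the first time its key ($2) shows up.
first_seen=$(awk -F '[ .]' '{
    if (uniq[$2] == 0) print   # key not seen before: print the whole line
    uniq[$2]++                 # count this occurrence of the key
}' "$input")

printf '%s\n' "$first_seen"
rm -f "$input"
```

On the sample data this prints the Class1 and Class4 lines, the same as the one-liner.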

You commented: "There are some subtle variations on the first three tokens between the String so I need to do just [.] Can't do space. I updated the question."

So if you have:

import a.b.c.d.f.Class1
import a.b.g.d.f.Class2
import a.b.h.d.f.Class3
import z.y.x.d.f.Class4
import z.y.x.d.f.Class5
import z.y.x.d.f.Class6

Then you need to split the second space-separated field on dots and check when its first three slices are repeated. This uses the same approach as above, except that split() breaks the field into an array and the first three elements form the uniqueness key:

$ awk '{split($2, a, ".")} !uniq[a[1] a[2] a[3]]++' file
import a.b.c.d.f.Class1
import a.b.g.d.f.Class2
import a.b.h.d.f.Class3
import z.y.x.d.f.Class4
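
A possible variation, if you really want the dot as the only separator (as in the gawk command from the question): set `-F.` and use the first three dot-separated fields as the key. This is a sketch, not a drop-in replacement. The array name `seen`, the temporary file and the `by_prefix` variable are my own choices. Writing the key with commas (`$1,$2,$3`) makes awk join the parts with SUBSEP, which avoids the small risk that plain concatenation treats something like "ab" "c" and "a" "bc" as the same key.

```shell
# Write the sample input from the question to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
import a.b.c.d.f.Class1
import a.b.g.d.f.Class2
import a.b.h.d.f.Class3
import z.y.x.d.f.Class4
import z.y.x.d.f.Class5
import z.y.x.d.f.Class6
EOF

# With -F. the first field is "import a", the second "b", the third "c".
# The comma-separated subscript is joined with SUBSEP, so the three
# parts cannot run together into an ambiguous key.
by_prefix=$(awk -F. '!seen[$1,$2,$3]++' "$input")

printf '%s\n' "$by_prefix"
rm -f "$input"
```

On the sample data this prints the Class1, Class2, Class3 and Class4 lines.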

answered Jul 29 '16 at 09:42

fedorqui

There are some subtle variations on the first three tokens between the String so I need to do just [.] Can't do space. I updated the question. – More Than Five Jul 29 '16 at 09:50
@MoreThanFive then `awk -F '[ .]' '!uniq[$2 $3 $4]++' file` should make it. If space is not allowed, then you need to call `split()` on `$2` and then use the 1st, 2nd and 3rd elements of the array that gets created. – fedorqui Jul 29 '16 at 09:52
@MoreThanFive for the record: I updated the answer to show what I explained in comments. – fedorqui Jul 29 '16 at 12:53
If you want the last one instead of the first one, one technique would be to use `tac` both before and after the awk command to reverse the lines. – lolesque Jun 09 '23 at 16:39
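
A minimal sketch of the `tac` idea from the last comment, assuming GNU coreutils is available (`tac` is not installed by default everywhere). The key on the first three dot-separated fields, the array name `seen`, the temporary file and the `last_seen` variable are my own choices for illustration.

```shell
# Write the sample input from the question to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
import a.b.c.d.f.Class1
import a.b.g.d.f.Class2
import a.b.h.d.f.Class3
import z.y.x.d.f.Class4
import z.y.x.d.f.Class5
import z.y.x.d.f.Class6
EOF

# Reverse the lines, keep the first line seen for each three-token
# prefix (which is the last one in the original order), then reverse
# again to restore the original ordering.
last_seen=$(tac "$input" | awk -F. '!seen[$1,$2,$3]++' | tac)

printf '%s\n' "$last_seen"
rm -f "$input"
```

On the sample data this keeps the Class1, Class2, Class3 and Class6 lines.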
