
I have a file that looks like this:

1254543534523233434
3453453454323233434
2342342343223233535
0909909092324243535

Is there a way or command in bash to remove duplicate lines from the file above, based on a specific substring, without changing the order of the lines in the output?

For example, with the substring ${line:11:8}, the output should be:

1254543534523233434
2342342343223233535
0909909092324243535

I know that:

sort -u : sorts the lines, then removes duplicates, so the original order is lost
sort -kx,x -u : the same, keyed on a field
cat filein | uniq : only removes adjacent duplicates, so the input must already be sorted

I'm trying to figure out if there's a native Linux solution without having to resort to Perl code for it. Thank you in advance.

onlyf
  • This is not an exact duplicate. It has the additional constraint of comparing lines based only on a substring, but printing the complete line. However, the [answer](http://stackoverflow.com/questions/1444406/how-can-i-delete-duplicate-lines-in-a-file-in-unix) should be easily extensible to `awk '!seen[substr($0, 11, 8)]++' file.txt`. – Martin Nyolt Aug 22 '16 at 09:56

1 Answer


You can use awk, with no need for sorting:

awk '!uniq[substr($0, 12, 8)]++' file

1254543534523233434
2342342343223233535
0909909092324243535
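  • awk string indexes start at 1, whereas bash's ${line:11:8} uses a 0-based offset, so you need substr($0, 12, 8) to get the same 8-character text, starting at the 12th position.
  • uniq is an associative array keyed by the substring that the substr function returns.
  • The post-increment ++ returns the previous count for that key (0 the first time) and then increments it, so !uniq[...]++ is true only the first time a substring appears, and awk prints that line by default.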
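
To make the idiom above easier to follow, here is a minimal sketch of the same logic written out step by step. The helper name dedupe_by_substr and its two arguments (1-based start column and key length) are assumptions made for illustration; they are not part of the original answer.

```shell
# Hypothetical helper: print the first line seen for each key, in input order.
# $1 = 1-based start column, $2 = key length; input is read from stdin.
dedupe_by_substr() {
    awk -v start="$1" -v len="$2" '
    {
        key = substr($0, start, len)   # awk strings are 1-indexed
        if (!(key in seen)) {          # first occurrence of this key
            seen[key] = 1              # remember it
            print                      # emit the whole line unchanged
        }
    }'
}
```

It could be called as `dedupe_by_substr 12 8 < file`. The explicit `key in seen` test behaves like `!uniq[...]++` for this purpose, but avoids keeping a running count that is never used.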
anubhava