Remove duplicate lines based on starting pattern using bash

Question

I'm trying to remove duplicates from a list of Jira tickets that use the following syntax:

XXXX-12345: a description

where 12345 is a pattern like [0-9]+ and the XXXX is constant. For example, the following list:

XXXX-1111: a description
XXXX-2222: another description
XXXX-1111: yet another description

should get cleaned up like this:

XXXX-1111: a description
XXXX-2222: another description

I've been trying with sed, but what I had worked on Mac and not on Linux. I think it would be easier with awk, but I'm not an expert in either of them.

I tried:

sed -r '$!N; /^XXXX-[0-9]+\n\1/!P; D' file

Replacing `$0` with `$1` in the accepted answer to this [related question](https://stackoverflow.com/q/1444406/1331399) should do the trick – Thor, Dec 10 '20 at 16:48
@Thor Thanks! It worked. Could you explain the command to me, please? I understand the idea behind using awk '!seen' but I don't understand why $1, or how it identifies the pattern in my use case. – Juan Vega, Dec 10 '20 at 18:32
@JuanVega: awk splits each line into fields according to what `FS` is set to; it defaults to sequences of spaces and tabs. This splitting sets the positional variables `$1`, `$2`, ... accordingly, so `$1` is the first field, up to the first space/tab – Thor, Dec 10 '20 at 18:37
@anubhava I was trying to use `sed -r '$!N; /^XXXX-[0-9]+\n\1/!P; D'` because I found another answer where it was used to delete duplicated lines. In the original answer there was `(.*)` instead of `XXXX-[0-9]+`. But clearly I don't understand how it works, because it doesn't work. – Juan Vega, Dec 10 '20 at 18:37
@Thor OK, now I understand. So in my case it works basically because there is always a space after `:`. So if I want to split on the first colon, to handle lines without whitespace, I should use `awk -F ':' '!seen[$1]++'`, right? I was confused because while searching for information I saw use cases that were using $0 instead of $1. – Juan Vega, Dec 10 '20 at 18:41

score 1 · Accepted Answer · answered Dec 10 '20 at 19:04

This simple awk command should produce the desired output:

awk '!seen[$1]++' file

XXXX-1111: a description
XXXX-2222: another description
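
To see why the key is `$1`, here is a small sketch on made-up input. The third line, which has no space after the colon, is a hypothetical edge case that does not appear in the question. With the default field separator, `$1` runs up to the first space or tab, so that line gets a different key; with `-F:` the key stops at the first colon.

```shell
# Sample input; the last line (no space after the colon) is a hypothetical edge case.
input='XXXX-1111: a description
XXXX-2222: another description
XXXX-1111:no space here'

# Default FS: $1 ends at the first space/tab, so the third line's key is
# "XXXX-1111:no", which differs from "XXXX-1111:" and is kept.
by_space=$(printf '%s\n' "$input" | awk '!seen[$1]++')

# -F: makes $1 end at the first colon, so the third line's key is
# "XXXX-1111" again and the line is dropped as a duplicate.
by_colon=$(printf '%s\n' "$input" | awk -F: '!seen[$1]++')

printf 'default FS:\n%s\n\n-F: variant:\n%s\n' "$by_space" "$by_colon"
```

`seen[$1]++` evaluates to 0 (false) the first time a key appears and to a positive number afterwards, so `!seen[$1]++` is true only for the first line with each key, and awk's default action prints that line.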

anubhava

Yes, I ended up using that one as also suggested by @Thor. Thanks! – Juan Vega Dec 11 '20 at 08:27

dawg · Answer 2 · 2020-12-10T19:06:39.953

0

If the digits alone define a duplicate, you could use:

awk -F: '{split($1,arr,/-/); if (seen[arr[2]]++) next} 1' file

If the XXXX is always the same, you can simplify to:

awk -F: '!seen[$1]++' file

Either prints:

XXXX-1111: a description
XXXX-2222: another description
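
The two commands only differ once the prefix varies. A quick sketch, assuming a hypothetical second prefix `YYYY` that is not in the question:

```shell
# Sample input; the YYYY prefix is a made-up example of a changing prefix.
input='XXXX-1111: a description
YYYY-1111: same number, different prefix
XXXX-2222: another description'

# Key = digits only: split "XXXX-1111" on "-" and use the second piece.
digits_only=$(printf '%s\n' "$input" |
  awk -F: '{split($1,arr,/-/); if (seen[arr[2]]++) next} 1')

# Key = everything before the first colon, prefix included.
whole_key=$(printf '%s\n' "$input" | awk -F: '!seen[$1]++')

printf 'digits only:\n%s\n\nwhole key:\n%s\n' "$digits_only" "$whole_key"
```

With the digits-only key, the `YYYY-1111` line counts as a duplicate of `XXXX-1111` and is dropped; with the whole key, all three lines are kept.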

edited Dec 10 '20 at 19:06

answered Dec 10 '20 at 19:01

Thanks! I'll keep that first one in mind in case the characters end up changing at some point. – Juan Vega Dec 11 '20 at 08:36

potong · Answer 3 · 2020-12-11T12:51:49.677

0

This might work for you (GNU sed):

sed -nE 'G;/^([^:]*:).*\n\1/d;P;h' file

-nE turns on explicit printing and extended regexps.
G appends the unique lines stored so far in the hold space to the current line.
/^([^:]*:).*\n\1/d deletes the current line if its key already appears among the stored lines.
P otherwise prints the current line, and
h stores the pattern space (the current line plus the earlier unique lines) back in the hold space.

N.B. Your sed solution would work (not as is but with some tweaking) but only if the file(s) were sorted by the key.

sed -E 'N;/^([^:]*:).*\n\1/!P;D' file
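
A rough side-by-side of the two sed variants on the question's sample, as a sketch. One assumption here: the second command uses `$!N` (as in the question's attempt) instead of a bare `N`, so it does not rely on how a given sed treats `N` on the last line.

```shell
input='XXXX-1111: a description
XXXX-2222: another description
XXXX-1111: yet another description'

# Hold-space version: works on unsorted input, keeps input order,
# and keeps the FIRST line seen for each key.
unsorted=$(printf '%s\n' "$input" | sed -nE 'G;/^([^:]*:).*\n\1/d;P;h')

# Adjacent-pair version: only compares neighbouring lines, so sort first.
# Note that it keeps the LAST line of each run of equal keys.
sorted=$(printf '%s\n' "$input" | sort | sed -E '$!N;/^([^:]*:).*\n\1/!P;D')

printf 'hold space:\n%s\n\nsorted pairs:\n%s\n' "$unsorted" "$sorted"
```

Both print one line per key, but for `XXXX-1111` the first variant keeps "a description" while the second keeps "yet another description", because the earlier line of a matching pair is the one that gets discarded.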

edited Dec 11 '20 at 12:51

answered Dec 11 '20 at 12:32

I didn't add the code but yes, I had the lines sorted before using my non-working solution. I'm curious, is the solution you propose the tweaking I would need? I'm not an expert on regular expressions, so what exactly does that regex do to use only the XXXX-1234 part in the comparison? – Juan Vega Dec 11 '20 at 15:00
Thanks for the explanation! – Juan Vega Dec 11 '20 at 16:27
@JuanVega In a regexp you can group matching parts by enclosing them in parens. You can then refer to these groups by back references, which are numbered from the leftmost paren, e.g. /(aaa)(bbb)\1\2/ would match the string aaabbbaaabbb and /((aaa)bbb)\1\2/ would match aaabbbaaabbbaaa. Thus the regexp /^([^:]*:).*\n\1/ would match the same key twice and, in the solution above, would delete that line. HTH. BTW the first solution works sorted or unsorted, the second only when sorted. – potong Dec 12 '20 at 12:26
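
The back reference examples in the comment above can be checked directly with `sed -nE`, which prints only lines that match. This is a sketch; the one-letter descriptions in the last two checks are made-up sample data.

```shell
# \1 repeats "aaa" and \2 repeats "bbb".
m1=$(printf 'aaabbbaaabbb\n' | sed -nE '/^(aaa)(bbb)\1\2$/p')

# Groups are numbered by their opening paren: \1 is "aaabbb", \2 is "aaa".
m2=$(printf 'aaabbbaaabbbaaa\n' | sed -nE '/^((aaa)bbb)\1\2$/p')

# The key regex on two joined lines: N appends the second line after a
# newline, and \n\1 requires the same "key:" to start that second line.
different_keys=$(printf 'XXXX-1111: a\nXXXX-2222: b\n' | sed -nE 'N;/^([^:]*:).*\n\1/p')
same_keys=$(printf 'XXXX-1111: a\nXXXX-1111: b\n' | sed -nE 'N;/^([^:]*:).*\n\1/p')

printf 'm1=%s\nm2=%s\ndifferent keys: [%s]\nsame keys:\n%s\n' \
  "$m1" "$m2" "$different_keys" "$same_keys"
```

Only the pair with equal keys matches, which is why the text after the colon never takes part in the comparison.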
