
I need a solution to delete duplicate lines where the first field is an IPv4 address. For example, I have the following lines in a file:

192.168.0.1/text1/text2
192.168.0.18/text03/text7
192.168.0.15/sometext/sometext
192.168.0.1/text100/ntext
192.168.0.23/othertext/sometext

So all that matters in the previous scenario is the IP address. All I know is that the regex for an IP address is:

\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b

It would be nice if the solution were a single line and as fast as possible.

  • reader, what does a 'duplicate' mean to you? That is, since you specifically mentioned matching IP addresses, it appears to me that you want to keep only one line per IP address. Is this correct? If not, then, like @jcollado mentions, you should just use sort. – ArjunShankar Feb 13 '12 at 09:36
  • @ArjunShankar I guess he might want to remove duplicated lines only if $1 is an IP address. So if there are two (or more) lines like "abcdefg", they won't be removed. But this is just my guess. – Kent Feb 13 '12 at 09:55
  • @Kent: You could be right. Our confusion basically occurs because *all* lines in the example are IP addresses *and* there are no real duplicates except matching IPs. – ArjunShankar Feb 13 '12 at 09:57

3 Answers


If the file contains lines only in the format you show, i.e. the first field is always an IP address, you can get away with one line of awk:

awk '!x[$1]++' FS="/" $PATH_TO_FILE

EDIT: This removes duplicates based only on the IP address. When I wrote this answer, I wasn't sure whether that was what the OP wanted.
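To illustrate how this works (the file name ips.txt is an assumption): with FS set to "/", $1 is the IP address; the array x counts how many times each IP has been seen, so !x[$1]++ is true only on the first occurrence, and only that line is printed. Run against the sample data from the question:

awk '!x[$1]++' FS="/" ips.txt

# output:
# 192.168.0.1/text1/text2
# 192.168.0.18/text03/text7
# 192.168.0.15/sometext/sometext
# 192.168.0.23/othertext/sometext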

ArjunShankar
  • +1 for shortest solution that preserves the original order as well. – anubhava Feb 13 '12 at 09:30
  • +1 no matter how the OP defines his "duplicated lines", this solution can easily be extended to match his needs. The classic !a[$n]++ usage to remove duplicates. – Kent Feb 13 '12 at 10:02
  • The solution is perfect! Exactly what I needed! Thanks a lot for the reply. – reader Feb 13 '12 at 10:33

The awk that ArjunShankar posted worked wonders for me.

I had a huge list of items with multiple copies of each value in field 1 and a sequential number in field 2. I needed the "newest", i.e. highest, sequential number for each unique field 1 value.

I had to use sort -rn first to push the highest numbers into the "first entry" position, because the awk keeps the first entry it sees for each key and ignores later ones, rather than keeping the last/most recent one in the list.
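A rough sketch of that pipeline, assuming the same "/"-separated format and a file named items.txt (both are assumptions, not from the original post):

# sort numerically, in reverse, on field 2 so the highest sequence number for
# each field-1 key becomes the first entry awk sees; awk then keeps only that line
sort -t/ -k2,2nr items.txt | awk -F/ '!seen[$1]++'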

Thanks, ArjunShankar!

Bee Kay

If you don't need to preserve the original ordering, one way to do this is to use sort:

sort -u <file>
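Note that sort -u compares whole lines, so it removes a line only when every field matches. If duplicates should instead be detected on the first field alone, sort can key on that field; a minimal sketch, assuming "/" is the field separator:

# -t/ sets the field separator, -k1,1 limits the comparison key to field 1,
# and -u keeps only one line per distinct key
sort -t/ -u -k1,1 <file>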
jcollado