-3
somethingsame,somethingsame_usage,2015-11-30 01:00:00,0
somethingsame,somethingsame_usage,2015-11-30 02:00:00,0
somethingsame,somethingsame_usage,2015-11-30 03:00:00,0
somethingelse,somethingelse_usage,2015-11-30 01:00:00,0
somethingelse,somethingelse_usage,2015-11-30 02:00:00,0
somethingelse,somethingelse_usage,2015-11-30 03:00:00,0

I want to remove the duplicate lines so that the end product is:

somethingsame,somethingsame_usage,2015-11-30 02:00:00,0
somethingelse,somethingelse_usage,2015-11-30 03:00:00,0

The only thing that changes within each group of data (i.e. somethingsame vs. somethingelse) is the time; everything else is the same. It does not matter which line/time I keep; I just want one line per group.

4 Answers

4

If you don't care what order the lines are output in, you can do this with sort, using the -u (unique) command-line flag, which outputs only one instance of each set of identical lines.

Unlike uniq, sort -u compares only the part of the line defined by the -k options, so you can specify precisely which fields count as part of the uniqueness test. In this case, you could use:

sort -u -t, -k1,2

where -t, means that the field delimiter is a comma, and -k1,2 means that the "key" consists of everything from the first character in the first field to the last character in the second field.

Note that the -k argument is a range, not a list: -k1,3 would mean the first three fields, while -k2 would mean "from the second field to the end of the line".
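
For example, on the sample data from the question (saved here as input.txt, a file name assumed for illustration), it might look like this; note that GNU sort -u keeps the first input line of each key group, while other implementations may keep a different one:

$ sort -u -t, -k1,2 input.txt
somethingelse,somethingelse_usage,2015-11-30 01:00:00,0
somethingsame,somethingsame_usage,2015-11-30 01:00:00,0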

rici
  • 234,347
  • 28
  • 237
  • 341
  • Minahil, if your input file is not sorted, this is a very good solution. If your input is sorted, calling sort again on it can be redundant and costly in terms of processing time. It depends on whether you have a few KB or a few GB of input. – Ramón Gil Moreno Dec 02 '15 at 01:14
  • @RamónGilMoreno: The sort utility is well optimized for large files which are already sorted. – rici Dec 02 '15 at 01:16
  • I had read this question in the past http://stackoverflow.com/questions/930044/how-could-the-unix-sort-command-sort-a-very-large-file and I don't have any doubt sort will deal with large files, but it still looks like it consumes a big bunch of resources. IMHO. – Ramón Gil Moreno Dec 02 '15 at 01:20
  • my input is not sorted and I'm dealing with around 500,000 lines; what's another solution? This seems to work. –  Dec 02 '15 at 16:47
  • @Minahil: If your input is not sorted and you want to remove duplicates even if they are not consecutive, this is the solution. – rici Dec 02 '15 at 19:24
2

An idiomatic awk solution is as follows:

$ awk -F, '!a[$1]++' log

somethingsame,somethingsame_usage,2015-11-30 01:00:00,0
somethingelse,somethingelse_usage,2015-11-30 01:00:00,0

The expression !a[$1]++ is true only the first time a given value of the first field is seen, so this picks up the first instance of each key.
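
If you want the uniqueness key to be the first two fields rather than just the first (it makes no difference on this sample, where the second field is determined by the first), a small variation should work in any POSIX awk:

$ awk -F, '!a[$1,$2]++' log

somethingsame,somethingsame_usage,2015-11-30 01:00:00,0
somethingelse,somethingelse_usage,2015-11-30 01:00:00,0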

karakfa
  • 66,216
  • 7
  • 41
  • 56
0

The following solution uses awk, but it is not my favourite (I will post my favourite in a separate answer).

What does it do?

Line by line, it simply keeps track of the previous line's relevant values (the first two fields, stored in the variables previous1 and previous2). These values are updated at the end of processing each line.

Upon finding a line whose current values (current1 and current2) differ from the previous ones, it simply calls print $0 to print the whole line.

I also configure the field separator (the FS value) to be a comma.

You can build more elaborate criteria to decide whether two lines are equal and whether the new line needs printing.

Here is the complete console dump:

$ cat input.txt 
somethingsame,somethingsame_usage,2015-11-30 01:00:00,0
somethingsame,somethingsame_usage,2015-11-30 02:00:00,0
somethingsame,somethingsame_usage,2015-11-30 03:00:00,0
somethingelse,somethingelse_usage,2015-11-30 01:00:00,0
somethingelse,somethingelse_usage,2015-11-30 02:00:00,0
somethingelse,somethingelse_usage,2015-11-30 03:00:00,0
$ awk 'BEGIN { FS="," } { current1 = $1; current2 = $2; if ((previous1 != current1) || (previous2 != current2)) { print $0 } previous1 = current1; previous2 = current2; }' input.txt
somethingsame,somethingsame_usage,2015-11-30 01:00:00,0
somethingelse,somethingelse_usage,2015-11-30 01:00:00,0
$ 
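
For readability, the same logic can be laid out as a standalone script (a sketch; dedup.awk is a hypothetical file name):

$ cat dedup.awk
BEGIN { FS = "," }
{
    current1 = $1; current2 = $2
    # print the whole line when either key field differs from the previous line
    if ((previous1 != current1) || (previous2 != current2))
        print $0
    previous1 = current1; previous2 = current2
}
$ awk -f dedup.awk input.txt
somethingsame,somethingsame_usage,2015-11-30 01:00:00,0
somethingelse,somethingelse_usage,2015-11-30 01:00:00,0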
0

This is a different solution, using uniq, which works given that your input file is already sorted.

Note the hack: I simply strip the irrelevant part of the line, so it won't appear in the result:

$ cat input.txt
somethingsame,somethingsame_usage,2015-11-30 01:00:00,0
somethingsame,somethingsame_usage,2015-11-30 02:00:00,0
somethingsame,somethingsame_usage,2015-11-30 03:00:00,0
somethingelse,somethingelse_usage,2015-11-30 01:00:00,0
somethingelse,somethingelse_usage,2015-11-30 02:00:00,0
somethingelse,somethingelse_usage,2015-11-30 03:00:00,0
$ cat input.txt | awk 'BEGIN { FS = "," } { print $1 "," $2 }' | uniq
somethingsame,somethingsame_usage
somethingelse,somethingelse_usage
$
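
The awk stage here only selects fields, so the same stripping can be done more concisely with cut (a minimal equivalent sketch, assuming the same input.txt):

$ cut -d, -f1,2 input.txt | uniq
somethingsame,somethingsame_usage
somethingelse,somethingelse_usage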