
I have a big text file like this example:

chr1    109472560   109472561   -4732   CLCC1
chr1    109472560   109472561   -4732   CLCC1
chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109477498   109477499   206 CLCC1

There are some repeated lines and I want to keep only one copy of each. For the above example the expected output would look like this:

chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1

I am trying to do that in awk using the following command:

awk myfile.txt | uniq > uniq_file_name.txt

but the output is empty. Do you know how to fix it?

user10657934
  • Your command fails because `awk` fails: you did not specify what awk needs to do. If `myfile.txt` looks just like your example, then you could do `uniq myfile.txt`. This will remove consecutive lines which are duplicates. If you want to remove all lines which are duplicates, you can do `awk '!a[$0]++' myfile.txt`. Note these are two completely different solutions. – kvantour Dec 12 '18 at 13:17
  • With repeated, do you mean consecutively repeated, or do you want to remove all duplicates from the file? – kvantour Dec 12 '18 at 13:20
  • Possible duplicate of [How can I delete duplicate lines in a file in Unix?](https://stackoverflow.com/questions/1444406/how-can-i-delete-duplicate-lines-in-a-file-in-unix) (see the accepted answer, but also the comments under the question) – kvantour Dec 12 '18 at 13:21

4 Answers


EDIT: As hek2mgl mentioned, in case you need to remove only consecutive duplicate lines, try the following.

Let's say the following is the Input_file:

cat Input_file
chr1    109472560   109472561   -4732   CLCC1
chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109472560   109472561   -4732   CLCC1

Now run the following code:

awk 'prev!=$0;{prev=$0}'  Input_file

The output will be as follows:

chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109472560   109472561   -4732   CLCC1

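For reference, here is the same one-liner expanded with comments, in the style of the explanation further below (for explanation purposes only; to run it, use the one-liner above):

awk '
prev!=$0;     ##Condition: print the current line when it differs from the previous one.
              ##A bare condition has no action, so the default action of awk, which is to print the line, runs.
{prev=$0}     ##Save the current line so the next line can be compared against it.
' Input_file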

The following snippet will remove all duplicate lines, not only consecutive ones:

awk '!a[$0]++'  Input_file

Append `> output_file` to the above command in case you want to write the output into a separate file.
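
For example, to write the de-duplicated lines into output_file:

awk '!a[$0]++' Input_file > output_file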

Explanation: The following is for explanation purposes only; to run the code, use the one-liner above.

awk '
!a[$0]++      ##Check whether the current line already exists as an index in array a, then increment its count.
              ##The first time a line is seen its count is 0, so the condition is TRUE; on every
              ##later occurrence the count is non-zero, the condition is FALSE, and the line is skipped.
              ##awk works on condition-action pairs; since no action is mentioned here, the default
              ##action, printing the current line, happens whenever the condition is TRUE.
'  Input_file  ##Mentioning the Input_file name here.
RavinderSingh13

Your command:

$ awk myfile.txt | uniq > uniq_file_name.txt

and more precisely this part:

$ awk myfile.txt

will fail, because awk takes its first non-option argument as the program to execute, not as input, and `myfile.txt` is not a valid awk program. That is why your output is empty. The minimum you need to do to print all the lines is:

$ awk 1 myfile.txt

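Here `1` is a condition that is always true, and since no action is given, awk falls back to its default action, printing the line. So the above is shorthand for:

$ awk '1 { print }' myfile.txt
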
But since you had no awk script, I assume you don't actually need awk; just use uniq (depending on your need, either):

$ uniq myfile.txt
chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1

or

$ sort myfile.txt | uniq

which for that input will produce the same output.

Update:

Regarding the discussion in the comments about why sort: if repeated lines means all duplicated records anywhere in the file, use the sort. If it means only consecutive duplicated lines, forget the sort.

James Brown

This is to show the difference between uniq, awk '!a[$0]++' and sort -u.

uniq: removes consecutive duplicate lines, keeps order:

$ printf 'b\nb\na\nb\nb\n' | uniq
b
a
b

awk '!a[$0]++': removes all duplicates, keeps order:

$ printf 'b\nb\na\nb\nb\n' | awk '!a[$0]++'
b
a

sort -u: removes all duplicates and sorts the output:

$ printf 'b\nb\na\nb\nb\n' | sort -u
a
b
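
Applied to the file from the question, assuming duplicates may also appear non-consecutively and a sorted result is acceptable, you could for instance use:

$ sort -u myfile.txt > uniq_file_name.txt
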
kvantour

Using Perl

> cat user106.txt
chr1    109472560   109472561   -4732   CLCC1
chr1    109472560   109472561   -4732   CLCC1
chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109477498   109477499   206 CLCC1
> perl -ne ' print unless $kv{$_}++ ' user106.txt
chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1
>

To remove consecutive repeated lines:

> printf 'a\nb\nb\nb\nc\nc\nd\na\n' | perl -ne ' print if $prev ne $_ ; $prev=$_ ' -
a
b
c
d
a
>
stack0114106