
I have a big text file like this example:

chr1    109472560   109472561   -4732   CLCC1
chr1    109472560   109472561   -4732   CLCC1
chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109477498   109477499   206 CLCC1

There are some repeated lines and I want to keep only one copy of each. For the above example the expected output would look like this:

chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1

I am trying to do that in awk using the following command:

awk myfile.txt | uniq > uniq_file_name.txt

but the output is empty. Do you know how to fix it?

user10657934
  • Your command fails because `awk` fails: you did not specify what awk needs to do. If `myfile.txt` looks just like your example, then you could do `uniq myfile.txt`. This will remove consecutive lines which are duplicates. If you want to remove all lines which are duplicates, you can do `awk '!a[$0]++' myfile.txt`. Note these are two completely different solutions. – kvantour Dec 12 '18 at 13:17
  • With repeated, do you mean consecutively repeated, or do you want to remove all duplicates from the file? – kvantour Dec 12 '18 at 13:20
  • Possible duplicate of [How can I delete duplicate lines in a file in Unix?](https://stackoverflow.com/questions/1444406/how-can-i-delete-duplicate-lines-in-a-file-in-unix) (see the accepted answer, but also the comments under the question) – kvantour Dec 12 '18 at 13:21

4 Answers


EDIT: As hek2mgl mentioned, in case you need to remove only consecutive duplicate lines, try the following.

Let's say the following is the Input_file:

cat Input_file
chr1    109472560   109472561   -4732   CLCC1
chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109472560   109472561   -4732   CLCC1

Now run the following code:

awk 'prev!=$0;{prev=$0}'  Input_file

The output will be as follows:

chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109472560   109472561   -4732   CLCC1

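For reference, here is the same one-liner expanded with comments, in the style of the explanation further below (for explanation purposes only; to run it, use the one-liner above):

awk '
prev!=$0;     ##Condition: print the current line when it differs from the previous one.
              ##A bare condition has no action, so the default action of awk, which is to print the line, runs.
{prev=$0}     ##Save the current line so the next line can be compared against it.
' Input_file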

The following snippet will remove all duplicate lines, not only consecutive ones:

awk '!a[$0]++'  Input_file

Append `> output_file` to the above command in case you want to write the output into a separate file.
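
For example, to write the de-duplicated lines into output_file:

awk '!a[$0]++' Input_file > output_file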

Explanation: The following is for explanation purposes only; to run the code, use the one-liner above.

awk '
!a[$0]++      ##Check whether the current line already exists as an index in array a, then increment its count.
              ##The first time a line is seen its count is 0, so the condition is TRUE; on every
              ##later occurrence the count is non-zero, the condition is FALSE, and the line is skipped.
              ##awk works on condition-action pairs; since no action is mentioned here, the default
              ##action, printing the current line, happens whenever the condition is TRUE.
'  Input_file  ##Mentioning the Input_file name here.
RavinderSingh13

Your command:

$ awk myfile.txt | uniq > uniq_file_name.txt

and more precisely this part:

$ awk myfile.txt

will fail, because awk takes its first non-option argument as the program to execute, not as input, and `myfile.txt` is not a valid awk program. That is why your output is empty. The minimum you need to do to print all the lines is:

$ awk 1 myfile.txt

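Here `1` is a condition that is always true, and since no action is given, awk falls back to its default action, printing the line. So the above is shorthand for:

$ awk '1 { print }' myfile.txt
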
But since you had no awk script, I assume you don't actually need awk; just use uniq (depending on your need, either):

$ uniq myfile.txt
chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1

or

$ sort myfile.txt | uniq

which for that input will produce the same output.

Update:

Regarding the discussion in the comments about why sort: if repeated lines means all duplicated records anywhere in the file, use the sort. If it means only consecutive duplicated lines, forget the sort.

James Brown

This is to show the difference between uniq, awk '!a[$0]++' and sort -u.

uniq: removes consecutive duplicate lines, keeps order:

$ printf 'b\nb\na\nb\nb\n' | uniq
b
a
b

awk '!a[$0]++': removes all duplicates, keeps order:

$ printf 'b\nb\na\nb\nb\n' | awk '!a[$0]++'
b
a

sort -u: removes all duplicates and sorts the output:

$ printf 'b\nb\na\nb\nb\n' | sort -u
a
b
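
Applied to the file from the question, assuming duplicates may also appear non-consecutively and a sorted result is acceptable, you could for instance use:

$ sort -u myfile.txt > uniq_file_name.txt
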
kvantour

Using Perl

> cat user106.txt
chr1    109472560   109472561   -4732   CLCC1
chr1    109472560   109472561   -4732   CLCC1
chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109477498   109477499   206 CLCC1
chr1    109477498   109477499   206 CLCC1
> perl -ne ' print unless $kv{$_}++ ' user106.txt
chr1    109472560   109472561   -4732   CLCC1
chr1    109477498   109477499   206 CLCC1
>

To remove consecutive repeated lines:

> printf 'a\nb\nb\nb\nc\nc\nd\na\n' | perl -ne ' print if $prev ne $_ ; $prev=$_ ' -
a
b
c
d
a
>
stack0114106