
I need to sort and remove duplicated entries in my large table (space separated), based on the values in the first column (which denotes chr:position).

Initial data looks like:

1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10051 rs1326880612
1:10055 rs892501864

Output should look like:

1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10055 rs892501864

I've tried following this post and variations, but the adapted code did not work:

sort -t' ' -u -k1,1 -k2,2 input > output 

Result:

1:10020 rs775809821

Can anyone advise? Thanks!

2 Answers


It's quite easy with awk. Split each line on either a space or : as the field separator, and group the lines by the word after the colon:

awk -F'[: ]' '!unique[$2]++' file

The -F'[: ]' option defines the field separator used to split each line into individual words, and the part !unique[$2]++ builds a hash-table map keyed on the value of $2. The counter is incremented every time a value is seen in $2, so when the same value appears again the negation condition ! makes the expression false and prevents the line from being printed again.
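
For instance, running it against the sample input above (assuming it is saved as file) should keep only the first of the two lines for the duplicated position 10051:

awk -F'[: ]' '!unique[$2]++' file
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10055 rs892501864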

Defining the field separator as a regex with the -F flag might not be supported by all awk versions. In a POSIX-compliant way, you could do

awk '{ split($0,a,"[: ]"); val=a[2] } !unique[val]++' file
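
As a quick check of what split() puts into the array (using a single sample line):

echo '1:10051 rs1052373574' | awk '{ n=split($0,a,"[: ]"); for (i=1; i<=n; i++) print i, a[i] }'
1 1
2 10051
3 rs1052373574

so a[2] is the position after the colon, matching what $2 was in the -F version.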

The part above assumes you want to deduplicate the file based on the word after the :, but to deduplicate based on the first column as a whole, just do (see the example below)

awk '!unique[$1]++' file
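
The difference matters once more than one chromosome is present. With a hypothetical two-chromosome input (the second line below is made up for illustration), keying on $2 with -F'[: ]' would treat 1:10020 and 2:10020 as duplicates, while keying on the whole first column keeps both:

printf '1:10020 rs775809821\n2:10020 rs999999999\n' | awk '!unique[$1]++'
1:10020 rs775809821
2:10020 rs999999999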
Inian
  • Fantastic, this really did the job! Now, slightly different question, do you know if I could get the content of both duplicated lines removed? (Or, if the same position appears more than once, remove all lines which contain duplicated entries). This would result, for example, in removal also of any line containing "1:10055" in the final output file (because it appears twice). – Rodrigo Duarte May 28 '19 at 13:12
  • @rodduart : Please ask a separate question with the details you have – Inian May 28 '19 at 14:03

Since your input data is pretty simple, the command can be very easy.

sort file.txt | uniq -w7

This just sorts the file and runs uniq comparing only the first 7 characters. Here the first 7 characters are digits and a colon; if any alphabetic characters step in, add -i to make the comparison case-insensitive.
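
As a variant sketch (not part of this answer): if the width of the first field varies, a field-keyed sort avoids the fixed 7-character assumption, since sort -u with a key suppresses all but the first line of each set with equal keys:

sort -t' ' -k1,1 -u file.txt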

yoga
  • Hi Yoga, thanks for your answer - this is a good simple command to learn. Unfortunately, it does not apply to my current data because the length of the number after the colon actually starts with 5 digits at the beginning of the file but increases significantly by the end of the file (and I am not sure what's the max number of digits)... My bad I didn't mention this earlier. I appreciate it anyway! – Rodrigo Duarte May 28 '19 at 20:07