1

Hope you're well. I have found questions that come close to what I'm asking but none exactly. I've been struggling on this for the last 2 weeks and have finally managed to make a bit of progress enough that I feel I can justify asking on here!

I have downloaded a HitTable from a sequence I BLASTED

accno    percent     seqstart seqend        
AC020656|33 84.713  116580  116735  
AC020656|33 90.303  118279  118443  
AC020656|33 87.654  120390  121470  
AC020656|33 82.609  121323  121390      
AC123694|11 77.622  158333  158474  
AC123694|11 84.559  158238  160142  

What I am trying to do is find out is which of these hits for each entry (accession number) are overlapping and which are not so I can carry on with the rest of my pipeline. I'm aiming to put the IDs (or whole row, I'm not fussed at this point!) into separate files and then use them to extract the corresponding FASTA files so I can apply the more appropriate programme depending on if it's overlapping or not.

I have felt the best way to go about it is using awk to:

  1. CHECK that the IDs ($1) are the same (no use checking together AC020656|33 line 4 and AC123694|11 line 4!)
  2. COMPARE if the seq end ($4) is smaller than the seq start ($3) of the next row. If so, print it in a file called "nonover.txt", else print to "overlap.txt"

I began an attempt at this answer by modifying code found HERE:

awk '($1==c1 && $3==c3 && $4==c4){print line RS $0}{line=$0;c1=$1;c4<$3}' mydata.txt

But, no surprise, it doesn't work as I am clearly missing something. The OP in the answer I linked very kindly explained it in such a way that I felt confident tweaking it but well, thats my incompetence!

I've also tried to apply my problem using code found HERE and HERE which I've felt have come close to what I'm trying to do. I've also checked out the manual for awk and while it did help a bit (I tried using the getline function but just kept getting smacked with errors), I just honestly don't think I am well versed enough to tackle this issue at the minute.

And as pointed out by markp-fuso, my intended output for the above would ideally be two files with the following data inside:

noverlap.txt (as each rows seqend is smaller than the next rows seqstart, therefore it's not overlapped)
accno    percent     seqstart seqend        
AC020656|33 84.713  116580  116735  
AC020656|33 90.303  118279  118443  
AC020656|33 87.654  120390  121470

overlap.txt (as each rows seqend is larger than the next rows seqstart, and it is overlapping)
accno    percent     seqstart seqend
AC020656|33 87.654  120390  121470  
AC020656|33 82.609  121323  121390  

AC123694|11 77.622  158333  158474  
AC123694|11 84.559  158238  160142 

As pointed out by Ed Morton, how should acc.no be treated if some entries overlap but others don't - it's fine for them to be split up with some entries in both noverlap.txt and overlap.txt. I will be checking for any identical acc.no between the two folders, will treat the overlap first to then add to the nonoverlapping entries and will carry on from there. Duplicates are fine here (see AC020656|33 87.654 120390 121470 in both txt files), I know how I'm treating these, it's just so I can confirm the method to use on my real data.

TL;DR: Using grouping based on id (Acc. no), can I compare data from a column with data in a different column and the row below? Happy with either loop, script or one/two line answers suitable for a OS user

Thank you kindly in advance, any advice is very welcome and I appreciate you taking the time out to read/answer my question.

UPDATE: Thanks to the wonderful Ed Morton for his solution that has done the trick perfectly. I am just adding what I am doing to remove the single, duplicated entries found in the non-overlapping txt file (but found where they should be in overlap) which is modifying the code found at this answer HERE

Fitzy
  • 33
  • 9
  • 1
    What if you have 4 lines for the same acc ($1) and the seqs overlap for the first 2 lines but don't for the 2nd 2 lines? Do all 4 lines for that acc go into 1 file and if so which, the overlap file or the nooverlap file? If not - what does happen? Please [edit] your sample input/output to include that case. – Ed Morton Sep 28 '21 at 22:33
  • Great point, I haven't seen any cases of that yet but have updated accordingly with what would be ideal! – Fitzy Sep 28 '21 at 23:33

1 Answers1

1

This will produce the provided expected output from the provided sample input:

$ cat tst.awk
{ sub(/\r$/,"") }
NR == 1 { hdr = $0 }
NR  > 2 { prt() }
{ prev = $0 }

function prt(   over, noover, p, out) {
    over   = "overlap.txt"
    noover = "noover.txt"

    if ( !doneHdr++ ) {
        print hdr > over
        print hdr > noover
    }

    split(prev,p)
    if ( ($1 == p[1]) && ($3 <= p[4]) ) {
        print prev > over
        print $0   > over
        print ""   > over
    }
    else {
        print prev > noover
    }
}

$ awk -f tst.awk file
$ head *over*
==> noover.txt <==
accno    percent     seqstart seqend
AC020656|33 84.713  116580  116735
AC020656|33 90.303  118279  118443
AC020656|33 82.609  121323  121390

==> overlap.txt <==
accno    percent     seqstart seqend
AC020656|33 87.654  120390  121470
AC020656|33 82.609  121323  121390

AC123694|11 77.622  158333  158474
AC123694|11 84.559  158238  160142

If that's not all you need then edit your question to provide more truly representative sample input/output that includes cases that the above doesn't work for.

Note that the above requires at least 2 data lines in addition to the header to be present in the input. If there was only 1 data line it wouldn't be printed. If that's an issue add some logic to print the hdr and prev in an END section if NR is less than 3 or similar.

Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • Hi Ed, many thanks for your help, I have tried out the script and it's very nearly there! What is happening is that the overlapping sequences are being flagged correctly and put into the overlap.txt. BUT the corresponding sequence is not been added, so it's a single entry. My data is formatted as above, I haven't touched your code, so not sure where mine is going AWOL. – Fitzy Sep 29 '21 at 13:15
  • I don't know what `BUT the corresponding sequence is not been added, so it's a single entry` means, sorry. – Ed Morton Sep 29 '21 at 13:31
  • Sorry - I mean using your answer I have ```AC123694|11 84.559 158238 160142``` in overlap.txt but ```AC123694|11 77.622 158333 158474``` is placed in noover.txt – Fitzy Sep 29 '21 at 13:33
  • 1
    OK, I updated my answer to produce the new expected output from the new input. – Ed Morton Sep 29 '21 at 13:41
  • Ed, I cannot thank you enough, that's done exactly what I've asked for. Thank you so much, I'm so greatful :D (PS, I'm just going to edit your answer to show what I'm doing to remove any duplicates, is that okay? Or is it more appropriate to put that in the comments?) – Fitzy Sep 29 '21 at 14:16
  • 1
    You're welcome. I'd rather you just add something to the end of your question about that since that's something that could be done the right way or the wrong way and I'd just as soon not be involved in that and maybe get comments from other people about it, etc. In my mind at least, this question is answered and it's time to move on. – Ed Morton Sep 29 '21 at 14:52
  • 1
    Thank you once again Ed, I'll amend my post with what I'm using then! All the best and have a great week – Fitzy Sep 29 '21 at 14:54