
I am trying to convert a file containing a column with scaffold numbers and another one with the corresponding individual sites into a BED file that lists the sites as ranges. For example, this file ($indiv.txt):

SCAFF      SITE
1           1
1           2
1           3
1           4
1           5 
3           1
3           2
3           34
3           35
3           36

should be converted into $indiv.bed:

SCAFF      SITE-START   SITE-END
1              1            5
3              1            2
3              34           36

Currently, I am using the following code, but it is very slow, so I wanted to ask if anybody can come up with a quicker way.

COMMAND:

for scaff in $(awk '{print $1}' $indiv.txt | uniq)
do
      awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt | awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' | sed "s/^/$scaff\t/" >> $indiv.bed
done

DESCRIPTION:

awk '{print $1}' $indiv.txt | uniq  #outputs a list with the unique scaffold numbers 

awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt #extracts the values from column 2 if the value in the first column equals the variable $scaff

awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' #converts the list of sequential numbers into ranges as described here: https://stackoverflow.com/questions/26809668/collapse-sequential-numbers-to-ranges-in-bash

sed "s/^/$scaff\t/" >> $indiv.bed #adds a column with the respective scaffold number and then outputs the file into $indiv.bed
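As a side note, the four steps described above can be collapsed into one awk pass over the file, which avoids re-reading it once per scaffold. This is only a sketch: it assumes the input is already sorted by scaffold and then by site, as in the example, and uses indiv.txt / indiv.bed as placeholder names:

```shell
# Single-pass sketch: track the current scaffold and the current run of
# consecutive sites; emit a range line whenever the scaffold changes or
# the run of consecutive sites breaks.
awk 'NR == 1 { print $1, "SITE-START", "SITE-END"; next }  # header line
     $1 != scaff || $2 != last + 1 {                       # new scaffold or gap
         if (NR > 2) print scaff, first, last
         scaff = $1; first = $2
     }
     { last = $2 }
     END { if (NR > 1) print scaff, first, last }' indiv.txt > indiv.bed
```

Reading the file exactly once keeps the cost linear in the number of lines, instead of one full scan (plus several subprocesses) per scaffold.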

Thanks a lot in advance!

mernster
  • Maybe https://stackoverflow.com/a/38627863/874188 can help you straighten out this pretzel logic slightly. – tripleee Aug 30 '19 at 13:02
    To all the "put on hold"-ers: The question doesn't say "this isn't working", but "it's slow". Are such questions really off-topic here? The code runs well and shows the expected output. – choroba Aug 30 '19 at 13:02
  • Also, here's a solution in Perl that should be faster: `tail -n+2 indiv.txt | sort -u -nk1,1 -nk2,2 | perl -ane 'END {print " $F[1]"} next if $p[0] == $F[0] && $F[1] == $p[1] + 1; print " $p[1]\n@F"; } continue { @p = @F;'` – choroba Aug 30 '19 at 13:22
  • solution using single `awk` call; assumes data is in file`scaff.txt`; applies set of tests/actions as each line is read: `awk 'function print_scaff () { if ( scaff != -9 ) { printf "%-10s %-15s %-10s\n",scaff,start,end } } $1 == "SCAFF" { printf "%-10s %-15s %-10s\n","SCAFF","SITE-START","SITE-END"; scaff=-9; next } $1 != scaff { print_scaff() ; scaff=$1; start=$2; end=$2; next } $2 != (end+1) { print_scaff(); start=$2; end=$2 ; next } $2 == (end+1) { end=$2 ; next } { print_scaff() } END { print_scaff() }' scaff.txt` NOTE: had to strip out comments to fit into this comment block – markp-fuso Aug 30 '19 at 14:38
  • @mernster key objectives: 1) eliminate the (possibly large volume of) repetitive scans of the data file and 2) eliminate the (possibly large volume of) sub-processes invoked by the various pipes – markp-fuso Aug 30 '19 at 14:59
  • @mernster: the single `awk` solution also assumes the data is already sorted (as displayed by your example); if this is an invalid assumption then you'll need to pre-sort the data (see @choroba's `perl` solution for the pre-sort step) – markp-fuso Aug 30 '19 at 17:07
  • @choroba amazing!! it worked in just a few hours! you make my day :) thank's a lot! – mernster Sep 01 '19 at 21:33
  • @markp thanks a lot for your advice on how to increase the speed super helpful! – mernster Sep 01 '19 at 21:36

1 Answer


Calling several programs for each line of the input must be slow. It's usually better to find a way to process all the lines in one call.

I'd reach for Perl:

tail -n+2 indiv.txt \
| sort -u -nk1,1 -nk2,2 \
| perl -ane 'END {print " $F[1]"}
             next if $p[0] == $F[0] && $F[1] == $p[1] + 1;
             print " $p[1]\n@F";
             } continue { @p = @F;' > indiv.bed

The first two lines sort the input so that the groups are always adjacent (this might be unnecessary if your input is already sorted that way). Perl then reads the lines; `-a` splits each line into the @F array, and the @p array keeps the previous line: if the current line has the same first element and its second element is greater by 1, we go to the continue section, which just stores the current line into @p. Otherwise, we print the last element of the previous section and the first line of the current one. The END block is responsible for printing the last element of the last section.

The output is different from yours for sections that have only a single member.

choroba