-2

I have a BED file in a somewhat tricky, non-standard layout, which I need to convert to classic BED format so that I can use it properly in further steps.

The unconventional format looks like this:

1   12349   12398   +
1   23523   23578   -
1   23550;23570;23590   23640;23689;23652   +
1   43533   43569   +
1   56021;56078   56099;56155   +

Say that the rows with multiple positions represent fragmented non-coding regions.

What I would like to get is a canonical BED file such as:

1   12349   12398   +
1   23523   23578   -
1   23550   23640   +
1   23570   23689   +
1   23590   23652   +
1   43533   43569   +
1   56021   56099   +
1   56078   56155   +

where the multiple regions that were combined in one row are each placed on their own row, while keeping the chromosome number and strand.

montrealist
  • If I had to guess as to why this is getting downvoted, I'd say that it smells like a request for someone to give you a solution, rather than help with a narrow and specific question about writing code. Showing more of your effort/research might help with that impression. – Charles Duffy Mar 15 '19 at 13:47
  • Since you tagged this with `R` you can find an R solution here: [Split comma-separated strings in a column into separate rows](https://stackoverflow.com/q/13773770/8366499) – divibisan Mar 15 '19 at 14:59

4 Answers

1

Could you please try the following?

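awk '
{
  # split the start and end columns on ";" and count the pieces
  num=split($2,array1,";")
  num1=split($3,array2,";")
}
# a row with several starts or ends: print one line per start/end pair
num>1 || num1>1{
  for(i=1;i<=num;i++){
     print $1,array1[i],array2[i],$NF
  }
  next
}
# any other row is printed unchanged
1'  Input_file | column -t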

Output will be as follows.

1  12349  12398  +
1  23523  23578  -
1  23550  23640  +
1  23570  23689  +
1  23590  23652  +
1  43533  43569  +
1  56021  56099  +
1  56078  56155  +
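
The loop above pairs starts and ends purely by index, so a row whose two lists have different lengths would come out with an empty or dropped coordinate. Here is a minimal sketch of a guarded variant, assuming that such rows are malformed and should be skipped with a warning, and assuming tab-separated output is wanted (many BED tools expect tabs). The function name `split_bed` is made up for this illustration and is not part of the answer.

```shell
# split_bed: hypothetical helper name, used only for this sketch.
# Reads the file(s) given as arguments, or standard input if none are given.
split_bed() {
  awk -v OFS='\t' '
  {
    n_start = split($2, starts, ";")
    n_end   = split($3, ends, ";")
    if (n_start != n_end) {
      # Assumption: unequal counts mean a malformed row; warn and skip it.
      printf "line %d: %d starts but %d ends, skipped\n", NR, n_start, n_end > "/dev/stderr"
      next
    }
    # Rows without ";" have exactly one pair, so they pass through this loop too.
    for (i = 1; i <= n_start; i++) {
      print $1, starts[i], ends[i], $NF
    }
  }' "$@"
}

# Usage: one multi-region row becomes two tab-separated rows.
printf '1 56021;56078 56099;56155 +\n' | split_bed
```

Because single-region rows go through the same loop, no separate pass-through rule is needed in this variant.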
RavinderSingh13
  • @CharlesDuffy, Apologies, I first usually write one-liner then convert it to actual form, have edited it now. Kindly do let me know in case of any queries. – RavinderSingh13 Mar 15 '19 at 13:57
  • Looks good -- as-edited, the logic is much clearer. – Charles Duffy Mar 15 '19 at 13:58
  • BTW, as a note for readers -- I typically expect awk to be ~10x faster than well-written bash; that's probably true here, *especially* if there's a large proportion of lines with `;` splits in the input (heredocs and herestrings in bash are implemented in a way that doesn't make them particularly efficient). – Charles Duffy Mar 15 '19 at 14:01
  • Many thanks, I was trying in R but only managed to put the second and third fields on new rows, without retrieving the first and fourth... – Emilio Mármol Sánchez Mar 15 '19 at 14:29
0
#!/usr/bin/env bash
#              ^^^^-- NOT /bin/sh

while read -r a b c d; do
  if [[ $b = *';'* ]]; then         # if b contains any ';'s
    IFS=';' read -r -a ba <<<"$b"   # read string b into array ba
    IFS=';' read -r -a ca <<<"$c"   # read string c into array ca
    for idx in "${!ba[@]}"; do      # iterate over the indices of array ba
      # print a and d with the values for a given index for both ba and ca
      printf '%s\t%s\t%s\t%s\n' "$a" "${ba[idx]}" "${ca[idx]}" "$d"
    done
  else
    printf '%s\t%s\t%s\t%s\n' "$a" "$b" "$c" "$d"
  fi
done

This combines the answers to existing StackOverflow questions:

...and guidance in the BashFAQ:

See this running at https://ideone.com/wmrXPE

Charles Duffy
0
$ cat tst.awk
BEGIN { FS="[[:space:];]+" }
{
    n = (NF - 2) / 2
    for (i=1; i<=n; i++) {
        # starts are fields 2..n+1, so the end paired with start i is field i+1+n
        print $1, $(i+1), $(i+1+n), $NF
    }
}

$ awk -f tst.awk file
1 12349 12398 +
1 23523 23578 -
1 23550 23640 +
1 23570 23689 +
1 23590 23652 +
1 43533 43569 +
1 56021 56099 +
1 56078 56155 +
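
To make the field arithmetic behind this approach easier to follow: once both whitespace and `;` act as separators, a row with n start/end pairs has 2n + 2 fields, with the starts in fields 2 to n+1 and the ends in fields n+2 to 2n+1. The sketch below only prints which field numbers get paired; it assumes every row has as many starts as ends, and `show_fields` is a made-up name for illustration.

```shell
# show_fields: hypothetical helper name, used only for this illustration.
show_fields() {
  awk -F'[[:space:];]+' '
  {
    n = (NF - 2) / 2    # number of start/end pairs on this row
    for (i = 1; i <= n; i++) {
      # start i is field i+1; its matching end is n fields further on
      printf "row %d pair %d: start=$%d end=$%d\n", NR, i, i + 1, i + 1 + n
    }
  }' "$@"
}

# Usage: a row with three regions pairs $2/$5, $3/$6 and $4/$7.
printf '1 23550;23570;23590 23640;23689;23652 +\n' | show_fields
```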
Ed Morton
0

Try this Perl solution:

 perl -lane ' if( /;/ and /(\S{2,})\s+(\S{2,})/ ) { 
    $i=0;@x=split(";",$1);@y=split(";",$2); while($i++<scalar(@x)) 
     { print join(" ",$F[0],$x[$i-1],$y[$i-1],$F[-1]) }} else { print } ' emilio.txt| column -t

With the given input:

$ cat emilio.txt
1   12349   12398   +
1   23523   23578   -
1   23550;23570;23590   23640;23689;23652   +
1   43533   43569   +
1   56021;56078   56099;56155   +

$ perl -lane ' if( /;/ and /(\S{2,})\s+(\S{2,})/ ) { 
  $i=0;@x=split(";",$1);@y=split(";",$2); while($i++<scalar(@x)) 
   { print join(" ",$F[0],$x[$i-1],$y[$i-1],$F[-1]) }} else { print } ' emilio.txt| column -t
1  12349  12398  +
1  23523  23578  -
1  23550  23640  +
1  23570  23689  +
1  23590  23652  +
1  43533  43569  +
1  56021  56099  +
1  56078  56155  +
stack0114106