-1

I have gff file, the contents are like the following (tab separated):

# start gene 1Chr.g1
1Chr    AUGUSTUS        gene    3636    5916    0.1     +       .       ID=1Chr.g1
1Chr    AUGUSTUS        transcript      3636    5916    0.1     +       .       ID=1Chr.g1.t1;Parent=1Chr.g1
1Chr    AUGUSTUS        transcription_start_site        3636    3636    .       +       .       Parent=1Chr.g1.t1
1Chr    AUGUSTUS        exon    3636    3913    .       +       .       Parent=1Chr.g1.t1
1Chr    AUGUSTUS        start_codon     3760    3762    .       +       0       Parent=1Chr.g1.t1
1Chr    AUGUSTUS        intron  3914    3995    1       +       .       
1Chr    AUGUSTUS        CDS     3760    3913    1       +       0       ID=1Chr.g1.t1.cds;Parent=1Chr.g1.t1
1Chr    AUGUSTUS        stop_codon      5628    5630    .       +       0       Parent=1Chr.g1.t1
1Chr    AUGUSTUS        transcription_end_site  5916    5916    .       +       .       Parent=1Chr.g1.t1
# start gene 1Chr.g2
1Chr    AUGUSTUS        gene    5938    8761    0.17    -       .       ID=1Chr.g2
1Chr    AUGUSTUS        transcript      5938    8761    0.17    -       .       ID=1Chr.g2.t1;Parent=1Chr.g2
1Chr    AUGUSTUS        transcription_end_site  5938    5938    .       -       .       Parent=1Chr.g2.t1
1Chr    AUGUSTUS        exon    5938    6594    .       -       .       Parent=1Chr.g2.t1
1Chr    AUGUSTUS        stop_codon      6428    6430    .       -       0       Parent=1Chr.g2.t1
1Chr    AUGUSTUS        intron  6595    7156    0.8     -       .       Parent=1Chr.g2.t1
1Chr    AUGUSTUS        CDS     6428    6594    0.89    -       2       ID=1Chr.g2.t1.cds;Parent=1Chr.g2.t1
# start gene 2Chr.g1
2Chr    AUGUSTUS        gene    11612   13481   0.09    -       .       ID=2Chr.g1
2Chr    AUGUSTUS        transcript      11612   13481   0.09    -       .       ID=2Chr.g1.t1;Parent=2Chr.g1
2Chr    AUGUSTUS        transcription_end_site  11612   11612   .       -       .       Parent=2Chr.g1.t1
2Chr    AUGUSTUS        exon    11612   13481   .       -       .       Parent=2Chr.g1.t1
2Chr    AUGUSTUS        stop_codon      11864   11866   .       -       0       Parent=2Chr.g1.t1
2Chr    AUGUSTUS        CDS     11864   12940   1       -       0       ID=2Chr.g1.t1.cds;Parent=2Chr.g1.t1
2Chr    AUGUSTUS        start_codon     12938   12940   .       -       0       Parent=2Chr.g1.t1
2Chr    AUGUSTUS        transcription_start_site        13481   13481   .       -       .       Parent=2Chr.g1.t1
# start gene 2Chr.g2
2Chr    AUGUSTUS        gene    22876   31223   0.04    +       .       ID=2Chr.g2
2Chr    AUGUSTUS        transcript      22876   31223   0.04    +       .       ID=2Chr.g2.t1;Parent=2Chr.g2
2Chr    AUGUSTUS        transcription_start_site        22876   22876   .       +       .       Parent=2Chr.g2.t1
2Chr    AUGUSTUS        exon    22876   23456   .       +       .       Parent=2Chr.g2.t1
2Chr    AUGUSTUS        exon    23515   24451   .       +       .       Parent=2Chr.g2.t1
2Chr    AUGUSTUS        start_codon     23519   23521   .       +       0       Parent=2Chr.g2.t1

I want to replace the IDs of the genes which are 1Chr.g1, 1Chr.g2, 2Chr.g1, and 2Chr.g2 to just in sequence like start from g1 to end of the IDs like in this case g4.

Expected Output

# start gene g1
1Chr    AUGUSTUS        gene    3636    5916    0.1     +       .       ID=g1
1Chr    AUGUSTUS        transcript      3636    5916    0.1     +       .       ID=g1.t1;Parent=g1
1Chr    AUGUSTUS        transcription_start_site        3636    3636    .       +       .       Parent=g1.t1
1Chr    AUGUSTUS        exon    3636    3913    .       +       .       Parent=g1.t1
1Chr    AUGUSTUS        start_codon     3760    3762    .       +       0       Parent=g1.t1
1Chr    AUGUSTUS        intron  3914    3995    1       +       .
1Chr    AUGUSTUS        CDS     3760    3913    1       +       0       ID=g1.t1.cds;Parent=g1.t1
1Chr    AUGUSTUS        stop_codon      5628    5630    .       +       0       Parent=g1.t1
1Chr    AUGUSTUS        transcription_end_site  5916    5916    .       +       .       Parent=g1.t1
# start gene g2
1Chr    AUGUSTUS        gene    5938    8761    0.17    -       .       ID=g2
1Chr    AUGUSTUS        transcript      5938    8761    0.17    -       .       ID=g2.t1;Parent=g2
1Chr    AUGUSTUS        transcription_end_site  5938    5938    .       -       .       Parent=g2.t1
1Chr    AUGUSTUS        exon    5938    6594    .       -       .       Parent=g2.t1
1Chr    AUGUSTUS        stop_codon      6428    6430    .       -       0       Parent=g2.t1
1Chr    AUGUSTUS        intron  6595    7156    0.8     -       .       Parent=g2.t1
1Chr    AUGUSTUS        CDS     6428    6594    0.89    -       2       ID=g2.t1.cds;Parent=g2.t1
# start gene g3
2Chr    AUGUSTUS        gene    11612   13481   0.09    -       .       ID=g3
2Chr    AUGUSTUS        transcript      11612   13481   0.09    -       .       ID=g3.t1;Parent=g3
2Chr    AUGUSTUS        transcription_end_site  11612   11612   .       -       .       Parent=g3.t1
2Chr    AUGUSTUS        exon    11612   13481   .       -       .       Parent=g3.t1
2Chr    AUGUSTUS        stop_codon      11864   11866   .       -       0       Parent=g3.t1
2Chr    AUGUSTUS        CDS     11864   12940   1       -       0       ID=g3.t1.cds;Parent=g3.t1
2Chr    AUGUSTUS        start_codon     12938   12940   .       -       0       Parent=g3.t1
2Chr    AUGUSTUS        transcription_start_site        13481   13481   .       -       .       Parent=g3.t1
# start gene g4
2Chr    AUGUSTUS        gene    22876   31223   0.04    +       .       ID=g4
2Chr    AUGUSTUS        transcript      22876   31223   0.04    +       .       ID=g4.t1;Parent=g4
2Chr    AUGUSTUS        transcription_start_site        22876   22876   .       +       .       Parent=g4.t1
2Chr    AUGUSTUS        exon    22876   23456   .       +       .       Parent=g4.t1
2Chr    AUGUSTUS        exon    23515   24451   .       +       .       Parent=g4.t1
2Chr    AUGUSTUS        start_codon     23519   23521   .       +       0       Parent=g4.t1

I wrote the following bash script, but it took too long, as I tried to count its time, so for one sed it took 1 second, and if there are 28000 iterations it will take about 8 hours, which is too much time. Is there any efficient way to do this?

awk '$3 == "gene"' $1 |cut -f9 |grep -o "=.*" |sed -e 's/=//g' >LIST.txt


COUNTER=0
cat LIST.txt | while read line; do
        COUNTER=$(expr $COUNTER + 1)
        echo "sed -i 's/$line/g$COUNTER/g' $1" |bash
done
rm LIST.txt

Another thing, generate a file sedTG45 which is very annoying.

Mendel
  • 65
  • 6
  • 3
    Can you show a few lines of expected output? – Dave Pritlove Dec 27 '22 at 12:32
  • 2
    See [Bash `while read` loop extremely slow compared to `cat`, why?](https://stackoverflow.com/questions/13762625/bash-while-read-loop-extremely-slow-compared-to-cat-why) and perhaps https://stackoverflow.com/questions/65538947/counting-lines-or-enumerating-line-numbers-so-i-can-loop-over-them-why-is-this - the simple fix is probably to refactor this to a single Awk script. – tripleee Dec 27 '22 at 12:53
  • Agree with @tripleee about doing this in a single awk script. However you could also optimize this by building a sed script of all those sed commands and then running it in one go, or by remembering the original line numbers and addressing those sed commands only to the relevant line. – Verpous Dec 27 '22 at 13:35
  • Also you will need to use double quotes around the sed command for variable expansion to take place. – Verpous Dec 27 '22 at 13:37
  • @Verpous How I can do this using `awk`? – Mendel Dec 27 '22 at 13:41
  • 2
    @Mendel Awk has functions like [gsub](https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions) which mirror sed's `s` command, it has variables so you can have your counter in there, it's a complete language built exactly for this sort of thing. You don't need to know awk well to figure this one out. If you have a familiarity with C-like languages, go over a basic awk tutorial, and keep the manual close by, that should be enough. – Verpous Dec 27 '22 at 13:52
  • 1
    `sed` itself is not slow. Creating a child process which runs `sed` for **each** input line is slow. – user1934428 Dec 27 '22 at 14:07
  • 1
    Aside from this, I suggest that you paste your code into [shellcheck](https://www.shellcheck.net/) to catch the other errors. – user1934428 Dec 27 '22 at 14:08
  • 2
    Reduce your sample input to 5 or 6 lines and add your expected output so we can help you. – Ed Morton Dec 27 '22 at 15:10
  • See [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) for why your script is so slow (and why it could corrupt your data). Also see https://porkmail.org/era/unix/award, https://stackoverflow.com/questions/673055/correct-bash-and-shell-script-variable-capitalization, and https://mywiki.wooledge.org/Quotes for other issues, some of which http://shellcheck.net would tell you about. – Ed Morton Dec 27 '22 at 15:12
  • Ok I have edit the question accorringly. – Mendel Dec 27 '22 at 15:38
  • Your sample input/output is still too big, you're supposed to come up with a [mcve] to make it easy for us to help you, we shouldn't need scroll bars to read it. I'm sure you could come up with an example that's less than about 10 lines and 20 chars per line that demonstrates your needs and if you do that then more people will be willing/able to help you as they won't have to invest as much time into trying to understand your example. Don't assume any of us know what a `gff` file is - it's all just strings in rows and columns. – Ed Morton Dec 27 '22 at 17:40
  • Please [edit] your question to tell us what separates your fields - tabs or blanks or something else? – Ed Morton Dec 27 '22 at 17:53

3 Answers3

0

You are running sed on the same file up to 28,000 times. With a bit of preprocessing, it's not hard to get that down to once.

This is untested, but should at the very least point you in the general direction of Awk. This is a small language; you can learn it in less than an hour, and get quite good at it in a few weeks.

awk -F '\t' '$3 == "gene" {
    g=$9; sub(/^[^=]*=/, "", g); gsub(/=/, "", g);
    a[g] = "g" ++n }
  { for(k in a) gsub(k, a[k]) }1' "$1"

In very brief n maintains the counter and a is an associative array of all the genes we have plucked out.

This assumes that the definition precedes the occurrences which should be replaced. If that's not a valid assumption, you'll need two iterations over the file, but the first will be read-only, and so this should still be significantly faster than your brute-force attempt.


Addendum: If you were hellbent on sticking to your (updated) code, moving the pipe to Bash (or maybe just sed) after done would improve it significantly. Here's a light refactoring. It would still be prettier in Awk.

# Use lower case for private variable
counter=0

# Note quotes around $1
# and use of pipe instead of temp file
awk '$3 == "gene"' "$1" |
cut -f9 |
grep -o "=.*" |
sed -e 's/=//g' |
# note IFS='' and read -r
while IFS='' read -r line; do
    # avoid paleolithic expr
    ((counter++))
    echo "s/$line/g$COUNTER/g"
done |
# pipe output to sed -i
sed -i -f - "$1"

Not all seds allow -f -. There will inevitably be a temporary file while sed -i runs but it gets deleted when it's done.

Demo, with timings: https://ideone.com/GghqiW

tripleee
  • 175,061
  • 34
  • 275
  • 318
  • Thanks for your suggestion, but it is as slow as the loop. – Mendel Dec 27 '22 at 15:26
  • How are you testing that? Your loop reads and writes the input file thousands of times if you have thousands of genes you want to replace; this only reads it once. – tripleee Dec 27 '22 at 15:36
  • I am running this as follows: `awk -F '\t' '$3 == "gene" { g=$9; sub(/^[^=]*=/, "", g); gsub(/=/, "", g); a[g] = "g" ++n } { for(k in a) gsub(k, a[k]) }1' foo.gff` – Mendel Dec 27 '22 at 15:40
  • Just once? And how long is that taking? Per line? – tripleee Dec 27 '22 at 15:45
  • it is taking the following time for the above data `real 0m0.002s` and loop taking `real 0m0.043s` – Mendel Dec 27 '22 at 15:51
  • If you run it on large file it is too much slow – Mendel Dec 27 '22 at 15:52
  • 1
    So a 20x speedup is "just as slow", really? Anyway, perhaps see updated answer now. There is no way to completely remove the overhead of I/O on a large input file; but perhaps my refactoring of your `sed` code is actually faster than Awk now. – tripleee Dec 27 '22 at 16:20
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/250685/discussion-between-mendel-and-tripleee). – Mendel Dec 27 '22 at 16:48
0

Generating sed commands first might help. The inputfile will be read twice, but that should not be a problem. Please check the performance.
The sed commands you want look like

s/1Chr.g1.t1/g1/g
s/1Chr.g2.t1/g2/g
s/2Chr.g1.t1/g3/g
s/2Chr.g2.t1/g4/g

When numbers grow above 9, these commands can be wrong, so change it a little:

awk '
  ($0 != last) {
    n++;
    printf("s/([^0-9])%s([^0-9]|$)/\\1g%s\\2/g\n", $0, n)
  }
  {
    last=$0
  }' <(grep -Eo "[0-9]+Chr.g[0-9]+" < file)

Returns

s/([^0-9])1Chr.g1([^0-9]|$)/\1g1\2/g
s/([^0-9])1Chr.g2([^0-9]|$)/\1g2\2/g
s/([^0-9])2Chr.g1([^0-9]|$)/\1g3\2/g
s/([^0-9])2Chr.g2([^0-9]|$)/\1g4\2/g

Now use these commands in sed (replace file with something like $1 on 2 places, add -i to sed when it looks ok):

sed -r -f <(
    awk '
      ($0 != last) {
        n++;
        printf("s/([^0-9])%s([^0-9]|$)/\\1g%s\\2/g\n", $0, n)
      }
      {
        last=$0
      }' <(grep -Eo "[0-9]+Chr.g[0-9]+" < file)
  ) file
Walter A
  • 19,067
  • 2
  • 23
  • 43
0

sed 's/\([= ]\)[0-9]\+Chr\.\(g[0-9]\+\)/\1\2/g' file.gff

if you dare sed -i ,
I would recommend making a copy first.

Also note: will only work with digit based chromosomes so no X,Y, or mt or I, II,III, IV ... etc.

tomc
  • 1,146
  • 6
  • 10