
From raw sequencing data I created a count file (.txt) with the counts of unique sequences per sample. The data looks like this:

sequence    seqLength S1   S2   S3   S4   S5   S6   S7   S8
AAAAA...    46        0    1    1    8    1    0    1    5
AAAAA...    46        50   1    5    0    2    0    4    0
...
TTTTT...    71        0    0    5    7    5    47   2    2
TTTTT...    81        5    4    1    0    7    0    1    1

I would like to filter the sequences by row sum, so that rows whose total across all samples (the sum of S1 to S8) is lower than some threshold, for example 100, are removed.

This can probably be done with awk, but I have no experience with this text-processing utility. Can anyone help?

Aquilifer
  • sorry, but you are expected to show what you have tried.. you can check out https://stackoverflow.com/tags/awk/info for learning resources and other helpful links.. – Sundeep Apr 09 '18 at 10:14
  • Please also add the expected output to your post, in CODE TAGS, and let us know once you have. – RavinderSingh13 Apr 09 '18 at 10:15
  • https://stackoverflow.com/questions/27104990/sum-of-all-rows-of-all-columns-bash has something you could start with.. it shows how to use loop to get sum of values.. you can change starting column from `1` to what you need... – Sundeep Apr 09 '18 at 10:19
  • @Sundeep Sorry, this is my first time posting on stackoverflow. Next time I will keep this in mind. I really tried to do this myself but I kept on receiving errors. – Aquilifer Apr 09 '18 at 10:46
  • @RavinderSingh13 What do you mean by CODE TAGS? As in the format of output? – Aquilifer Apr 09 '18 at 10:48
  • @Aquilifer, please refer this post https://meta.stackoverflow.com/questions/251361/how-do-i-format-my-code-blocks – RavinderSingh13 Apr 09 '18 at 10:53
  • @Aquilifer `but I kept on receiving errors` .. in that case, just add some of the code you tried to question.. SO is all about getting help with what you've tried :) – Sundeep Apr 09 '18 at 11:04

2 Answers


Give this a try:

awk 'NR>1 {sum=0; for (i=3; i<=NF; i++) { sum+= $i } if (sum > 100) print}' file.txt

It skips line 1 (the header, so the header is not part of the output) with NR>1, then sums the items in each row starting from field 3 (S1 to S8 in your example):

{sum=0; for (i=3; i<=NF; i++) { sum+= $i }

Then it prints only the rows whose sum is greater than 100:

if (sum > 100) print}'

You can adjust the condition on the sum (for example, sum >= 100 to also keep rows that total exactly 100), but I hope this gives you an idea of how to do it with awk.
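If you also want to keep the header and make the cutoff easy to change, something along these lines might work. This is only a sketch: the file names counts_demo.txt and filtered_demo.txt, the sample rows, and the variable name min are made up for illustration, and it assumes whitespace-separated columns with the counts starting at field 3.

```shell
# Hypothetical sample input mirroring the layout in the question.
cat > counts_demo.txt <<'EOF'
sequence seqLength S1 S2 S3 S4 S5 S6 S7 S8
AAAAA 46 0 1 1 8 1 0 1 5
CCCCC 46 50 1 5 0 2 0 44 0
GGGGG 71 0 0 5 7 5 80 2 1
TTTTT 81 5 4 1 0 7 0 1 1
EOF

# min is an assumed variable name: rows whose S1..S8 total is below it are dropped.
awk -v min=100 '
    NR == 1 { print; next }                  # pass the header line through unchanged
    {
        sum = 0
        for (i = 3; i <= NF; i++) sum += $i  # fields 3..NF hold the per-sample counts
        if (sum >= min) print                # keep rows totalling min or more
    }
' counts_demo.txt > filtered_demo.txt
```

With this sample, only the CCCCC row (total 102) and the GGGGG row (total 100) survive, along with the header. Change min=100 to whatever threshold you need.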

nbari
  • Thanks for your answer! I just ran the code and it works. Can I save the result to another file by using this code? awk 'NR>1 {sum=0; for (i=3; i<=NF; i++) { sum+= $i } if (sum > 100) print}' file.txt | new_file.txt – Aquilifer Apr 09 '18 at 10:44
  • @Aquilifer nice that it worked :-), you could save the output into another file by just using `> new_file.txt` – nbari Apr 09 '18 at 10:48

The following awk may help. It resets the sum for each row and writes every row whose sum is greater than 100 to out_file.

awk 'FNR>1{sum=0;for(i=3;i<=NF;i++){sum+=$i};if(sum>100){print > "out_file"}}' Input_file

In case you need a separate output file for each matching row (out_file1, out_file2, and so on), the following may help.

awk 'FNR>1{sum=0;for(i=3;i<=NF;i++){sum+=$i};if(sum>100){out="out_file" (++n);print > out;close(out)}}' Input_file
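If "different output files" instead means keeping the removed rows around for inspection, a single pass can write both sets. This is a rough sketch under assumptions: the file names counts_split_demo.txt, kept_demo.txt and dropped_demo.txt and the sample rows are invented for illustration, and the cutoff is the same sum > 100 test used above.

```shell
# Hypothetical sample input in the same layout as the question.
cat > counts_split_demo.txt <<'EOF'
sequence seqLength S1 S2 S3 S4 S5 S6 S7 S8
AAAAA 46 0 1 1 8 1 0 1 5
CCCCC 46 50 1 5 0 2 0 44 0
TTTTT 81 60 40 1 0 7 0 1 1
EOF

awk '
    # Copy the header into both output files.
    FNR == 1 { print > "kept_demo.txt"; print > "dropped_demo.txt"; next }
    {
        sum = 0
        for (i = 3; i <= NF; i++) sum += $i   # total of the sample columns
        # Rows above the cutoff go to one file, everything else to the other.
        out = (sum > 100) ? "kept_demo.txt" : "dropped_demo.txt"
        print > out
    }
' counts_split_demo.txt
```

Here the CCCCC and TTTTT rows (totals 102 and 110) end up in kept_demo.txt and the AAAAA row (total 17) in dropped_demo.txt.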
RavinderSingh13