Linux: filter text rows by the sum of specific columns

Question

From raw sequencing data I created a count file (.txt) with the counts of unique sequences per sample. The data looks like this:

sequence    seqLength S1   S2   S3   S4   S5   S6   S7   S8
AAAAA...    46        0    1    1    8    1    0    1    5
AAAAA...    46        50   1    5    0    2    0    4    0
...
TTTTT...    71        0    0    5    7    5    47   2    2
TTTTT...    81        5    4    1    0    7    0    1    1

I would like to filter the sequences by row sum, so that rows whose total over all samples (the sum of S1 to S8) is lower than, for example, 100 are removed.

This can probably be done with awk, but I have no experience with this text-processing utility. Can anyone help?

sorry, but you are expected to show what you have tried.. you can check out https://stackoverflow.com/tags/awk/info for learning resources and other helpful links.. — Sundeep, Apr 09 '18 at 10:14
Please also add the expected output to your post in CODE TAGS and let us know. — RavinderSingh13, Apr 09 '18 at 10:15
https://stackoverflow.com/questions/27104990/sum-of-all-rows-of-all-columns-bash has something you could start with.. it shows how to use loop to get sum of values.. you can change starting column from `1` to what you need... — Sundeep, Apr 09 '18 at 10:19
@Sundeep Sorry, this is my first time posting on stackoverflow. Next time I will keep this in mind. I really tried to do this myself but I kept on receiving errors. — Aquilifer, Apr 09 '18 at 10:46
@RavinderSingh13 What do you mean by CODE TAGS? As in the format of output? — Aquilifer, Apr 09 '18 at 10:48
@Aquilifer, please refer to this post https://meta.stackoverflow.com/questions/251361/how-do-i-format-my-code-blocks — RavinderSingh13, Apr 09 '18 at 10:53
@Aquilifer `but I kept on receiving errors` .. in that case, just add some of the code you tried to question.. SO is all about getting help with what you've tried :) — Sundeep, Apr 09 '18 at 11:04

nbari · Accepted Answer · 2018-04-09T10:30:02.563

score 3

Give this a try:

awk 'NR>1 {sum=0; for (i=3; i<=NF; i++) { sum+= $i } if (sum > 100) print}' file.txt

`NR>1` skips line 1 (the header). For every other row, the loop sums the fields starting from field 3 (S1 to S8 in your example):

{sum=0; for (i=3; i<=NF; i++) { sum+= $i }

Finally, only rows whose sum is greater than 100 are printed: `if (sum > 100) print}'`

You can adjust the condition on the sum as needed (for example `sum >= 100` if rows totalling exactly 100 should be kept). I hope this gives you an idea of how to do it with awk.
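As a rough sketch of how the same idea could be extended, the snippet below keeps the header line and takes the threshold as a parameter. The file name `counts_demo.txt`, the function name `filter_by_sum`, and the sample rows are made up for illustration. It keeps rows with a sum of 100 or more, assuming that "lower than 100 are removed" means exactly that.

```shell
# Hypothetical sample data mirroring the layout in the question.
cat > counts_demo.txt <<'EOF'
sequence seqLength S1 S2 S3 S4 S5 S6 S7 S8
AAAAA 46 0 1 1 8 1 0 1 5
CCCCC 46 50 1 5 0 2 0 4 40
TTTTT 71 0 0 5 7 5 47 2 34
EOF

# filter_by_sum MIN FILE (hypothetical helper):
# print the header plus every row whose columns 3..NF sum to at least MIN.
filter_by_sum() {
    awk -v min="$1" '
        NR == 1 { print; next }                  # pass the header through unchanged
        {
            sum = 0                              # reset the total for every row
            for (i = 3; i <= NF; i++) sum += $i  # add up S1..S8
            if (sum >= min) print                # keep rows at or above the threshold
        }
    ' "$2"
}

filter_by_sum 100 counts_demo.txt
```

With this sample the row sums are 17, 102 and 100, so the header and the CCCCC and TTTTT rows are printed. With the original `sum > 100` condition the TTTTT row would be dropped.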

edited Apr 09 '18 at 10:30

answered Apr 09 '18 at 10:24


Thanks for your answer! I just ran the code and it works. Can I save the result to another file by using this code? awk 'NR>1 {sum=0; for (i=3; i<=NF; i++) { sum+= $i } if (sum > 100) print}' file.txt | new_file.txt – Aquilifer Apr 09 '18 at 10:44
@Aquilifer nice that it worked :-), you could save the output into another file by just using `> new_file.txt` – nbari Apr 09 '18 at 10:48
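To illustrate the redirection mentioned in the comments, here is a small sketch. The file names `counts_redirect_demo.txt` and `new_file_demo.txt` and the sample rows are invented for the example. The point is that `>` sends awk's standard output to a file, whereas `|` would make the shell try to run `new_file.txt` as a command.

```shell
# Hypothetical sample data in the question's layout.
cat > counts_redirect_demo.txt <<'EOF'
sequence seqLength S1 S2 S3 S4 S5 S6 S7 S8
AAAAA 46 0 1 1 8 1 0 1 5
CCCCC 46 50 1 5 0 2 0 4 40
EOF

# ">" writes the filtered rows to a file, replacing any previous contents.
awk 'NR>1 {sum=0; for (i=3; i<=NF; i++) { sum+= $i } if (sum > 100) print}' \
    counts_redirect_demo.txt > new_file_demo.txt
```

Using `>>` instead of `>` would append to the file rather than overwrite it.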

score 1 · Answer 2 · answered Apr 09 '18 at 10:51

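The following awk command may also help. Note that it writes the sum of each qualifying row (not the row itself) to out_file; use a plain `print` instead of `print sum` to write the whole row.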

awk 'FNR>1{for(i=3;i<=NF;i++){sum+=$i};if(sum>100){print sum > "out_file"};sum=""}'   Input_file
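If the goal is to save the matching rows rather than their sums, a variant along these lines might be closer to what the question asks for. This is only a sketch: `counts_rows_demo.txt`, `rows_over_100.txt` and the sample rows are assumed names and data, not taken from the question.

```shell
# Hypothetical sample data in the question's layout.
cat > counts_rows_demo.txt <<'EOF'
sequence seqLength S1 S2 S3 S4 S5 S6 S7 S8
AAAAA 46 0 1 1 8 1 0 1 5
CCCCC 46 50 1 5 0 2 0 4 40
TTTTT 81 5 4 1 0 7 0 1 1
EOF

# awk only creates or truncates the output file on its first write,
# so remove any stale copy in case no row matches.
rm -f rows_over_100.txt

awk -v out="rows_over_100.txt" '
    FNR > 1 {                                # skip the header line
        sum = 0                              # reset the total for every row
        for (i = 3; i <= NF; i++) sum += $i  # add up S1..S8
        if (sum > 100) print > out           # plain print writes the whole row
    }
' counts_rows_demo.txt
```

Here only the CCCCC row (sum 102) ends up in `rows_over_100.txt`. Inside awk, `>` truncates the file once when it is first opened and then keeps appending for the rest of the run.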

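If you need a separate output file for each matching row, the following may help. It uses its own counter `n` for the file suffix, because `i` is reset by the loop on every row, and closes each file after writing to avoid running out of open files:

awk 'FNR>1{sum=0;for(i=3;i<=NF;i++){sum+=$i};if(sum>100){out="out_file" (++n);print sum > out;close(out)}}'  Input_file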
