split column and calculate mean for each block

Question

I have a file that looks like this (but contains 1000s of individuals):

ind1
0 -14980.8397530869 -15380.4887698560 589.9705014749 0.0001038673
1 -6117.4992483752 -6308.7155249846 2197953628.1638321877 0.0056515118
2 -5944.6996454388 -6135.7353966574 3342427102.6682262421 0.0022743340
3 -5919.1420308529 -6109.6495008350 3808372819.6077227592 0.0013537196
4 -5914.6730224383 -6104.8257104034 4004539990.0168108940 0.0010346189
5 -5913.8449682103 -6103.8235473922 4089253849.9270911217 0.0009059563
ind2
0 -14460.2922418646 -14773.0506815877 589.9705014749 0.0001038673
1 -5920.5367627770 -6029.4001343365 2138866766.8147277832 0.0051484663
2 -5763.8860434281 -5859.2556977093 3233581956.7551069260 0.0019994597
3 -5743.1443207950 -5832.6552230885 3670742051.8126020432 0.0011739290
4 -5740.0577242050 -5826.9514222357 3853293664.2254080772 0.0008832138
5 -5739.7465215368 -5825.4061952257 3932395083.8926229477 0.0007616630

How can I calculate the mean of columns 4 and 5 (independently) over the lines numbered 1 to 5, in a loop for each individual?

In short, I would like to obtain two mean values (one for column 4 and one for column 5) for each individual. Thanks in advance!

Possible duplicate of [How do I use floating-point division in bash?](https://stackoverflow.com/questions/12722095/how-do-i-use-floating-point-division-in-bash) — ceving, Mar 25 '19 at 14:40

karakfa · Answer 1 · 2019-03-25T17:50:40.433

4

awk to the rescue!

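$ awk 'function p() {if(c) printf "%s %.10f %.10f\n", h, s4/c, s5/c}  # print header and both means
       /^ind/       {p(); h=$1; c=s4=s5=0; next}   # header: flush previous block, reset
       $1~/^[1-5]$/ {c++; s4+=$4; s5+=$5}          # rows 1-5: count and accumulate
       END          {p()}' file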

will give

ind1 3488509478.0767364502 0.0022440281
ind2 3365775904.7000937462 0.0019933464

Explanation

The function p prints the header followed by the two computed means, formatted to ten decimal places; the if(c) guard suppresses output until at least one row has been counted. On a header line, the script prints the previous block's result, stores the new header, and resets the count and the sums for fields 4 and 5. When the first field is one of {1..5}, it increments the count and adds the field values to the corresponding sums.

A result line is therefore printed each time a new header is encountered, and once more at end of file.

If your headers do not all start with ind, replace only the /^ind/ pattern with one that fits your data. If a header never starts with one of the digits {0..5}, you can use !/^[0-5]/. If the header is a single word on its own line, you can test NF==1 instead. If it is certain to contain at least one letter, you can use /[a-zA-Z]/, assuming the a-z and A-Z ranges cover all the letters in your locale.
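As a rough sketch of the NF==1 variant, the script below treats any one-word line as a header and averages every row whose first field is greater than 0, so the number of rows per block does not matter. The sample names and values, the block_means function name, and the rule "skip row 0, keep everything else" are assumptions made for illustration, not part of the original data.

```shell
# Hypothetical helper: same structure as the answer's script, but with
# NF == 1 as the header test and a numeric test on the row label.
block_means() {
  awk 'function p() { if (c) printf "%s %.10f %.10f\n", h, s4/c, s5/c }
       NF == 1 { p(); h = $1; c = s4 = s5 = 0; next }  # one-word line: new header
       $1 > 0  { c++; s4 += $4; s5 += $5 }             # every row except row 0
       END     { p() }' "$@"
}

# Made-up input: arbitrary header names, a different row label range per block.
result=$(block_means <<'EOF'
sampleA
0 -1.0 -2.0 10 0.5
1 -1.0 -2.0 2 0.1
2 -1.0 -2.0 4 0.3
x9_longer_name
0 -1.0 -2.0 10 0.5
1 -1.0 -2.0 1 0.2
10 -1.0 -2.0 3 0.4
EOF
)
printf '%s\n' "$result"
```

With that made-up input it should print:

```
sampleA 3.0000000000 0.2000000000
x9_longer_name 2.0000000000 0.3000000000
```

Comparing the row label numerically avoids having to spell out a regular expression for every possible label, at the cost of assuming that every non-header line starts with a number.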

edited Mar 25 '19 at 17:50

answered Mar 25 '19 at 15:08

karakfa


Looks great, but I've actually simplified the names of my samples for the question; it's not just "ind" but different names... Is there a way to tell `awk` that it's a string of variable length with characters and/or numbers? Sorry for the confusion, I thought it was not necessary... – Sonia Olaechea Lázaro Mar 25 '19 at 15:11
If it's a single-digit value, there is a problem. You can make the pattern match based on what it is or what it is not (negation). – karakfa Mar 25 '19 at 15:13
I'm trying some of your solutions, but if instead of `/^ind/` I write `/a-zA-Z/` it just prints the mean for the last individual... If I write `!/[0-5]/` instead, it calculates something for all the individuals, but strangely it is not the mean... Where should I implement your `NF==1` solution? – Sonia Olaechea Lázaro Mar 25 '19 at 15:46
replace `/^ind/` with `NF==1` if the header is just one word (and nothing else is on the line). `!/[0-5]/` should be `!/^[0-5]/` again replacing `/^ind/` pattern only. Or, perhaps post a representative sample of your data. – karakfa Mar 25 '19 at 16:03
Solved! My problem was that I actually have 10 lines instead of 5 for each individual and the program was not recognising `!/^[0-10]` (because `10` has 2 digits)... Now I see that I cannot simplify that much my sample for the post! Sorry for the inconvenience and thanks for everything, it's an elegant solution :) – Sonia Olaechea Lázaro Mar 25 '19 at 16:16
You mean `/[A-Za-z]/`, without the square brackets you almost certainly never match a new individual (the regex would look for the literal string `a-zA-Z` so all your individuals would need to have that in their labels). – tripleee Mar 25 '19 at 16:56
Yes, of course. fixed. – karakfa Mar 25 '19 at 17:50
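
To see why `!/^[0-10]/` misfired in the comments above: inside brackets, `0-10` is not the number range 0 to 10 but the character range `0-1` plus a literal `0`, so it matches the same characters as `[01]`. The sketch below feeds a few row labels through the negated pattern and lists the ones that would be mistaken for headers. The helper names labels and one_line are made up for this example.

```shell
# Hypothetical helpers: emit some row labels, and join input lines with spaces.
labels()   { printf '%s\n' 0 1 2 5 9 10; }
one_line() { awk '{ out = out sep $0; sep = " " } END { print out }'; }

# [0-10] is the range 0-1 plus a literal 0, so rows starting with 2..9 slip through.
wrong=$(labels | awk '!/^[0-10]/' | one_line)

# [0-9] covers every leading digit, so no numbered row is treated as a header.
fixed=$(labels | awk '!/^[0-9]/' | one_line)

echo "misread as headers with [0-10]: ${wrong:-none}"
echo "misread as headers with [0-9]:  ${fixed:-none}"
```

This should report 2 5 9 for the first pattern and none for the second. Because the header test only looks at the first character, `!/^[0-9]/` is enough for any number of rows per block, provided no header starts with a digit.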
