How can I write a loop or function in R to calculate the percentage of points that fall outside (above) the pink line for each one-unit age interval (1-2, 2-3, 3-4, ..., 18-19)? For example, I want to count how many points in the age interval 1-2 have a higher value than the estimated pink curve for that interval, and then calculate the percentage (the number of points with a value higher than the estimated value, divided by the total number of observations in that interval). I need to do this for every one-unit age interval (1-2, 2-3, 3-4, 4-5, 5-6, 6-7, ..., 17-18, 18-19).

For example:

   Age     Value     estimated Value 
    1.5     12          12
    1.5     12          14
    1.7     13          15
    1.8     14          9 
    2.1     12          15
    2.2     14          16
    2.3     14          13
    3       8           8.1
    4       9           9.1
    4.1     5           6.1
    4.2     5           12
    5       14          15

The result should be something like:

Age:                          1-2    2-3    3-4    4-5
number of points *outside*    1      1
percentage                    1/4    1/3

My initial code handles a single interval, but I need to turn it into a loop or function that returns the results for all age intervals:

a=1
b=2
A<-subset(Data, Age>=a & Age<b)
sum(A$Value > A$EstimatedValue)/nrow(A)


shoo
  • 1
    You want to calculate `yi - f(xi)`. If `f` is vectorized, you don't need a loop for this. If you need more help, please provide a reproducible example. – Roland Nov 15 '18 at 14:52
  • 2
    Please give a reproducible example. I suspect that what you want is very, very easy, but it is hard to say just how without knowing the structure of your data, how you generated the curve, etc. Please read [How to make a great R reproducible example](https://stackoverflow.com/q/5963269/4996248) – John Coleman Nov 15 '18 at 14:52
  • 1
    (a) don't use a loop. (b) use whatever model you used to generate the pink line. It should have a `predict` method. Augment your data with the predictions and then do `sum(your_data$y_column > your_data$prediction_column)`. If you need more help than that, post a reproducible example with some sample data and the code for the model. – Gregor Thomas Nov 15 '18 at 14:52
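
A minimal sketch of the `predict` approach suggested in the comment above. The question does not show the real data or the model behind the pink line, so the simulated `Data` and the quadratic `lm` fit here are assumptions for illustration only; the column names `Age`, `Value`, and `EstimatedValue` follow the question's code.

```r
# Hypothetical data: the question does not include the real data set.
set.seed(1)
Data <- data.frame(Age = runif(200, 1, 19))
Data$Value <- 10 + 0.5 * Data$Age + rnorm(200)

# Assumed model standing in for whatever produced the pink line.
fit <- lm(Value ~ poly(Age, 2), data = Data)

# Augment the data with the fitted curve, then compare row by row.
Data$EstimatedValue <- predict(fit, newdata = Data)
Data$above <- Data$Value > Data$EstimatedValue

# Percentage of points above the curve per one-year interval [a, a + 1).
# The mean of a logical vector is the share of TRUE values.
pct_above <- 100 * tapply(Data$above, floor(Data$Age), mean)
pct_above
```

Because the comparison is vectorized and `tapply` handles the grouping, no explicit loop over intervals is needed.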

1 Answer

Using dplyr:

library(dplyr)
dd %>%
  mutate(age_bin = cut(Age, breaks = 0:20)) %>%
  group_by(age_bin) %>%
  summarize(n_points = n(),
            n_over_estimate = sum(Value > estimated_Value),
            pct_over_estimate = n_over_estimate / n_points * 100)
#   age_bin n_points n_over_estimate pct_over_estimate
#   <fct>      <int>           <int>             <dbl>
# 1 (1,2]          4               1                25
# 2 (2,3]          4               1                25
# 3 (3,4]          1               0                 0
# 4 (4,5]          3               0                 0

This uses the following sample data:

dd = read.table(text = "Age     Value     estimated_Value 
    1.5     12          12
    1.5     12          14
    1.7     13          15
    1.8     14          9 
    2.1     12          15
    2.2     14          16
    2.3     14          13
    3       8           8.1
    4       9           9.1
    4.1     5           6.1
    4.2     5           12
    5       14          15", header = TRUE)
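
One thing to be aware of: `cut()` defaults to right-closed intervals such as `(2,3]`, whereas the question's `subset(Data, Age >= a & Age < b)` uses left-closed intervals `[a, b)`. If the left-closed convention is what is wanted, passing `right = FALSE` should do it. The following is a minimal base R sketch of that variant, not a replacement for the dplyr code; the `res` data frame and the use of `tapply` are choices made for this illustration.

```r
# Sample data copied from the answer above.
dd <- read.table(text = "Age     Value     estimated_Value
    1.5     12          12
    1.5     12          14
    1.7     13          15
    1.8     14          9
    2.1     12          15
    2.2     14          16
    2.3     14          13
    3       8           8.1
    4       9           9.1
    4.1     5           6.1
    4.2     5           12
    5       14          15", header = TRUE)

# right = FALSE makes each bin closed on the left: [1,2), [2,3), ...
# droplevels() removes the age bins that contain no observations.
age_bin <- droplevels(cut(dd$Age, breaks = 0:20, right = FALSE))

# TRUE where the observed value lies above the estimated curve.
over <- dd$Value > dd$estimated_Value

res <- data.frame(
  age_bin         = levels(age_bin),
  n_points        = as.vector(tapply(over, age_bin, length)),
  n_over_estimate = as.vector(tapply(over, age_bin, sum))
)
res$pct_over_estimate <- 100 * res$n_over_estimate / res$n_points
res
```

With this sample data, the observation at age 3 moves from `(2,3]` into `[3,4)`, so the `[2,3)` bin holds 3 points with 1 above the estimate, which gives the 1/3 shown in the question's expected result.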
Gregor Thomas