1

This question aims to receive feedback to make a function more efficient. Apologies for a long, case specific post.

I created a function that calculates percentages in estimates of the American Community Survey (ACS). Because estimates in the ACS have margins of error, calculating percentages (e.g. % of the total population being below 17 years old) requires the recalculation of the error that results from dividing the estimate of both variables (population below 17 / total population).

So to calculate the new margin of error for a proportion calculated as p = estimate_a/estimate_b, the formula to use is MOE(p) = (1/estimate_b)*sqrt((MOE_b^2)-(p^2*MOE_a^2)). If the value inside of the square root was negative, then the substraction should be changed to a sum, with the formula becoming MOE(p) = (1/estimate_b)*sqrt((MOE_b^2)+(p^2*MOE_a^2)). If the result of p = estimate_a/estimate_b is 1, the documentation suggest calculating MOE using another formula: MOE(p) = MOE_a/estimate_b

To make these calculations, I created a function that takes a data frame with estimates and their MOEs, calculates the proportion between two specified variables, and writes two new columns in the original dataframe - one with the proportion, and another one with its margin of error. The function loops through the rows of the data frame carrying out if-else checks to determine what formula to apply, including skipping rows that might have NA values. The original data on which I apply this function is considerably long - ~250000 rows, and the structure of this function makes it go very slowly. Hence, the question is whether there are ways to improve the quality of this code to improve its speed. The function and dummy data are provided below:

percent_calculator <- function(DF, A_e, B_e, A_se, B_se, New_fn){

  # arguments legend >> DF = data frame; A_e = estimate_a (string of the fieldname); B_e = estimate_b (string of the fieldname); 
  # A_se = MOE_a (string of the fieldname); B_se = MOE_b (string of the fieldname); New_fn = root for new fieldname in the data frame (string)

  pb<- txtProgressBar(min = 0, max = nrow(DF), initial = 0) # progress bar initialization

  for (i in 1:nrow(DF)){ # for loop that iterates through the rows of the DF

    setTxtProgressBar(pb,i)

    if(is.na(DF[[A_e]][i])==FALSE & is.na(DF[[B_e]][i])==FALSE){ # check if any of the estimates used to calculate the proportion is NA (if so, skip)

      if (DF[[B_e]][i]!= 0){ # check if estimate_b is not 0, to avoid creating inf values from A_e/B-e

        DF[[paste0(New_fn, "_e")]][i] <- (DF[[A_e]][i]/DF[[B_e]][i])

        if(DF[[paste0(New_fn, "_e")]][i] == 1){ # check if P==1 to then use the appropiate formula for MOE

          DF[[paste0(New_fn, "_se")]][i] <- (DF[[A_se]][i]/DF[[B_e]][i])

        } else {

          if((DF[[A_se]][i]^2)-(DF[[paste0(New_fn, "_e")]][i]^2)*(DF[[B_se]][i]^2)>= 0){ # check for the sign of the value inside of the square root
            DF[[paste0(New_fn, "_se")]][i] <- (1/DF[[B_e]][i])*sqrt((DF[[A_se]][i]^2)-(DF[[paste0(New_fn, "_e")]][i]^2)*(DF[[B_se]][i]^2))
          } else {
            DF[[paste0(New_fn, "_se")]][i] <- (1/DF[[B_e]][i])*sqrt((DF[[A_se]][i]^2)+(DF[[paste0(New_fn, "_e")]][i]^2)*(DF[[B_se]][i]^2))
          }
        }
      } else { # assign 0 value if B_e was 0
        DF[[paste0(New_fn, "_e")]][i] <- 0
        DF[[paste0(New_fn, "_se")]][i] <- 0
      }


    } else { # assign NA if any of the estimates was NA
      DF[[paste0(New_fn, "_e")]][i] <- NA
      DF[[paste0(New_fn, "_se")]][i] <- NA
    }

    DF[[paste0(New_fn, "_e")]][i] <- DF[[paste0(New_fn, "_e")]][i]*100 # switch from proportion to percentage in the estimate value
    DF[[paste0(New_fn, "_se")]][i] <- DF[[paste0(New_fn, "_se")]][i]*100 # switch from proportion to percentage in the MOE value
  }
  return(DF)
}

Dummy <- structure(list(TotPop_e = c(636L, 1287L, 810L, 1218L, 2641L, 
835L, 653L, 1903L, 705L, 570L, 2150L, 6013L, 1720L, 2555L, 1150L, 
2224L, 1805L, 728L, 2098L, 3099L, 4194L, 1909L, 2401L, 1446L, 
1345L, 1573L, 2037L, 634L, 1916L, 1522L, 592L, 831L, 577L, 2196L, 
1482L, 1436L, 1668L, 3095L, 3677L, 2641L, 1285L, 932L, 2461L, 
1609L, 1143L, 1617L, 1075L, 1280L, 838L, 1447L, 3941L, 2402L, 
1130L, 851L, 10316L, 9576L, 2396L, 3484L, 5688L, 2200L, 1856L, 
1441L, 2539L, 3056L, 1325L, 2454L, 2010L, 2340L, 1448L, 2435L, 
2782L, 3633L, 1766L, 2564L, 1473L, 1214L, 1951L, 2561L, 4262L, 
2576L, 4257L, 2314L, 2071L, 3182L, 1839L, 2214L, 1101L, 1898L, 
790L, 867L, 1764L, 970L, 1320L, 2850L, 1019L, 1483L, 3720L, 2215L, 
3581L, 3391L), TotPop_se = c(132.522796352584, 149.544072948328, 
127.051671732523, 130.091185410334, 232.826747720365, 135.562310030395, 
100.303951367781, 176.29179331307, 114.285714285714, 96.6565349544073, 
339.817629179331, 438.297872340425, 245.592705167173, 324.012158054711, 
333.130699088146, 224.924012158055, 321.580547112462, 169.604863221885, 
175.075987841945, 469.908814589666, 375.075987841945, 411.550151975684, 
378.115501519757, 235.258358662614, 241.337386018237, 291.793313069909, 
337.386018237082, 138.601823708207, 145.896656534954, 193.920972644377, 
135.562310030395, 117.325227963526, 244.984802431611, 318.54103343465, 
207.90273556231, 200, 279.635258358663, 657.750759878419, 401.215805471125, 
401.823708206687, 229.787234042553, 139.817629179331, 303.951367781155, 
201.215805471125, 200, 252.887537993921, 356.838905775076, 241.945288753799, 
238.297872340426, 267.477203647416, 320.9726443769, 255.31914893617, 
178.115501519757, 116.109422492401, 891.793313069909, 766.565349544073, 
255.31914893617, 463.22188449848, 448.632218844985, 367.781155015198, 
269.300911854103, 261.398176291793, 286.93009118541, 446.808510638298, 
224.316109422492, 212.158054711246, 233.434650455927, 304.559270516717, 
356.231003039514, 275.379939209726, 330.699088145897, 368.996960486322, 
248.024316109423, 310.030395136778, 153.799392097264, 243.768996960486, 
265.65349544073, 337.386018237082, 436.474164133739, 359.270516717325, 
344.072948328268, 196.960486322188, 231.003039513678, 356.231003039514, 
212.158054711246, 348.328267477204, 206.079027355623, 240.729483282675, 
196.352583586626, 141.033434650456, 215.80547112462, 127.659574468085, 
248.024316109423, 589.057750759878, 231.61094224924, 486.93009118541, 
605.471124620061, 713.06990881459, 488.753799392097, 382.370820668693
), Under17_se = c(35.8095476596307, 50.9877853224243, 50.0994474845873, 
44.7376765786604, 113.994325548832, 59.7386237841673, 22.7862186188344, 
95.1285234870203, 42.3093316505904, 35.4621507988699, 143.021311606928, 
205.334390935311, 102.292167403598, 115.712493289527, 88.9617416652971, 
98.0345650964952, 149.50823698925, 40.0016629212452, 86.7428425216985, 
158.047696828218, 173.225615182675, 144.710221534209, 121.094774232467, 
76.9999466678128, 88.9160360898593, 97.7665610480423, 133.02517642826, 
30.4983051540691, 83.3625069421341, 75.7125713164268, 50.3826325227805, 
37.5622898620679, 7.29483282674772, 122.185425418875, 83.4644035953588, 
63.8384709681463, 99.5458131127046, 208.446825330589, 150.282359742524, 
206.017151858922, 87.7761872483956, 56.194023821941, 120.701992909334, 
50.6423479626955, 55.4225960853081, 93.2888100499867, 126.879946773287, 
143.069104861932, 86.7747884744339, 79.4517480028886, 140.260959630942, 
125.115775875384, 52.187662082273, 38.1819057688564, 365.828168907497, 
380.635956883794, 135.735302000757, 213.321896356121, 198.507936644685, 
126.535797699776, 141.516048792542, 114.238818548927, 117.737122860635, 
165.644292987747, 71.238834852709, 93.0825940979755, 41.8438489710712, 
97.0666682368976, 86.5060758100772, 92.8659724484427, 76.6536183156139, 
192.822109819002, 101.83958502542, 139.341067042001, 55.3992539361667, 
92.106793773051, 78.2330906844691, 115.177918141833, 207.546042154974, 
139.609995160777, 153.568552211039, 73.5738128652025, 112.249861520572, 
171.38868664475, 66.0687084216098, 181.939713349267, 28.4417934718288, 
90.1132509720827, 57.4202669424023, 46.8440239496863, 80.4799857926917, 
42.6875862955885, 81.3500156027725, 142.669475129055, 23.4653605661019, 
191.159072511375, 159.615857998832, 191.592580855392, 184.123292172321, 
125.375425911215), Under17_e = c(123, 284, 189, 228, 661, 180, 
49, 500, 121, 115, 686, 1456, 385, 578, 302, 476, 738, 124, 527, 
803, 1219, 459, 614, 218, 229, 422, 543, 69, 536, 306, 149, 80, 
0, 520, 281, 270, 454, 669, 905, 978, 282, 178, 630, 187, 145, 
367, 327, 577, 225, 246, 966, 629, 211, 65, 2857, 3051, 592, 
1162, 1322, 464, 490, 264, 576, 617, 326, 695, 169, 381, 309, 
476, 355, 915, 431, 869, 269, 358, 335, 650, 1443, 561, 900, 
411, 759, 1265, 171, 833, 45, 255, 134, 144, 339, 203, 388, 413, 
66, 416, 654, 565, 700, 362)), row.names = c(NA, 100L), class = "data.frame")

# example run to calculate pct of people below 17

Dummy <- percent_calculator(Dummy , "Under17_e", "TotPop_e", "Under17_se", "TotPop_se", "P_Bel17")

2 Answers2

3

You don't need that loop at all. All your operations are simple arithmetic that can take vectors, instead of single values. This is called vectorization. You then implement your logic tree with a nested ifelse. ifelse does compute all three possible outcomes (which is a bit unnecessary), but that is very much worth it in this case. If you want to optimize further have a look here: Is `if` faster than ifelse?

Timings at the bottom.

percent_calculator_vectorized <- function(DF, A_e, B_e, A_se, B_se, New_fn){
  # arguments legend >> DF = data frame; A_e = estimate_a (string of the fieldname); B_e = estimate_b (string of the fieldname); 
  # A_se = MOE_a (string of the fieldname); B_se = MOE_b (string of the fieldname); New_fn = root for new fieldname in the data frame (string)
  e_name <- paste0(New_fn, "_e")
  se_name <- paste0(New_fn, "_se")

  DF[[e_name]] <- DF[[A_e]] / DF[[B_e]]

  DF[[se_name]] <- ifelse(
    DF[[e_name]] == 1, # check if P==1 to then use the appropriate formula for MOE
    DF[[A_se]] / DF[[B_e]],
    ifelse(
      (DF[[A_se]]^2)-(DF[[e_name]]^2)*(DF[[B_se]]^2)>= 0, # check for the sign of the value inside of the square root
      (1/DF[[B_e]])*sqrt((DF[[A_se]]^2)-(DF[[e_name]]^2)*(DF[[B_se]]^2)),
      (1/DF[[B_e]])*sqrt((DF[[A_se]]^2)+(DF[[e_name]]^2)*(DF[[B_se]]^2))
    )
  )

  # assign 0 value if B_e was 0
  DF[DF[[B_e]] == 0 & !is.na(DF[[B_e]]), c(e_name, se_name)] <- 0
  # switch from proportion to percentage in the estimate value
  DF[, c(e_name, se_name)] <- DF[, c(e_name, se_name)] * 100 

  return(DF)
}


Dummy2 <- percent_calculator(Dummy , "Under17_e", "TotPop_e", "Under17_se", "TotPop_se", "P_Bel17")
Dummy3 <- percent_calculator_vectorized(Dummy , "Under17_e", "TotPop_e", "Under17_se", "TotPop_se", "P_Bel17")
all.equal(Dummy2, Dummy3) #TRUE

Timings:

bench::mark(
  orig = percent_calculator(Dummy , "Under17_e", "TotPop_e", "Under17_se", "TotPop_se", "P_Bel17"),
  vect = percent_calculator_vectorized(Dummy , "Under17_e", "TotPop_e", "Under17_se", "TotPop_se", "P_Bel17"),
)
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result    memory    time  gc       
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>    <list>    <lis> <list>   
1 orig         17.2ms   18.5ms      53.1   331.2KB     14.8    18     5      339ms <df[,6] … <df[,3] … <bch… <tibble …
2 vect        157.4µs    168µs    5700.     19.4KB     11.6  2450     5      430ms <df[,6] … <df[,3] … <bch… <tibble …

Speed up for this small dataset is ~100x, also with a ~10x smaller memory footprint.

Axeman
  • 32,068
  • 8
  • 81
  • 94
  • Looks promising! I am missing a check for NA values in A_e and B_e. Did you remove it because if either of them are NA, their division would return NA? – Pablo Herreros Cantis Jun 05 '20 at 18:34
  • Yes, R propagates NA's correctly so the division will give NA. – Axeman Jun 05 '20 at 18:37
  • 1
    Amazing. Thank you so much! – Pablo Herreros Cantis Jun 05 '20 at 18:38
  • There are some cases in which I am getting the following error - ```Error in `[<-.data.frame`(`*tmp*`, DF[[B_e]] == 0, e_name, value = 0) : missing values are not allowed in subscripted assignments of data frames In addition: Warning message: In sqrt((DF[[A_se]]^2) - (DF[[e_name]]^2) * (DF[[B_se]]^2)) : Error in `[<-.data.frame`(`*tmp*`, DF[[B_e]] == 0, e_name, value = 0) : missing values are not allowed in subscripted assignments of data frames ``` Could this be due to empty cells? – Pablo Herreros Cantis Jun 05 '20 at 19:31
  • 1
    I think I fixed it, can you check? – Axeman Jun 05 '20 at 19:39
1

I see you have a working solution but for my own reasons I wanted to try a tidyverse solution. This one is just about as fast as the base R solution and for me would be easier to maintain. I also added some more oddities to your toy data to make sure I caught the edge cases.

library(dplyr)

ACS_recalculator <- function(DF, A_e, B_e, A_se, B_se, New_fn){
  e_name <- paste0(New_fn, "_e")
  se_name <- paste0(New_fn, "_se")
  A_ex <- ensym(A_e)
  B_ex <- ensym(B_e)
  A_sex <- ensym(A_se)
  B_sex <- ensym(B_se)

  DF <-
    DF %>% 
    mutate(e_value = ifelse(!!B_ex != 0, !!A_ex / !!B_ex, 0),
           se_value = case_when(
             !!B_ex == 0 ~ 0,
             e_value == 1 ~ !!A_sex / !!B_ex,
             ((!!A_sex)^2) - (e_value^2 * ((!!B_sex)^2)) >= 0 ~ (1/!!B_ex) * sqrt(((!!A_sex)^2) - (e_value^2) * ((!!B_sex)^2)),
             ((!!A_sex)^2) - (e_value^2 * ((!!B_sex)^2)) < 0 ~ (1/!!B_ex) * sqrt(((!!A_sex)^2) + (e_value^2) * ((!!B_sex)^2)),
             TRUE ~ NA_real_),
           e_value = e_value * 100,
           se_value = se_value * 100) %>%
    rename(!!e_name := e_value,
           !!se_name := se_value)

  return(DF)
}

Dummy2 <- ACS_recalculator(Dummy2, "Under17_e", "TotPop_e", "Under17_se", "TotPop_se", "P_Bel17")
head(Dummy2)
#>   TotPop_e TotPop_se Under17_se Under17_e P_Bel17_e P_Bel17_se
#> 1      636  132.5228   35.80955       123  19.33962   3.932255
#> 2     1287  149.5441   50.98779       284  22.06682   3.020104
#> 3      810  127.0517   50.09945       189  23.33333   4.986043
#> 4     1218  130.0912   44.73768       228  18.71921   3.081212
#> 5     2641  232.8267  113.99433       661  25.02840   3.709747
#> 6      835  135.5623   59.73862       180  21.55689   6.239876

Your original example data with more missings and zeros

Dummy2 <- structure(list(TotPop_e = c(636L, 1287L, 810L, 1218L, 2641L, 
                                      835L, 653L, 1903L, 0L, 570L, 2150L, 6013L, 1720L, 2555L, 1150L, 
                                      2224L, 1805L, 728L, 2098L, 3099L, 4194L, 1909L, 2401L, 1446L, 
                                      1345L, 1573L, 2037L, 634L, 1916L, 1522L, 592L, 831L, 577L, 2196L, 
                                      1482L, 1436L, 1668L, 3095L, 3677L, 2641L, 1285L, 932L, 2461L, 
                                      1609L, 1143L, 1617L, 1075L, 1280L, 838L, 1447L, 3941L, 2402L, 
                                      1130L, 851L, 10316L, 9576L, 2396L, 3484L, 5688L, 2200L, 1856L, 
                                      1441L, 2539L, 3056L, 1325L, 2454L, 2010L, 2340L, 1448L, 2435L, 
                                      2782L, 3633L, 1766L, 2564L, 1473L, 1214L, 1951L, 2561L, 4262L, 
                                      2576L, 4257L, 2314L, 2071L, 3182L, 1839L, 2214L, NA, 1898L, 
                                      790L, 867L, 1764L, 970L, 1320L, 2850L, 1019L, 1483L, 3720L, 2215L, 
                                      3581L, 3391L), TotPop_se = c(132.522796352584, 149.544072948328, 
                                                                   127.051671732523, 130.091185410334, 232.826747720365, 135.562310030395, 
                                                                   100.303951367781, 176.29179331307, 114.285714285714, 0, 
                                                                   339.817629179331, 438.297872340425, 245.592705167173, 324.012158054711, 
                                                                   333.130699088146, 224.924012158055, 321.580547112462, 169.604863221885, 
                                                                   175.075987841945, 469.908814589666, 375.075987841945, 411.550151975684, 
                                                                   378.115501519757, 235.258358662614, 241.337386018237, 291.793313069909, 
                                                                   337.386018237082, 138.601823708207, 145.896656534954, 193.920972644377, 
                                                                   135.562310030395, 117.325227963526, 244.984802431611, 318.54103343465, 
                                                                   207.90273556231, 200, 279.635258358663, 657.750759878419, 401.215805471125, 
                                                                   401.823708206687, 229.787234042553, 139.817629179331, 303.951367781155, 
                                                                   201.215805471125, 200, 252.887537993921, 356.838905775076, 241.945288753799, 
                                                                   238.297872340426, 267.477203647416, 320.9726443769, 255.31914893617, 
                                                                   178.115501519757, 116.109422492401, NA, 766.565349544073, 
                                                                   255.31914893617, 463.22188449848, 448.632218844985, 367.781155015198, 
                                                                   269.300911854103, 261.398176291793, 286.93009118541, 446.808510638298, 
                                                                   224.316109422492, 212.158054711246, 233.434650455927, 304.559270516717, 
                                                                   356.231003039514, 275.379939209726, 330.699088145897, 368.996960486322, 
                                                                   248.024316109423, 310.030395136778, 153.799392097264, 243.768996960486, 
                                                                   265.65349544073, 337.386018237082, 436.474164133739, 359.270516717325, 
                                                                   344.072948328268, 196.960486322188, 231.003039513678, 356.231003039514, 
                                                                   212.158054711246, 348.328267477204, 206.079027355623, 240.729483282675, 
                                                                   196.352583586626, 141.033434650456, 215.80547112462, 127.659574468085, 
                                                                   248.024316109423, 589.057750759878, 231.61094224924, 486.93009118541, 
                                                                   605.471124620061, 713.06990881459, 488.753799392097, 382.370820668693
                                      ), Under17_se = c(35.8095476596307, 50.9877853224243, 50.0994474845873, 
                                                        44.7376765786604, 113.994325548832, 59.7386237841673, 22.7862186188344, 
                                                        95.1285234870203, 42.3093316505904, 35.4621507988699, 143.021311606928, 
                                                        205.334390935311, 102.292167403598, 115.712493289527, 88.9617416652971, 
                                                        98.0345650964952, 149.50823698925, 40.0016629212452, 86.7428425216985, 
                                                        158.047696828218, 173.225615182675, 144.710221534209, 121.094774232467, 
                                                        76.9999466678128, 88.9160360898593, 97.7665610480423, 133.02517642826, 
                                                        30.4983051540691, 83.3625069421341, 75.7125713164268, 50.3826325227805, 
                                                        37.5622898620679, 7.29483282674772, 122.185425418875, 83.4644035953588, 
                                                        63.8384709681463, 99.5458131127046, 208.446825330589, 150.282359742524, 
                                                        206.017151858922, 87.7761872483956, 56.194023821941, 120.701992909334, 
                                                        50.6423479626955, 55.4225960853081, 93.2888100499867, 126.879946773287, 
                                                        143.069104861932, 86.7747884744339, 79.4517480028886, 140.260959630942, 
                                                        125.115775875384, 52.187662082273, 38.1819057688564, 365.828168907497, 
                                                        380.635956883794, 135.735302000757, 213.321896356121, 198.507936644685, 
                                                        126.535797699776, 141.516048792542, 114.238818548927, 117.737122860635, 
                                                        165.644292987747, 71.238834852709, 93.0825940979755, 41.8438489710712, 
                                                        97.0666682368976, 86.5060758100772, 92.8659724484427, 76.6536183156139, 
                                                        192.822109819002, 101.83958502542, 139.341067042001, 55.3992539361667, 
                                                        92.106793773051, 78.2330906844691, 115.177918141833, 207.546042154974, 
                                                        139.609995160777, 153.568552211039, 73.5738128652025, 112.249861520572, 
                                                        171.38868664475, 66.0687084216098, 181.939713349267, 28.4417934718288, 
                                                        90.1132509720827, 57.4202669424023, 46.8440239496863, 80.4799857926917, 
                                                        42.6875862955885, 81.3500156027725, 142.669475129055, 23.4653605661019, 
                                                        191.159072511375, 159.615857998832, 191.592580855392, 184.123292172321, 
                                                        125.375425911215), Under17_e = c(123, 284, 189, 228, 661, 180, 
                                                                                         49, 500, 121, 115, 686, 1456, 385, 578, 302, 476, 738, 124, 527, 
                                                                                         803, 1219, 459, 614, 218, 229, 422, 543, 69, 536, 306, 149, 80, 
                                                                                         0, 520, 281, 270, 454, 669, 905, 978, 282, 178, 630, 187, 145, 
                                                                                         367, 327, 577, 225, 246, 966, 629, 211, 65, 2857, 3051, 592, 
                                                                                         1162, 1322, 464, 490, 264, 576, 617, 326, 695, 169, 381, 309, 
                                                                                         476, 355, 915, 431, 869, 269, 358, 335, 650, 1443, 561, 900, 
                                                                                         411, 759, 1265, 171, 833, 45, 255, 134, 144, 339, 203, 388, 413, 
                                                                                         66, 416, 654, 565, 700, 362)), row.names = c(NA, 100L), class = "data.frame")
Chuck P
  • 3,862
  • 3
  • 9
  • 20