0

I am extracting data for specific geographical areas and specific indicators from the public health agency in the UK using a package they developed for pulling data from their API called fingertipsR, and then inserting them in to an empty list, where the list consists of lists (geographies) which contain lists representing each indicator.

geog <- c("E38000220", "E38000046", "E38000144", "E38000191", "E38000210", 
"E38000038", "E38000164", "E38000195", "E38000078", "E38000139", 
"E38000166", "E38000211", "E38000147", "E38000183", "E38000028", 
"E38000053", "E38000126", "E38000153", "E38000173", "E38000175"
)
indicators <- c(241, 92588, 90672, 90692, 90697, 90698, 90701, 90702, 91238, 
90690, 90694, 93245, 93246, 93244, 93247, 93248, 93049, 93047, 
90700)

## install.packages("fingertipsR"); library(fingertipsR)
library(dplyr)

list <- list()

start <- Sys.time()
for (geog_group in geog) {
    for (indicator_number in indicators) {
    list[[geog_group]][[as.character(indicator_number)]] <- fingertips_data(IndicatorID = indicator_number, AreaTypeID = c(152, 153, 154)) %>% 
      filter(AreaCode == geog_group, TimeperiodSortable == max(TimeperiodSortable)) %>% 
      select(Timeperiod, Value) %>% distinct()
  }
}
end <- Sys.time()
end-start

On my work laptop, this takes around 15 minutes to execute - I'm wondering if there are any easy ways to optimise this code - possibly with lapply or purrr?

Edit: Ideally I want the indicators for each geographical area to be in one data frame, as they all share the same columns Time period and Value - I was going to deal with that after unlist() or something similar - but if anyone has ways to solve that inside the for loop I'm open to suggestions.

Nautica
  • 2,004
  • 1
  • 12
  • 35

1 Answers1

1

Here is a more condensed loop (taking roughly 25 sec.)

result_list <- list(length(indicators))
for (k in seq_along(indicators)) {
  ind   <- indicators[k]
  # load the data once per indicator
  tmpDF <- fingertips_data(IndicatorID = ind, AreaTypeID = 152:154)
  # retrieve the rows corresp. to max per geog
  out <- t(vapply(seq_along(geog), function (s) {
    row_geog <- which(.subset2(tmpDF, which(names(tmpDF) == 'AreaCode')) == geog[s])
    row_max <- which.max(.subset2(tmpDF, which(names(tmpDF) == 'TimeperiodSortable'))[row_geog])
    res <- tmpDF[row_geog,c("Timeperiod","Value")][row_max,]
    res <- c(Timeperiod = res$Timeperiod, Value = res$Value)
    if (length(res) == 0) res <- c(Timeperiod = NA_character_, Value = NA_character_)
    return (res)
  }, character(2)))
  # save result for indicator[k]
  result_list[[k]] <- data.frame(indicator = ind, geog, 
                                 Timeperiod = out[,1], 
                                 Value = as.numeric(out[,2]),
                                 stringsAsFactors = FALSE)  
}

I am not really familiar with fingertipsR but it seems to get the job done (correct me if I'm wrong), here is the first element of the result:

head(result_list[[1]])
# indicator      geog Timeperiod    Value
# 1       241 E38000220    2017/18 8.214912
# 2       241 E38000046    2017/18 7.907130
# 3       241 E38000144    2017/18 9.139239
# 4       241 E38000191    2017/18 8.891195
# 5       241 E38000210    2017/18 8.311592
# 6       241 E38000038    2017/18 6.653444

Changes

Here is what I changed from your version:

  • I wrote only one for loop (strictly speaking, there are still two loops in my version as I am using vapply) iterating over the indicators. The main reason wasn't even to have less looping (which in itself already is motivation enough) but to call the function fingertips_data as few times as possible: those function calls are very slow and do not depend on geog, only the subsetting does.
  • So for every indicator the function fingertips_data is called once and then using vapply the subsetting and finding the max is done
  • As a result, the output has a slightly different format but essentially it contains the same information
niko
  • 5,253
  • 1
  • 12
  • 32
  • This is great, thanks. My R skills still have a long way to go before I can write something like this off the top of my head. – Nautica Jan 02 '19 at 13:12
  • I'm curious about this part: `for (k in seq_along(indicators)) { ind <- indicators[k]...` seems to be same as `for (indicator_number in indicators){..`, but I'm guessing the former is usually more computationally efficient? – Nautica Jan 02 '19 at 14:51
  • 1
    @VY Force of habits haha In this case yours is actually a little faster and cleaner as well. I used `seq_along` as it often offers better performance, not in this case. I'd probably stick to your version (primarily for readability). – niko Jan 02 '19 at 15:53