speed up R apply on data frame

Question

I have a horse racing dataset. For each horse racing record, if the track value is not missing, I want to count the number of horse win in the past two years with the same venue, track, and similar distance. I use apply to loop each row， I only want to add a new column count previous win to the original data set, so the output should have one more column and same row length as the input given. But the speed is very slow. How can I speed up the loop?

rdate: horse racing year-month-date. venue: ST, HV. track: TURF, All WEATHER TRACK. distance: 1200, 1400, 1600, 1800, etc. ind_win: 0(horse did not win 1st place), 1(horse win 1st place).

structure(list(rdate = structure(c(17450, 17475, 17481, 17496, 
17510, 17517, 17532, 17566, 17593, 17615, 17629, 17657, 17667, 
17796, 17817, 17839, 17856, 17860, 17881, 17881, 17902, 17902
), class = "Date"), venue = c("HV", "ST", "ST", "ST", "ST", "ST", 
"ST", "ST", "ST", "ST", "ST", "ST", "HV", "ST", "ST", "ST", "HV", 
"ST", "ST", "ST", "ST", "ST"), track = c("TURF", "TURF", "TURF", 
"TURF", "TURF", "TURF", "TURF", "TURF", "TURF", "TURF", "TURF", 
"TURF", "TURF", "TURF", "TURF", "TURF", "TURF", "TURF", "TURF", 
"TURF", "TURF", "TURF"), horsenum = c("A366", "A366", "A366", 
"A366", "A366", "A366", "A366", "A366", "A366", "A366", "A366", 
"A366", "A366", "B440", "B440", "B440", "A366", "B440", "A366", 
"B440", "A366", "B440"), distance = c(1800L, 1800L, 1600L, 1600L, 
1800L, 1600L, 1800L, 1800L, 1800L, 1600L, 1800L, 2000L, 1800L, 
1200L, 1400L, 1400L, 1650L, 1400L, 1600L, 1400L, 1800L, 1400L
), ind_win = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L)), row.names = c(NA, -22L
), class = "data.frame")

library(tidyverse)
library(lubridate)

HWinCountF <- function(df){
    if (!is.na(df["track"])) {
      tmp <- subset(jc.data, horsenum == df["horsenum"] & rdate < df["rdate"] & rdate > ymd(df["rdate"]) - years(2) &
                      venue == df["venue"] & track==df["track"] &  distance>=as.integer(df["distance"])-200 &
                      distance<=as.integer(df["distance"])+200)
      if (nrow(tmp) > 0) {
        return(nrow(tmp[tmp$ind_win == 1,]))
      } else {
        return(NA)
      }
    } else {
      return(NA)
    }
  }

  jc.data['h_win_count'] <- apply(jc.data, 1, HWinCountF)

[1. If you are working with data.frames, forget there is a function called apply- whatever you do - *don't* use it.](https://stackoverflow.com/users/3001626/david-arenburg) — Parfait, Oct 04 '19 at 14:08
It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Oct 04 '19 at 14:19
Same as @Parfait: https://privefl.github.io/blog/why-i-rarely-use-apply/ — F. Privé, Oct 04 '19 at 15:09

Parfait · Accepted Answer · 2019-10-04T14:33:13.623

Per OP's needs

I want to count the number of horse win in the past two years with the same venue, track, and similar distance

As this is straighforward aggregation, avoid looping and consider merge with subset of data frame on itself since you appear to need to compare observations with each other. Then run aggregate for horse wins. Below runs with posted data sample.

# MERGE BY COMMON VARIABLES AND SUBSET RESULTS BY DATE AND DISTANCE
compare_df <- subset(merge(jc.data, jc.data, by=c("horsenum", "venue", "track")),
                     rdate.x < rdate.y &
                     rdate.x > lubridate::ymd(rdate.y) - lubridate::years(2) &
                     distance.x >= as.integer(distance.y) - 200 &
                     distance.x <= as.integer(distance.y) + 200 
              )

# SUM ind_win GROUPED BY COMMON VARIABLES
agg_df <- aggregate(cbind(h_win_count = ind_win.x) ~ horsenum + venue + track, 
                    data = compare_df, FUN=sum)

agg_df
#   horsenum venue track h_win_count
# 1     A366    HV  TURF           0
# 2     A366    ST  TURF           0
# 3     B440    ST  TURF           2

thank you for help, I run your code but I only want to add a new column count previous win count to the original data set, so the output should have the same variables and row length as the input given and one more column count — tiezhuetc, Oct 04 '19 at 14:56
Simply `merge` *agg_df* to original data frame for new column: `final_df <- merge(jc.data, agg_df, by=c("horsenum", "venue", "track"))` — Parfait, Oct 04 '19 at 15:35

speed up R apply on data frame

1 Answers1