Run a function on an N length subset of dataframe and store results into list

Question

I am using the depmixS4 package to train an HMM on my data. However, my dataset is not continuous: I have data for Saturdays and Sundays, back to back, but the number of observations in each week varies. For example, my dataset might look like this:

1   Saturday  Evening  16.2  235.84
2   Saturday  Evening  23.4  235.29
3   Saturday  Evening  29.4  232.79
4   Sunday    Evening  24.2  233.89
5   Sunday    Evening  24.2  233.66
6   Sunday    Evening  24.2  233.38
7   Sunday    Evening  24.2  232.99
8   Sunday    Evening  25.4  233.21
9   Sunday    Evening  26.8  232.37
10  Saturday  Night    25.6  231.55
11  Saturday  Night    24.4  231.19
12  Saturday  Night    24.4  231.63
13  Saturday  Night    24.4  231.71
14  Sunday    Night    25.2  231.23
15  Sunday    Night    25.2  231.23
16  Saturday  Night    25.2  231.23
17  Saturday  Night    25.2  231.23
18  Sunday    Night    25.2  231.23

df = structure(list(V2 = c("Saturday", "Saturday", "Saturday", "Sunday", 
"Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Saturday", 
"Saturday", "Saturday", "Saturday", "Sunday", "Sunday", "Saturday", 
"Saturday", "Sunday"), V3 = c("Evening", "Evening", "Evening", 
"Evening", "Evening", "Evening", "Evening", "Evening", "Evening", 
"Night", "Night", "Night", "Night", "Night", "Night", "Night", 
"Night", "Night"), V4 = c(16.2, 23.4, 29.4, 24.2, 24.2, 24.2, 
24.2, 25.4, 26.8, 25.6, 24.4, 24.4, 24.4, 25.2, 25.2, 25.2, 25.2, 
25.2), V5 = c(235.84, 235.29, 232.79, 233.89, 233.66, 233.38, 
232.99, 233.21, 232.37, 231.55, 231.19, 231.63, 231.71, 231.23, 
231.23, 231.23, 231.23, 231.23)), .Names = c("V2", "V3", "V4", 
"V5"), row.names = c(NA, -18L), class = "data.frame")

In the example, set 1 has 9 observations, set 2 has 6 observations, and set 3 has 3 observations. I already have a list containing the number of observations in each set, in order: [9, 6, 3]. Using a for loop, I want to use this list to take each subset of the data in turn, pass it to the depmix function, fit the model, and store the fitted model's log-likelihood in a list.

For example:

set.seed(1)
mod[i] <- depmix(list(V4~1, V5~1), data = dataset[i], nstates=10, family=list(gaussian(),  gaussian()))
fm[i] <- fit(mod[i])
append(resultList, fm[i])

# Where i is the loop index, and dataset[i] is the i-th subset, whose length N is the i-th element of the list (in the example, the list is [9, 6, 3])

I realize this is really two questions: one about subsetting a data frame using a list of lengths, and one about running the functions on each subset and storing the results in a list.

For your first, do you mean something like `split(df, rep(1:3, times=c(9,6,3)))`? — r2evans, Mar 30 '18 at 00:14
@r2evans Yes! That works. How can I use these splits for the for loop to run the second set of code i amount of times? — John Kim, Mar 30 '18 at 00:17
`lapply` or `sapply(..., simplify=FALSE)` are my preferred methods, but other ways exist depending on what else you need with it. You might find this helpful: https://stackoverflow.com/a/24376207/3358272 — r2evans, Mar 30 '18 at 00:24

score 0 · Answer 1 · answered Mar 30 '18 at 01:49

As r2evans suggested, I split the data frame like so:

split_data = split(df, rep(1:3, times=c(9,6,3)))

Then I used sapply with a function I wrote, so that the model is fitted once per subset:

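library(depmixS4)

# Fit a 10-state HMM with two Gaussian responses (the numeric columns V4 and V5) to one subset
runIndividualWeeks <- function(data_input){
    mod <- depmix(list(V4~1, V5~1), data = data_input, nstates=10, family=list(gaussian(), gaussian()))
    fit(mod, verbose = FALSE)   # fit the model built above and return it
}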

results <- sapply(split_data, runIndividualWeeks)
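The question asks for the log-likelihoods rather than the fitted models themselves. A minimal sketch of that last step follows, under some assumptions that are not in the original answer: `fitLogLik` is a hypothetical helper name, `nstates = 2` is chosen only for illustration, and the `tryCatch` is there because very small or constant subsets (the third set has identical values in every row) may not fit at all, in which case `NA` is stored instead.

```r
library(depmixS4)

# Same V4/V5 values as the question's data frame (other columns omitted).
df <- data.frame(
  V4 = c(16.2, 23.4, 29.4, 24.2, 24.2, 24.2, 24.2, 25.4, 26.8,
         25.6, 24.4, 24.4, 24.4, 25.2, 25.2, 25.2, 25.2, 25.2),
  V5 = c(235.84, 235.29, 232.79, 233.89, 233.66, 233.38, 232.99, 233.21, 232.37,
         231.55, 231.19, 231.63, 231.71, 231.23, 231.23, 231.23, 231.23, 231.23)
)

counts <- c(9, 6, 3)                            # observations per set, in order
grp <- rep(seq_along(counts), times = counts)   # group id for every row
split_data <- split(df, grp)

# Hypothetical helper: fit one subset and return its log-likelihood,
# or NA if the fit fails. nstates = 2 is an assumption for illustration.
fitLogLik <- function(data_input, nstates = 2) {
  tryCatch({
    mod <- depmix(list(V4 ~ 1, V5 ~ 1), data = data_input, nstates = nstates,
                  family = list(gaussian(), gaussian()))
    fm <- fit(mod, verbose = FALSE)
    as.numeric(logLik(fm))
  }, error = function(e) NA_real_)
}

set.seed(1)
logLikList <- lapply(split_data, fitLogLik)     # one entry per set
```

Using `lapply` keeps the result a plain list named after the groups ("1", "2", "3"), so `logLikList[["2"]]` would be the value for the second set.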

score 0 · Answer 2 · answered Mar 30 '18 at 02:15

Consider by (an object-oriented wrapper for tapply, applied to data frames), an often underused split-apply function. It does the splitting and the function call in one step and returns a list-like object holding your function's return value for each group.

# NEW GROUP COLUMN
df$grp <- rep(1:3, times=c(9,6,3))

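library(depmixS4)

# Fit a 10-state HMM with two Gaussian responses (the numeric columns V4 and V5) to one group
runIndividualWeeks <- function(data_input){
    mod <- depmix(list(V4~1, V5~1), data = data_input, nstates=10, family=list(gaussian(), gaussian()))
    fit(mod, verbose = FALSE)   # fit the model built above and return it
}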

results <- by(df, df$grp, runIndividualWeeks)
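To see what by hands back, here is a small base-R sketch. It swaps in a hypothetical stand-in function, `summariseGroup`, for the model fit so that it runs without depmixS4. The result has class "by" but can be indexed and iterated over like a list.

```r
# Same V4 values as the question's data, plus the group column.
df <- data.frame(
  V4 = c(16.2, 23.4, 29.4, 24.2, 24.2, 24.2, 24.2, 25.4, 26.8,
         25.6, 24.4, 24.4, 24.4, 25.2, 25.2, 25.2, 25.2, 25.2)
)
df$grp <- rep(1:3, times = c(9, 6, 3))

# Hypothetical stand-in for the model-fitting function.
summariseGroup <- function(data_input) {
  list(n = nrow(data_input), mean_V4 = mean(data_input$V4))
}

results <- by(df, df$grp, summariseGroup)

class(results)     # "by"
results[[1]]$n     # 9, the size of the first group

# Pull one field out of every group's result.
sizes <- vapply(results, function(r) r$n, integer(1))
```

The same indexing should apply when the function returns a fitted model instead of a small list.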
