My question builds on the data.table answer to this question (full disclosure: I also asked the linked question). I have benefited greatly from other SO questions and answers, and I've spent a lot of time reading about functions, but I haven't succeeded yet.
I've got a few lines of code that work well for my purposes, but I have to run the same code for 5 different variables. Therefore, I would like to write a function to make this process more efficient.
Sample data frame:
id <- c(1, 1, 1, 1, 2, 3, 4, 4, 5, 5, 5)
bmi <- c(18, 22, 23, 23, 20, 38, 30, 31, 21, 22, 24)
other_data <- c("north_africa", "north_africa", "north_africa", "north_africa", "western_europe", "south_america", "eastern_europe", "eastern_europe", "ss_africa", "ss_africa", "ss_africa")
other_data2 <- c(0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0)
big_df <- data.frame(id, bmi, other_data, other_data2)
library(data.table)
#first make a data table with just the id and bmi columns
bmi_dt <- as.data.table(big_df[c(1, 2)])
#restructure data so that each ID only has one row
bmi_dt <- bmi_dt[, c(bmi_new = paste(bmi, collapse = "; "), .SD), by = id][!duplicated(bmi_dt$id)]
#split the strings of multiple numbers into 4 new cols
bmi_dt[, c("bmi1", "bmi2", "bmi3", "bmi4") := tstrsplit(as.character(bmi_new), "; ", fixed=TRUE)]
#make columns numeric
bmi_dt <- bmi_dt[, lapply(.SD, as.numeric), by = id]
#function to replace NA with 0 in a data table
#(i) := 0 is the current form of the deprecated i := 0, with = FALSE
func_na <- function(DT) {
  for (i in names(DT))
    DT[is.na(get(i)), (i) := 0]
}
func_na(bmi_dt)
That last part, the function, was written by Matt Dowle in this SO answer.
I have been trying to create an overall function for this sequence by starting small, but even the most basic part won't work properly. This is one of my failed attempts:
big_func <- function(DT, old_col, id_col) {
DT <- DT[, c(new_col = paste(old_col, collapse = "; "), .SD), by = id_col][!duplicated(id_col)]
DT
}
test <- big_func(bmi_dt, bmi, id)
I'd really like to understand:
a) Why doesn't my attempt work for the first part?
b) Does it make sense to create one large function for all of this?
c) If so, how do I do that?
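For (c), one possible direction is sketched below, under stated assumptions rather than as a confirmed answer. The helper name collapse_by_id and its arguments value_col, id_col and sep are hypothetical and not part of the code above. The idea is to pass column names as character strings and look them up with get() and a character by, because bare names such as bmi and id passed as function arguments are not resolved as columns of DT.

```r
library(data.table)

# Hypothetical helper (an illustration, not code from the linked answers):
# collapse one column per id, split it into numbered columns, fill NA with 0.
# value_col and id_col are column names given as character strings.
collapse_by_id <- function(DT, value_col, id_col, sep = "; ") {
  # one row per id, with all values pasted into a single string
  out <- DT[, .(collapsed = paste(get(value_col), collapse = sep)), by = id_col]

  # split the strings into as many pieces as the longest group needs
  parts <- tstrsplit(out$collapsed, sep, fixed = TRUE)
  new_names <- paste0(value_col, seq_along(parts))

  # add the pieces as numeric columns and drop the helper string column
  out[, (new_names) := lapply(parts, as.numeric)]
  out[, collapsed := NULL]

  # replace NA with 0 in the new columns only
  for (nm in new_names) {
    set(out, i = which(is.na(out[[nm]])), j = nm, value = 0)
  }
  out[]
}

# usage with the id and bmi columns of the sample data
sample_dt <- data.table(
  id  = c(1, 1, 1, 1, 2, 3, 4, 4, 5, 5, 5),
  bmi = c(18, 22, 23, 23, 20, 38, 30, 31, 21, 22, 24)
)
res <- collapse_by_id(sample_dt, "bmi", "id")
```

Because the aggregation creates a new data.table, the input is left unchanged, and the same call could be repeated for each of the other variables by changing the value_col string.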
Edit: I see now that there is a good question about reshaping data tables here. I think my question about writing functions is a separate issue.