Writing a function that restructures data so that each ID has only one row using data table

Question

My question builds on the data table answer to this question (full disclosure: linked question was also asked by me). I have benefitted greatly from other SO questions and answers as well, and I've spent a lot of time reading about functions but haven't succeeded yet.

I've got a few lines of code that work well for my purposes, but I have to run the same code for 5 different variables. Therefore, I would like to write a function to make this process more efficient.

Sample data frame:

    id <- c(1, 1, 1, 1, 2, 3, 4, 4, 5, 5, 5)
    bmi <- c(18, 22, 23, 23, 20, 38, 30, 31, 21, 22, 24)
    other_data <- c("north_africa", "north_africa", "north_africa", "north_africa", "western_europe", "south_america", "eastern_europe", "eastern_europe", "ss_africa", "ss_africa", "ss_africa")
    other_data2 <- c(0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0)

    big_df <- data.frame(id, bmi, other_data, other_data2)


    library(data.table)

    #first make a data table with just the id and bmi columns
    bmi_dt <- as.data.table(big_df[c(1, 2)])

    #restructure data so that each ID only has one row
    bmi_dt <- bmi_dt[, c(bmi_new = paste(bmi, collapse = "; "), .SD), by = id][!duplicated(bmi_dt$id)]

    #split the strings of multiple numbers into 4 new cols
    bmi_dt[, c("bmi1", "bmi2", "bmi3", "bmi4") := tstrsplit(as.character(bmi_new), "; ", fixed=TRUE)]

    #make columns numeric
    bmi_dt <- bmi_dt[, lapply(.SD, as.numeric), by = id]

    #function to replace NA with 0 in a data table (modifies DT by reference)
    func_na <- function(DT) {
       for (i in names(DT))
          DT[is.na(get(i)), (i) := 0]
    }

    func_na(bmi_dt)

That last part, the function, was written by Matt Dowle in this SO answer.

I have been trying to create an overall function for this sequence by starting small, but even the most basic part won't work properly. This is one of my failed attempts:

    big_func <- function(DT, old_col, id_col) {
      DT <- DT[, c(new_col = paste(old_col, collapse = "; "), .SD), by = id_col][!duplicated(id_col)]
      DT
    }  

    test <- big_func(bmi_dt, bmi, id)

I'd really like to understand:

a) Why doesn't my attempt work for the first part?

b) Does it make sense to create one large function for all of this?

c) If so, how do I do that?

Edit: I see now that there is a good question about reshaping data tables here. I think my question about writing functions is a separate issue.

Just wanted to let you know that the title of your question is not very informative, the example is not great either (it lacks the desired output), and it is not clear what you are asking. Note that `..., by = id_col][!duplicated(id_col)]` is very different from `..., by = id][!duplicated(bmi_dt$id)]`: the first has no effect because it uses the locally available value of `id_col` (already processed by `by`), while the latter uses the original column, not processed by `by`. – jangorecki, Feb 25 '16 at 01:08

score 1 · Accepted Answer · answered Feb 25 '16 at 09:18


You can avoid all this pasting/splitting/converting/replacing with:

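    library(data.table)

    big_dt <- as.data.table(big_df)
    #number the measurements within each id: 1, 2, 3, ...
    big_dt[, id_bmi := 1:.N, by = id]
    #one row per id, one column per measurement, missing cells filled with 0
    dcast(big_dt[, list(id, id_bmi, bmi)], id ~ id_bmi, value.var = 'bmi', fill = 0)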
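
To reuse these steps for other columns, one option is to wrap them in a function that takes column names as character strings. The following is only a minimal sketch under that assumption: `widen_by_id`, its arguments `value_col` and `id_col`, and the index name `obs` are hypothetical names, and the sample data is rebuilt so the block runs on its own. Passing unquoted names, as in `big_func(bmi_dt, bmi, id)`, does not behave as intended because `bmi` and `id` are then looked up as ordinary R objects in the calling environment rather than as columns of `DT`.

```r
library(data.table)

# Sample data from the question (only the columns needed here)
big_df <- data.frame(
  id  = c(1, 1, 1, 1, 2, 3, 4, 4, 5, 5, 5),
  bmi = c(18, 22, 23, 23, 20, 38, 30, 31, 21, 22, 24)
)

# Hypothetical helper: one row per id, one column per repeated measurement.
# Column names are passed as strings, not as unquoted symbols.
widen_by_id <- function(DT, value_col, id_col = "id") {
  # Subset to the needed columns; this creates a new table,
  # so the caller's data is not modified by := below
  dt <- as.data.table(DT)[, c(id_col, value_col), with = FALSE]
  # Number the measurements within each id: 1, 2, 3, ...
  dt[, obs := seq_len(.N), by = c(id_col)]
  # Build the formula "<id_col> ~ obs" from the string and reshape to wide
  wide <- dcast(dt, as.formula(paste(id_col, "~ obs")),
                value.var = value_col, fill = 0)
  # Rename the measurement columns, e.g. "1" becomes "bmi1"
  obs_cols <- setdiff(names(wide), id_col)
  setnames(wide, obs_cols, paste0(value_col, obs_cols))
  wide[]
}

bmi_wide <- widen_by_id(big_df, "bmi")
```

The same call can then be repeated per variable, for example `widen_by_id(big_df, "other_data2")` on the full sample data frame, instead of copying the block of code five times.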

answered Feb 25 '16 at 09:18

danas.zuokas


Is it possible to scale this out for multiple variables at once? Why is the "list" portion needed in the last command? – epi_n00b Feb 25 '16 at 18:21
What do you mean by scaling out for multiple variables? Inside `list` one specifies variables that are taken, so `big_dt[, list(id, id_bmi, bmi)]` is a data table with three variables. – danas.zuokas Feb 25 '16 at 19:48
What I meant to ask was how would I need to edit the line `big_dt[, id_bmi := 1:.N, by = id]` if I had another variable for which I wanted to repeat the same process, for example, the variable "other_data2" from my original sample data frame? Is it possible to expand on the code you have written to use dcast for multiple variables without doing any copying and pasting? – epi_n00b Feb 25 '16 at 21:07

If I understood correctly: `big_dt[, id2 := 1:.N, by = id]` and `dcast(big_dt[, list(id, id2, var)], id ~ id2, value.var = 'var', fill = 0)`. – danas.zuokas Feb 26 '16 at 07:17
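
For several variables at once, `dcast` for data tables accepts a character vector in `value.var`, so the index column only has to be created once. A minimal sketch using the `bmi` and `other_data2` columns from the question's sample data (the index name `obs` is an arbitrary choice, not something from the answer above):

```r
library(data.table)

big_dt <- data.table(
  id          = c(1, 1, 1, 1, 2, 3, 4, 4, 5, 5, 5),
  bmi         = c(18, 22, 23, 23, 20, 38, 30, 31, 21, 22, 24),
  other_data2 = c(0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0)
)

# One index per id, shared by every variable that is reshaped
big_dt[, obs := seq_len(.N), by = id]

# Passing several columns to value.var casts them all in one call;
# the result has columns named like bmi_1, ..., other_data2_1, ...
wide <- dcast(big_dt, id ~ obs, value.var = c("bmi", "other_data2"), fill = 0)
```

Note that `fill = 0` makes a missing measurement indistinguishable from a true 0 in `other_data2`; leaving `fill` at its default keeps those cells as `NA` if that distinction matters.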
