Iterating through a multi-dimensional dataset in R

Question

I suck at loops and [l,s,v]apply, and I need to summarize a large longitudinal dataset. I have thoroughly searched about my question(s) for days and I am posting this because I cannot solve my problem.

The data looks something like this:

id var1_dose var1_unit var2_dose var2_unit var3_dose var3_unit
1         2        mL         5        mL         1        mL
2         4        mg         2        mg         4        mg
3         6       mcg         4       mcg         2       mcg
1         1        mL         1        mL         3        mL
2         3        mL         3        mL         5        mL

Question 1

I want to append through the var(1-3) doses using apply or loops and output as a list.

Question 2

I want to create a new variable to concatenate the dose and unit by the number e.g. in row 1, var1 = 2mL, var2 = 5 mL, var3 = 1mL

Your help is greatly appreciated.

Related post http://stackoverflow.com/questions/23945350 – zx8754 Nov 29 '16 at 08:17 — zx8754, Nov 29 '16 at 08:17

score 5 · Answer 1 · edited Nov 29 '16 at 05:09

5

This is not the answer you asked for but I think it is helpful for you and others. Consider instead tidying your data frame. For example:

library(tidyr)

df1 <- data.frame(id = c(1,2,3,1,2),
       var1_dose = c(2,4,6,1,3),
       var1_unit = c("mL", "mg", "mcg", "mL", "mL"),
       var2_dose = c(5,2,4,1,3),
       var2_unit = c("mL", "mg", "mcg", "mL", "mL"),
       var3_dose = c(1,4,2,3,5),
       var3_unit = c("mL", "mg", "mcg", "mL", "mL"),
       stringsAsFactors = FALSE)

df1.gather <- gather(df1, variable, value, -id)
df1.tidy <- separate(df1.gather, variable,
            into = c("variable", "measurement"), sep = "_")

head(df1.tidy)

# id variable measurement value
# 1      var1        dose     2
# 2      var1        dose     4
# 3      var1        dose     6
# 1      var1        dose     1
# 2      var1        dose     3
# 1      var1        unit    mL
# 2      var1        unit    mg
# 3      var1        unit   mcg
# 1      var1        unit    mL
# 2      var1        unit    mL

# hacky workaround to get a units column
df1.tidy <- subset(df1.tidy, measurement != "unit")
df1.tidy$unit <- rep(c("mL", "mg", "mcg", "mL", "mL"), 3)

This structure should make summary, modelling and plotting (with ggplot2) much easier.

edited Nov 29 '16 at 05:09

Maiasaura

32,226
27
104
108

answered Nov 29 '16 at 03:40

neilfws

32,751
5
50
63

I think keeping the two separate columns for dose and unit makes more sense in this case. That makes it vary easy for him to concatenate the columns to get the dose + unit for each variable. It's not so easy with your longer format. – CoderGuy123 Nov 29 '16 at 04:11
Yes, it would be better not to have the "units for units" in the value column; that was my first basic attempt! I'll refine it and edit later. – neilfws Nov 29 '16 at 04:29
1

Simpler result, more complicated path: `library(tidyverse) ; df1 %>% gather(var_dose, dose, matches('dose')) %>% gather(var_unit, unit, matches('unit')) %>% mutate_at(vars(matches('var')), parse_number) %>% filter(var_dose == var_unit) %>% select(var = var_dose, everything(), -var_unit)` Or steal `data.table::melt`'s multi-gather capabilities till Hadley implements it in tidyr: `library(data.table) ; melt(setDT(df1), id = 1, measure.vars = patterns('var._dose', 'var._unit'), variable.factor = FALSE, value.name = c('dose', 'unit'))[, variable := as.integer(variable)][]` – alistaire Nov 29 '16 at 05:40
@alistaire this is not a comment but an answer :) – zx8754 Nov 29 '16 at 07:36
As I understand it, the way Hadley wants people to do multi-gather (like with base-r `reshape`) is to first `gather` all the columns into one very long format df. Then use `separate`/`spread` to split them up to be a little wider. It's part of his underlying unix philosophy 'each function should do one simple thing and do it well'. Alistaire's method of 2x partial `gather`s works too, but involves having to filter rows batched on the matched variables. A problem that arises due to the 'cartesian join' resulted from the use of two sequential `gather`s. – CoderGuy123 Nov 30 '16 at 17:09

score 2 · Answer 2 · answered Nov 29 '16 at 02:50

Perhaps, this helps

v1 <- unique(sub("_.*", "", names(df1)[-1]))
df1[paste0(v1, "dose_unit")] <- lapply(v1, function(x)
                         do.call(paste0, df1[grep(x, names(df1))]))
df1
#  id var1_dose var1_unit var2_dose var2_unit var3_dose var3_unit var1dose_unit var2dose_unit var3dose_unit
#1  1         2        mL         5        mL         1        mL           2mL           5mL           1mL
#2  2         4        mg         2        mg         4        mg           4mg           2mg           4mg
#3  3         6       mcg         4       mcg         2       mcg          6mcg          4mcg          2mcg
#4  1         1        mL         1        mL         3        mL           1mL           1mL           3mL
#5  2         3        mL         3        mL         5        mL           3mL           3mL           5mL

Thanks so much! I had a question about the "_." replace. I encountered some trouble with the underscore and I was wondering how it affects loops. If you could shed some light, that would be great. — kquach, Nov 29 '16 at 03:31
@kquach I was just automating it to get 'var1', 'var2', 'var3' , in case you many vars. The idea is to match the `_` followed by any characters and replace it with "". So that part gets stripped off from the names to create the vector v1.. Here we are looping by each element of 'v1' and then grep those to find the columns that match it, As we need to paste the unit with dose for each vars, this does the subset — akrun, Nov 29 '16 at 04:19

score 1 · Answer 3 · answered Nov 29 '16 at 03:01

@akrun gave the correct answer. In case you want the result as a list-

df <- data.frame(id = c(1,2,3,1,2), var1_dose = c(2,4,6,1,3),
             var1_unit = c("mL","mg","mcg","mL","mL"), var2_dose = c(5,2,4,1,3),
             var2_unit = c("mL","mg","mcg","mL","mL"), var3_dose = c(1,4,2,3,5),
             var3_unit = c("mL","mg","mcg","mL","mL"))

dose_list <- lapply(seq(2,ncol(df)-1,2), function(x) paste0(df[, x],df[, x + 1]))
names(dose_list) <- c(paste0("dose_",seq(1:(ncol(df) / 2))))

score 0 · Answer 4 · answered Nov 29 '16 at 15:31

To expand on my comment on @neilfws's answer (and use his sample data), your data is best shaped in a longer form, which will make the rest of your analysis much easier. However, your data is currently in a wide form such that you have two sets of columns that need to be gathered (melted), which takes a bit more work than your average wide-to-long reshape.

One option is to gather everything and then spread back to wide. This works well, with the one issue that tidyr::spread is very particular about indices, so you'll have to add a second ID column to identify the rows of the result, and making that column takes a little thought.

library(tidyverse)

df1_tidy <- df1 %>% 
    gather(var, val, -id) %>%    # gather everything to long form
    separate(var, c('var', 'var2')) %>%    # separate "var*" from dose/unit
    group_by(var2) %>% 
    mutate(var = parse_number(var),    # extract var to integer
           id2 = seq(n())) %>%    # add ID column for spreading
    spread(var2, val, convert = TRUE) %>% 
    select(-id2)    # cleanup

## # A tibble: 15 × 4
##       id   var  dose  unit
## *  <dbl> <dbl> <int> <chr>
## 1      1     1     2    mL
## 2      1     1     1    mL
## 3      1     2     5    mL
## 4      1     2     1    mL
## 5      1     3     1    mL
## 6      1     3     3    mL
## 7      2     1     4    mg
## 8      2     1     3    mL
## 9      2     2     2    mg
## 10     2     2     3    mL
## 11     2     3     4    mg
## 12     2     3     5    mL
## 13     3     1     6   mcg
## 14     3     2     4   mcg
## 15     3     3     2   mcg

Alternately, you can gather each set separately. The issue with this approach is that it will give you combinations you don't want (var1 and var3, etc.), so you'll have to filter back to the originals.

df1_tidy <- df1 %>% 
    gather(var, dose, contains('dose')) %>% 
    gather(var_unit, unit, contains('unit')) %>% 
    mutate_at(vars(contains('var')), parse_number) %>%    # extract var numbers
    filter(var == var_unit) %>%    # filter to matching combinations
    select(-var_unit)    # cleanup

df1_tidy
##    id var dose unit
## 1   1   1    2   mL
## 2   2   1    4   mg
## 3   3   1    6  mcg
## 4   1   1    1   mL
## 5   2   1    3   mL
## 6   1   2    5   mL
## 7   2   2    2   mg
## 8   3   2    4  mcg
## 9   1   2    1   mL
## 10  2   2    3   mL
## 11  1   3    1   mL
## 12  2   3    4   mg
## 13  3   3    2  mcg
## 14  1   3    3   mL
## 15  2   3    5   mL

While tidyr does not (yet) have a multi-gather function, requiring approaches like the above, data.table's version of melt allows you to pass its measure.vars parameter regex patterns, enabling multi-gather. The grammar looks a lot different and has different things to clean up, but it takes you to the same place:

library(data.table)

dt1 <- melt(setDT(df1), 
            measure.vars = patterns('dose', 'unit'),    # set gathering patterns
            variable.factor = FALSE,    # because factor numbers are evil
            value.name = c('dose', 'unit'))    # set column names

dt1 <- dt1[, variable := as.integer(variable)][]    # cleanup

dt1
##     id variable dose unit
##  1:  1        1    2   mL
##  2:  2        1    4   mg
##  3:  3        1    6  mcg
##  4:  1        1    1   mL
##  5:  2        1    3   mL
##  6:  1        2    5   mL
##  7:  2        2    2   mg
##  8:  3        2    4  mcg
##  9:  1        2    1   mL
## 10:  2        2    3   mL
## 11:  1        3    1   mL
## 12:  2        3    4   mg
## 13:  3        3    2  mcg
## 14:  1        3    3   mL
## 15:  2        3    5   mL

Regardless of the approach you choose, once your data is tidy, combining dose and unit is easy:

# base R
df1_tidy$dose_unit <- paste0(df1_tidy$dose, df1_tidy$unit)

# tidyverse
df1_tidy <- df1_tidy %>% mutate(dose_unit = paste0(dose, unit))

# data.table
dt1 <- dt1[, dose_unit := paste0(dose, unit)][]

Iterating through a multi-dimensional dataset in R

Question 1

Question 2

4 Answers4