I have a list of data frames, where every data frame is similar (has the same columns with the same names) but contains information on a different, related "thing" (say, species of flower). I need an elegant way to re-categorize one of the columns in all of these data frames from continuous to categorical using the function cut(). The problem is each "thing" (flower) has different cut-points and will use different labels.

I got as far as putting the cut-points and labels in a separate list. If we're following my fake example, it basically looks like this:

iris <- iris 
peony <- iris  #pretending that this is actually different data!
flowers <- list(iris = iris, peony = peony)

params <- list(iris_param = list(cutpoints = c(1, 4.5),
                                 labels = c("low", "medium", "high")),
               peony_param = list(cutpoints = c(1.5, 2.5, 5),
                                  labels = c("too_low", "kinda_low", "okay", "just_right")))

#And we want to cut 'Sepal.Width' on both peony and iris

I am now really stuck. I have tried using some combinations of lapply() and do.call() but I'm kind of just guessing (and guessing wrong).

More generally, I want to know: how can I apply a function over the data frames in a list when each data frame needs a different set of arguments?

HFBrowning

2 Answers

I think this is a great time for a for loop. It's straightforward to write and clear:

for (petal in seq_along(flowers)) {
    flowers[[petal]]$Sepal.Width.Cut = cut(
        x = flowers[[petal]]$Sepal.Width,
        breaks = c(-Inf, params[[petal]]$cutpoints, Inf),
        labels = params[[petal]]$labels
    )
}

Note that (a) I had to pad your breaks with -Inf and Inf, because cut needs exactly one more break than it has labels, and (b) really I'm just iterating over 1, 2. A more robust version would iterate over the names of the list and, as a safety check, require the params list to have the same names. Since the names of your lists were different, I just used the indexes.
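A minimal sketch of that name-based version, assuming params has been re-keyed so that its names match those of flowers (the re-keyed list and the variables nm and p are illustrative, not part of the question's data):

```r
# Assumption: params uses the same names as flowers ("iris", "peony").
flowers <- list(iris = iris, peony = iris)
params <- list(
  iris  = list(cutpoints = c(1, 4.5),
               labels = c("low", "medium", "high")),
  peony = list(cutpoints = c(1.5, 2.5, 5),
               labels = c("too_low", "kinda_low", "okay", "just_right"))
)

# Safety check: every data frame needs a matching parameter set.
stopifnot(setequal(names(flowers), names(params)))

for (nm in names(flowers)) {
  p <- params[[nm]]
  # n cutpoints plus -Inf and Inf give n + 1 intervals, so n + 1 labels.
  stopifnot(length(p$labels) == length(p$cutpoints) + 1)
  flowers[[nm]]$Sepal.Width.Cut <- cut(
    x = flowers[[nm]]$Sepal.Width,
    breaks = c(-Inf, p$cutpoints, Inf),
    labels = p$labels
  )
}
```

Looking parameters up by name means the order of the two lists no longer matters, and a missing or misspelled entry fails at the check instead of silently pairing the wrong cut-points with a data frame.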

This could probably be done using mapply. I see no advantage to that - unless you're already comfortable with mapply the only real difference will be that the mapply version will take you 10 times longer to write.
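For comparison, here is a rough sketch of what the mapply-style version might look like, using Map (a thin wrapper around mapply). The helper cut_width and the result name flowers_cut are made up for illustration. Map pairs the two lists by position, so it relies on flowers and params being in the same order:

```r
flowers <- list(iris = iris, peony = iris)
params <- list(
  iris_param  = list(cutpoints = c(1, 4.5),
                     labels = c("low", "medium", "high")),
  peony_param = list(cutpoints = c(1.5, 2.5, 5),
                     labels = c("too_low", "kinda_low", "okay", "just_right"))
)

# Hypothetical helper: add the cut column to one data frame,
# given that data frame's own parameter set.
cut_width <- function(df, p) {
  df$Sepal.Width.Cut <- cut(
    x = df$Sepal.Width,
    breaks = c(-Inf, p$cutpoints, Inf),
    labels = p$labels
  )
  df
}

# Map calls cut_width(flowers[[1]], params[[1]]), then
# cut_width(flowers[[2]], params[[2]]), and returns a list.
flowers_cut <- Map(cut_width, flowers, params)
```

Unlike the loop, this returns a new list rather than modifying flowers in place; the result takes its names from the first list passed to Map.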

Gregor Thomas
  • I will work out how to use the names - in my real data, the names of the elements of the lists are the same. I have been avoiding for loops because people seem to hate them in R, but this was a very clear solution. Thanks! – HFBrowning Apr 21 '16 at 17:20
  • For loops are really just fine as long as you avoid other bad habits that sometimes go with them (growing objects inside loops instead of pre-allocating, not using vectorization when available). I'd suggest reading [Is R's apply family more than syntactic sugar?](http://stackoverflow.com/q/2275896/903061). – Gregor Thomas Apr 23 '16 at 16:06

I like Gregor's solution, but I'd probably stack the data instead:

library(data.table)

# rearrange parameters
params0 = setNames(params, c("iris", "peony"))
my_params = c(list(.id = names(params0)), do.call(Map, c(list, params0)))

# stack
DT = rbindlist(flowers, idcol = TRUE)

# merge and make cuts
DT[my_params, Sepal.Width.Cut := 
  as.character(cut(Sepal.Width, breaks = c(-Inf, cutpoints[[1]], Inf), labels = labels[[1]]))
, on=".id", by=.EACHI]

(I've borrowed Gregor's translation of the cutpoints, and wrapped cut() in as.character() so that each flower keeps its own labels rather than sharing one set of factor levels.) The result is:

       .id Sepal.Length Sepal.Width Petal.Length Petal.Width   Species Sepal.Width.Cut
  1:  iris          5.1         3.5          1.4         0.2    setosa          medium
  2:  iris          4.9         3.0          1.4         0.2    setosa          medium
  3:  iris          4.7         3.2          1.3         0.2    setosa          medium
  4:  iris          4.6         3.1          1.5         0.2    setosa          medium
  5:  iris          5.0         3.6          1.4         0.2    setosa          medium
 ---                                                                                  
296: peony          6.7         3.0          5.2         2.3 virginica            okay
297: peony          6.3         2.5          5.0         1.9 virginica       kinda_low
298: peony          6.5         3.0          5.2         2.0 virginica            okay
299: peony          6.2         3.4          5.4         2.3 virginica            okay
300: peony          5.9         3.0          5.1         1.8 virginica            okay

I think stacked data usually make more sense than a list of data.frames. You don't need to use data.table to stack or make the cuts, but it's designed well for those tasks.


How it works.

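  1. rbindlist stacks the data frames into one table and adds an .id column holding the name of the list element each row came from.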

  2. The code

    DT[my_params, on = ".id"]
    

    makes a merge. To see what that means, look at:

    as.data.table(my_params)
    #      .id   cutpoints                            labels
    # 1:  iris     1.0,4.5                   low,medium,high
    # 2: peony 1.5,2.5,5.0 too_low,kinda_low,okay,just_right
    

    So, we're merging this table with DT by their common .id column.

  3. When we do a merge like

    DT[my_params, j, on = ".id", by=.EACHI]
    

    this means

    • Do the merge, matching each row of my_params with related rows of DT.
    • Do j for each row of my_params, using columns found in either of the two tables.
  4. j in this case is of the form column_for_DT := cut(...), which makes a new column in DT.
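To make steps 2 to 4 concrete, here is a small sketch on toy data. The tables DT_toy and P_toy and the column names n_rows and n_breaks are invented for illustration. It uses a plain j instead of := so that the per-row evaluation is visible in the result:

```r
library(data.table)

# Toy stand-ins for the stacked data and the parameter table.
DT_toy <- data.table(.id = c("a", "a", "b"), x = c(1, 5, 3))
P_toy  <- data.table(.id = c("a", "b"),
                     cutpoints = list(2, c(2, 4)))  # list column, one entry per row

# For each row of P_toy: match the rows of DT_toy with the same .id,
# then evaluate j once. .N counts the matched rows of DT_toy;
# cutpoints[[1]] is that row's own vector of cut-points.
res <- DT_toy[P_toy,
              .(n_rows = .N, n_breaks = length(cutpoints[[1]])),
              on = ".id", by = .EACHI]
```

Here res should have one row per row of P_toy: two matched rows and one cut-point for "a", one matched row and two cut-points for "b".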

Frank