
Let's say I have data.frame df

df<-data.frame(a=1:5,b=101:105,c=201:205)

Can I call a subset of this data while simultaneously performing some kind of modification (e.g., arithmetic) to one of the columns (or rows) on the fly?

For example, say I want to return the first and second columns of df but with the log of column 1's values. Is there some notation to modify `df[,1:2]` to produce the following, all on the fly?

          a   b
1 0.0000000 101
2 0.6931472 102
3 1.0986123 103
4 1.3862944 104
5 1.6094379 105
Rich Scriven
theforestecologist

  • the `dplyr` package is good for this sort of thing. – John Paul Aug 07 '15 at 18:55
  • @JohnPaul the data.table package might also be a good alternative, [see my answer](http://stackoverflow.com/a/31885840/2204410) for an implementation – Jaap Aug 07 '15 at 20:19

5 Answers


This is a good example for `within()`:

within(df[1:2], a <- log(a))
#           a   b
# 1 0.0000000 101
# 2 0.6931472 102
# 3 1.0986123 103
# 4 1.3862944 104
# 5 1.6094379 105

Or, if you prefer not to have `<-` in the call, you can use `=` inside braces:

within(df[1:2], { a = log(a) })
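Either form returns a new data frame; `df` itself is left unchanged unless you reassign the result:

```r
df <- data.frame(a = 1:5, b = 101:105, c = 201:205)

# within() evaluates the assignment in a copy and returns the modified copy
res <- within(df[1:2], a <- log(a))

df$a   # still 1 2 3 4 5 -- the original is untouched
res$a  # 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
```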
Rich Scriven

An approach with data.table could be as follows:

library(data.table)
setDT(df)[, .(a=log(a),b)]
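
One caveat: `setDT()` converts `df` to a data.table by reference, so `df` itself changes class. To leave the original data.frame untouched, `as.data.table()` works on a copy:

```r
library(data.table)

df <- data.frame(a = 1:5, b = 101:105, c = 201:205)

# operates on a copy, so df stays a plain data.frame
res <- as.data.table(df)[, .(a = log(a), b)]

class(df)  # "data.frame"
res
```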

A test on large datasets:

library(data.table)
dt1 <- CJ(a = seq(1, 1e3, by=1), b = sample(1e2L), c = sample(1e2L))
df1 <- copy(dt1)
setDF(df1)

The benchmark:

library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
          dt = dt1[, .(a=log(a),b)],
          dplyr = transmute(df1, a = log(a), b = b),
          transform = transform(df1, a = log(a), b = b),
          within = within(df1, a <- log(a))[,1:2],
          twosteps = {df1<-df1[,1:2];df1[,1]<-log(df1[,1])})

       test elapsed relative
5  twosteps   0.249    1.000
4    within   0.251    1.008
3 transform   0.251    1.008
2     dplyr   0.300    1.205
1        dt   0.462    1.855

To my surprise, the data.table approach is the slowest one here, while in most other cases (e.g., one, two) it is the fastest approach.

Jaap
  • `with(df1, data.frame(a = log(a), b = b))` is always fastest for me, I guess it's pretty similar to the two step though, but it actually returns a df – rawr Aug 07 '15 at 20:33
  • 1
    @DavidArenburg Not sure what I was thinking. I used that in the benchmark, but apperently not in the first part :-(. Changed now. – Jaap Aug 09 '15 at 05:37
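For reference, the `with()` one-liner rawr mentions in the comments, written out on the small example data; it evaluates the expressions using the data frame's columns and builds the result directly:

```r
df1 <- data.frame(a = 1:5, b = 101:105, c = 201:205)

# with() evaluates the expression in the environment of df1's columns
res <- with(df1, data.frame(a = log(a), b = b))
res
```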

Or the dplyr version:

library(dplyr)
transmute(df, a = log(a), b = b)
          a   b
1 0.0000000 101
2 0.6931472 102
3 1.0986123 103
4 1.3862944 104
5 1.6094379 105

In dplyr, transmute() will return only the variables named in the call to it. Here, we've only actually transformed one of the two variables, but we've included the second one in the result by creating a copy of it. In contrast to transmute(), mutate() will return the entirety of the original data frame, along with the variables created. If you give the new variables the same names as existing ones, mutate() will overwrite those.
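That relationship can be seen side by side (a quick sketch on the same `df`):

```r
library(dplyr)

df <- data.frame(a = 1:5, b = 101:105, c = 201:205)

# mutate() keeps every column; reusing the name `a` overwrites it in place
mutate(df, a = log(a))

# transmute() keeps only the named columns -- roughly mutate() + select()
select(mutate(df, a = log(a)), a, b)
```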

One nice thing about the dplyr version is that it's easy to mix transformations and to give the results nice names, like this:

> transmute(df, a.log = log(a), b.sqrt = sqrt(b))
      a.log   b.sqrt
1 0.0000000 10.04988
2 0.6931472 10.09950
3 1.0986123 10.14889
4 1.3862944 10.19804
5 1.6094379 10.24695
ulfelder
`[`(transform(df, a = log(a)),1:2)      
#          a   b
#1 0.0000000 101
#2 0.6931472 102
#3 1.0986123 103
#4 1.3862944 104
#5 1.6094379 105

You can call a subset while carrying out a function, but it's more sleight-of-hand than a simultaneous operation, and the dplyr and other approaches essentially mask the same behavior. If it is saving space and code golfing that you are trying to accomplish, this should help. I like the look of MrFlick's suggestion, but this is a bit faster.
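The backtick form is just `[` written as a prefix call; the more familiar infix spelling does the same thing:

```r
df <- data.frame(a = 1:5, b = 101:105, c = 201:205)

# `[`(x, 1:2) is ordinary column subsetting in disguise
res <- transform(df, a = log(a))[1:2]
res
```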

Pierre L

I'm not convinced any of these is faster than the two-step method; they just do it with fewer keystrokes. Here are some benchmarks:

library(microbenchmark)
microbenchmark(dplyr = {df<-data.frame(a=1:5,b=101:105,c=201:205);df<-transmute(df, a = log(a), b = b)},
               transform = {df<-data.frame(a=1:5,b=101:105,c=201:205);df<-transform(df, a = log(a))},
               within = {df<-data.frame(a=1:5,b=101:105,c=201:205);df<-within(df[1:2], a <- log(a))},
               twosteps = {df<-data.frame(a=1:5,b=101:105,c=201:205);df<-df[,1:2];df[,1]<-log(df[,1])})

Unit: microseconds
      expr      min       lq      mean    median        uq       max neval
     dplyr 1374.710 1438.453 1657.3807 1534.0680 1658.2910  5231.572   100
 transform  489.597  508.413  764.6921  524.9240  569.4680 18127.718   100
    within  493.436  518.396  593.6254  534.9085  585.7880  1554.420   100
  twosteps  421.245  438.909  501.6850  450.6210  491.5165  2101.231   100

To demonstrate Gregor's comment below, first with 5 rows but putting the object creation outside of the benchmarking:

n = 5
df = data.frame(a = runif(n), b = rnorm(n), c = 1:n)

microbenchmark(dplyr = {df2 <- transmute(df, a = log(a), b = b)},
               subset = {df2 <- `[`(transform(df, a = log(a)),1:2)},
               within = {df2 <- within(df[1:2], a <- log(a))},
               twosteps = {df2 <- df[,1:2]; df2[,1]<-log(df2[,1])})
# twosteps looks much better!

But if you increase the number of rows to be big enough where you might care about the speed differences:

n = 1e6
df = data.frame(a = runif(n), b = rnorm(n), c = 1:n)

microbenchmark(dplyr = {df2 <- transmute(df, a = log(a), b = b)},
               subset = {df2 <- `[`(transform(df, a = log(a)),1:2)},
               within = {df2 <- within(df[1:2], a <- log(a))},
               twosteps = {df2 <- df[,1:2]; df2[,1]<-log(df2[,1])})

The differences go away.

Gregor Thomas
jeremycg
  • Good to know!! My original question was not concerned with processing speed, though. Rather, I was more interested in reducing the lines of code and not having to define or update another object. All answers thus far accomplish that well. From your answer, though, I can see `within` is probably the way to go when using this approach. Thanks! – theforestecologist Aug 07 '15 at 19:21
  • 3
    With such a small data frame (5 rows), a fairly large percentage of the time will be spent just in creating the data frame objects. It would be a better benchmark if you created the data beforehand---and then you'll see that with such a small dataframe the `twosteps` method is substantially faster, a factor of 2 to 10 (in the slow case of dplyr) times faster. *However*, on such a tiny example the difference is microseconds so it doesn't matter. If you up the size to a million row data frame the differences mostly go away. – Gregor Thomas Aug 07 '15 at 19:37