Here is a reproducible example:

I will start by assigning the mtcars dataset to a variable called temp.

temp = mtcars

If we reference a column in this data frame, it works as expected: the result is a 'double'.

typeof(temp[,'wt'])
'double'

Now perform a simple group_by and mutate from dplyr, then ungroup.

library(dplyr)
temp = temp %>% group_by(gear) %>% mutate(var.wt = var(wt))
temp = temp %>% ungroup()

The resulting column reference is not a double anymore but a list.

typeof(temp[,'wt'])
'list'

If I try to compute the mean of the referenced column, it doesn't work and results in the following error.

mean(temp[,'wt'])
In mean.default(temp[, "wt"]) :
  argument is not numeric or logical: returning NA

How do I perform the mean with column reference after the dplyr functions?

divibisan
Varun
  • Some more background can be found here: [Advanced R - Data frames and tibbles](https://adv-r.hadley.nz/vectors-chap.html#tibble) – markus Sep 25 '18 at 21:17

3 Answers

2

tibbles are strict about subsetting (whereas data.frames are not).

If df is a tibble then indexing with

  • [ will always return a list (a tibble to be precise), and
  • [[ will always return a vector.

This differs from data.frames, where indexing a single column with the default drop = TRUE automatically converts the result from a list to a vector.

In base R, compare your example with e.g.

# Implicit conversion to vector
mtcars[, "wt"]

and

# Simulating the "tibble way"
mtcars[, "wt", drop = FALSE]

The latter produces a similar error to the one you've experienced when you run mean(mtcars[, "wt", drop = FALSE]).
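As a minimal sketch of the rules above, the following compares the same column pulled from a data.frame and from a tibble. The names df and tb are illustrative only, and the tibble package is assumed to be installed.

```r
library(tibble)

df <- mtcars              # plain data.frame
tb <- as_tibble(mtcars)   # same data as a tibble

is.numeric(df[, "wt"])                   # TRUE: drop = TRUE collapses to a vector
is.data.frame(df[, "wt", drop = FALSE])  # TRUE: stays a one-column data.frame
is.data.frame(tb[, "wt"])                # TRUE: [ on a tibble keeps a tibble
is.numeric(tb[["wt"]])                   # TRUE: [[ extracts the vector

# mean() works once the column is a plain vector
mean(tb[["wt"]])
```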

Maurits Evers
2

The dplyr package is part of the tidyverse, which is built around a modified version of the data.frame called a tibble. A tibble behaves slightly differently from a normal data.frame.

class(temp)
[1] "data.frame"

temp2 = temp %>% group_by(gear) %>% mutate(var.wt = var(wt)) %>% ungroup()
class(temp2)
[1] "tbl_df"     "tbl"        "data.frame"

One difference is that when you subset a single column of a tibble, the result remains a tibble, rather than being converted to a vector as with a data.frame:

temp[,'wt']
 [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840
[25] 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780

temp2[,'wt']
# A tibble: 32 x 1
      wt
   <dbl>
 1  2.62
 2  2.88
 3  2.32
 4  3.22
 5  3.44
 6  3.46
 7  3.57
 8  3.19
 9  3.15
10  3.44
# ... with 22 more rows

Since mean expects to act on a vector, it returns an error when you use it with a tibble. You can either use as.data.frame to convert it back to a data.frame:

temp3 <- as.data.frame(temp2)
class(temp3)
[1] "data.frame"

mean(temp3[,'wt'])
[1] 3.21725

Or subset with $ or double brackets [[ which both return vectors:

temp2$wt
 [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840
[25] 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780

mean(temp2$wt)
[1] 3.21725

mean(temp2[['wt']])
[1] 3.21725
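
Another option, not covered above, is to stay inside dplyr. The sketch below assumes dplyr is loaded and rebuilds temp2 from mtcars so it runs on its own; wt_vec is an illustrative name. pull() extracts a single column as a vector, and summarise() computes the mean without leaving the pipeline.

```r
library(dplyr)

temp2 <- mtcars %>%
  group_by(gear) %>%
  mutate(var.wt = var(wt)) %>%
  ungroup()

# pull() returns the column as a plain vector
wt_vec <- pull(temp2, wt)
is.numeric(wt_vec)
mean(wt_vec)

# summarise() returns a one-row tibble holding the mean
temp2 %>% summarise(mean.wt = mean(wt))
```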
divibisan
1

Others have pointed out that the class of your object has changed, and that this is why you are running into a problem. But there are reasons why some classes (including the tibbles that dplyr returns) behave the way they do, and understanding them will help you build more robust code in the future.

Let's look at some objects and their classes.

Start with the mtcars dataset. It's a data.frame.

temp = mtcars
class(temp)
# [1] "data.frame"

When you subset it using the default square brackets you get a numeric vector.

temp2 <- temp[,'wt']
class(temp2)
# [1] "numeric"

When you do some work on the mtcars data using dplyr you get a tibble (aka tbl) back out.

temp3 <- temp %>% group_by(gear) %>% mutate(var.wt = var(wt)) %>% ungroup()
class(temp3)
# [1] "tbl_df"     "tbl"        "data.frame"

When you try to subset this tibble you get another tibble!!!

class(temp3[,"wt"])
# [1] "tbl_df"     "tbl"        "data.frame"

But WHY!? Well, the answer is that tibbles assume that you always want a tibble back. Data frames assume that you want a data frame back unless only one column is selected. If you are programming over an arbitrary number of columns, the tibble behavior is a good thing because your code will always behave the same way.
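
To illustrate why that consistency matters, here is a small sketch with a hypothetical helper, n_selected_rows, that counts the rows of whatever columns it is given. It assumes the tibble package is installed. With a data.frame, the single-column case silently yields a vector, so nrow() no longer returns a count.

```r
library(tibble)

# Hypothetical helper: subset by column names, then count rows
n_selected_rows <- function(data, cols) {
  nrow(data[, cols])
}

n_selected_rows(mtcars, c("wt", "mpg"))   # 32
n_selected_rows(mtcars, "wt")             # NULL: the subset is a vector, not a data.frame
n_selected_rows(as_tibble(mtcars), "wt")  # 32: the subset is still a tibble
```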

There are two ways to get a tibble to return a column as a vector. The first is to use the $ notation.

class(temp3$wt)
# [1] "numeric"

The other option is to pass drop = TRUE. This overrides the tibble's default behavior and returns the bare vector instead of a one-column tibble.

class(temp3[,"wt",drop = TRUE])
# [1] "numeric"
Adam Sampson