1

This seems pretty basic, but the number of verbs in the tidyverse is huge now and I don't know which package to look in for this.

Here is the problem. I have a tibble

df <- tibble(f1 = factor(rep(letters[1:3],5)),
             c1 = rnorm(15))

Now if I use the $ operator I can easily find out how many levels are in the factor.

nlevels(df$f1)
# [1] 3

But if I use the [] operator it returns an incorrect number of levels.

nlevels(df[,"f1"])
# [1] 0

Now if df is a data.frame and not a tibble the nlevels() function works with both the $ operator and the [] operator.

So does anyone know the tidyverse equivalent of nlevels() that works on both data.frames and tibbles?

llewmills
    Note that `iris[,5]` is a vector but `as_tibble(iris)[,5]` still inherits from a `data.frame`. This is why `nlevels` is failing. Alternatives include: `nlevels(df$f1)`, `nlevels(df[,"f1",drop=TRUE])`, and `nlevels(df[["f1"]])`. – r2evans Aug 06 '20 at 22:29

3 Answers

4

Elaborating on the answer from timcdlucas (and the comments from r2evans), the issue here is the behavior of various forms of the extract operator, not the behavior of tibble. Why? A tibble is actually a kind of data.frame, as illustrated when we use the str() function on a tibble.

> library(dplyr)
> aTibble <- tibble(f1 = factor(rep(letters[1:3],5)),
+              c1 = rnorm(15))
> 
> # illustrate that aTibble is actually a type of data frame
> str(aTibble)
tibble [15 × 2] (S3: tbl_df/tbl/data.frame)
 $ f1: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 1 ...
 $ c1: num [1:15] -0.5829 0.3682 1.1854 -0.6309 -0.0268 ...

There are four forms of the extract operator in R: [, [[, $, and @, as noted in What is the meaning of the dollar sign $ in R function?.

The first form, [, can be used to extract content from vectors, lists, matrices, or data frames. When used on a tibble, it returns a tibble unless the drop = TRUE argument is included, as noted in the question comments by r2evans. A base data.frame behaves differently: selecting a single column with df[,"f1"] drops the result to a vector by default, which is why nlevels() works there.

Since the default setting of drop= in the [ method for tibbles is FALSE, it follows that df[,"f1"] returns a one-column tibble rather than a factor, which produces the unexpected or "wrong" result for the code posted with the original question.

library(dplyr)
aTibble <- tibble(f1 = factor(rep(letters[1:3],5)),
             c1 = rnorm(15))

# produces unexpected answer
nlevels(aTibble[,"f1"])

> nlevels(aTibble[,"f1"])
[1] 0
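The 0 itself can be reproduced without the tidyverse. The following is a minimal sketch in base R only, assuming that forcing drop = FALSE on a plain data.frame is a fair stand-in for the one-column tibble above; the object names d, one_col, and dropped are made up for illustration.

```r
# Base R only: mimic the tibble result by forcing drop = FALSE on a data.frame.
d <- data.frame(f1 = factor(rep(letters[1:3], 5)),
                c1 = rnorm(15))

one_col <- d[, "f1", drop = FALSE]  # still a data.frame, like aTibble[, "f1"]
dropped <- d[, "f1"]                # base default drops a single column to a factor

class(one_col)    # "data.frame"
levels(one_col)   # NULL: a data frame has no levels attribute
nlevels(one_col)  # 0, because nlevels(x) is length(levels(x))
nlevels(dropped)  # 3
```

So nlevels() is not miscounting; it is being handed a data frame, which has no levels at all.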

The drop = argument is used when extracting from matrices or arrays (i.e. any object that has a dim attribute), as explained in the help for the drop() function.

> dim(aTibble)
[1] 15  2
> 

When we set drop = TRUE, the extract function returns an object of the lowest type available; that is, all extents of length 1 are removed. In the case of the original question, drop = TRUE with the extract operator returns a factor, which is the right type of input for nlevels().

> nlevels(aTibble[,"f1",drop=TRUE])
[1] 3
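The same "extents of length 1 are removed" rule is perhaps easiest to see on a plain matrix. This is a small base R sketch; the names m, kept, and dropped_col are only for illustration.

```r
# A 2 x 3 integer matrix; selecting one column leaves an extent of length 1.
m <- matrix(1:6, nrow = 2)

kept        <- m[, 1, drop = FALSE]  # stays a 2 x 1 matrix
dropped_col <- m[, 1]                # matrices default to drop = TRUE

dim(kept)         # 2 1
dim(dropped_col)  # NULL: the length-1 extent was dropped, leaving a plain vector
```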

The [[ and $ forms of the extract operator extract a single object, so they return objects of type factor, the required input to nlevels().

> str(aTibble$f1)
 Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 1 ...
> nlevels(aTibble$f1)
[1] 3
> 
> # produces expected answer
> str(aTibble[["f1"]])
 Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 1 ...
> nlevels(aTibble[["f1"]])
[1] 3
> 
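Because [[ returns the column itself for both classes, one way to get a single call that works on data.frames and tibbles alike is a thin wrapper around it. The sketch below assumes that is what the question is after; count_levels() is a hypothetical helper name, not a function from base R or any tidyverse package, and the tibble check only runs if the tibble package happens to be installed.

```r
# count_levels() is a hypothetical helper, not part of base R or the tidyverse.
# It uses [[ so the column comes back as a factor for any data-frame-like input.
count_levels <- function(data, col) {
  nlevels(data[[col]])
}

d <- data.frame(f1 = factor(rep(letters[1:3], 5)),
                c1 = rnorm(15))
print(count_levels(d, "f1"))  # 3

# Same call on a tibble, skipped quietly when tibble is not available.
if (requireNamespace("tibble", quietly = TRUE)) {
  print(count_levels(tibble::as_tibble(d), "f1"))  # 3
}
```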

The fourth form of the extract operator, @ (known as the slot operator), is used with formally defined objects built with the S4 object system, and is not relevant for this question.

Conclusion: Base R is still relevant when using the Tidyverse

Per tidyverse.org, the tidyverse is a collection of R packages that share an underlying philosophy, grammar, and data structures. When one becomes familiar with the tidyverse family of packages, it's possible to do many things in R without understanding the fundamentals of how Base R works.

That said, when one incorporates Base R functions or functions from packages outside the tidyverse into tidyverse-style code, it's important to know key Base R concepts.

Len Greski
    Thank you @Len Greski, an articulate and well thought out answer. I have to accept it for future readers. – llewmills Aug 07 '20 at 22:21
3

I think you might need to use [[ rather than [, e.g.,

> nlevels(df[["f1"]])
[1] 3
ThomasIsCoding
2

df[,"f1"] returns a tibble with one column. So you're doing nlevels on an entire tibble which doesn't make sense.

df %>% pull('f1') %>% nlevels

gives you what you want.

timcdlucas