After reading the convincing book R for Data Science I was excited about all the tidyverse functions, especially the transformation and data wrangling components dplyr and tidyr. It seemed that coding with those saves a lot of time and results in better readability compared to base R. But the more I use dplyr, the more I encounter situations where the opposite seems to be the case. In one of my last questions I asked how to replace rows with NAs if one of the variables exceeds some threshold. In base R I would simply do:

df[df$age > 90, ] <- NA
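To make the base R one-liner above concrete, here is a minimal sketch on a toy data frame. The column names x, y and age follow the question; the values and the object names (masked, df_missing, masked_safe) are invented for illustration. The which() variant is an addition not present in the question: it assumes you also want to cope with missing ages, because a logical row index containing NA is rejected in data frame assignment.

```r
# Toy data: column names follow the question, the values are invented.
df <- data.frame(
  x   = 1:4,
  y   = c("a", "b", "c", "d"),
  age = c(25, 95, 40, 91)
)

# Logical row index: every column of the rows with age > 90 becomes NA.
masked <- df
masked[masked$age > 90, ] <- NA

# Hypothetical variant: age itself contains a missing value.
# which() keeps only the positions that are TRUE and drops the NA,
# so the row with the unknown age is left untouched.
df_missing <- df
df_missing$age[1] <- NA
masked_safe <- df_missing
masked_safe[which(masked_safe$age > 90), ] <- NA

print(masked)
print(masked_safe)
```

Rows 2 and 4 are blanked in both results, while the remaining rows keep their values.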

The two answers suggested using

df %>% select(x, y, age) %>% mutate_all(~replace(.x, age> 90, NA))
# or
df %>% mutate_all(function(i) replace(i, .$age> 90, NA))

Both answers are great and I am thankful to get them. Still, the code in base R seems so much simpler to me. Now I am facing another situation where my code with dplyr is much more complicated, too. I am aware that it is a subjective impression whether some code is complicated, but putting it in a more objective way I would say that nchar(dplyr_code) > nchar(base_code) in many situations.
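The length comparison mentioned above can be checked directly. This is only a sketch of the measure: the two snippets are copied from the question and stored as strings, so they are measured rather than executed, and the variable name code_length is made up for the example.

```r
# Snippets copied from the question; kept as strings so nothing is executed.
base_code  <- "df[df$age > 90, ] <- NA"
dplyr_code <- "df %>% select(x, y, age) %>% mutate_all(~replace(.x, age> 90, NA))"

# Character counts of both snippets, side by side.
code_length <- c(base = nchar(base_code), dplyr = nchar(dplyr_code))
print(code_length)

# TRUE if the dplyr version is the longer one.
print(code_length[["dplyr"]] > code_length[["base"]])
```

Character count says nothing about readability or run time; it is just the one measure that can be stated without opinion.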

Further, I noticed that I seem to encounter this more often if the code I need to write is about operations on rows rather than on columns. It can be argued that one can use tidyr from the tidyverse to transpose the data in order to turn rows into columns. But even that seems much more complicated in the tidyverse framework than in base R (see here).

My question is whether I am facing this problem because I am quite new to the tidyverse or whether coding with base R really is more efficient in some situations. If the latter is the case: are there resources that summarize on an abstract level when it is more efficient to code with base R versus the tidyverse, or can you name some situations? I am asking because sometimes I spend quite some time figuring out how to solve something with the tidyverse, and in the end I notice that base R is much more convenient in this situation. Knowing when to use the tidyverse or base R for data wrangling and transformation would save me a lot of time.

If this question is too broad, please let me know and I will try to rephrase or delete the question.

  • The advantage claimed by many tidyverse advocates is code clarity and pedagogy. I see great value in tidyverse for some actions, but as you noted, in some cases the code is far more complex and, to me at least, often less readable. tidyverse packages seem to require you to buy into the whole ecosystem, even when base R (or other packages) provide shorter, faster, more comprehensible alternatives. – alan ocallaghan Jan 29 '20 at 09:10
  • To add on to that, the tidyverse is designed around tidy data where each column is a variable and each row an observation. Because of this, doing something like making an entire row `NA` essentially removes the observation, so I think a tidy equivalent of your code would actually be `filter(df, age <= 90)`, to remove the row entirely (might not actually be what you want, however). – caldwellst Jan 29 '20 at 09:17
  • Yes, understood, that's because as I said above, tidyverse is designed around the idea of 1 column as 1 variable, which is why it's ill-designed for functions operating by row, as you noted. – caldwellst Jan 29 '20 at 09:22
  • I don't know about too broad, but this question is asking for either opinions or, per your bold question, off-site resources, both of which are off-topic. – TylerH Jan 29 '20 at 14:23
  • @TylerH It is actually more complex than that. It might be the case that there is a general rule that says in which situations it is less efficient to use the syntax of `tidyverse`. If such a rule exists the question is in fact not opinion-based. By referring to the question as opinion-based you implicitly answer the question with "there is no such general rule, just different subjective situations". And I am not sure whether this is the case and whether you and the others voting to close the question are aware of that circumstance. –  Jan 29 '20 at 15:22
  • @machine You need to define what you mean by "efficient" in *objective* terms to avoid that being opinion-based. There's not much complexity there. – TylerH Jan 29 '20 at 15:24
  • @TylerH I did so with code length. For example: `nchar("df[df$age > 90, ] <- NA") < nchar("df %>% select(x, y, age) %>% mutate_all(~replace(.x, age> 90, NA)) ")`. Whether this is a good measure of efficiency is another question, but it is an objective one. –  Jan 29 '20 at 15:27
  • @machine knowing whether code length is a good measure of efficiency is **irrelevant**. What matters is whether you are *choosing it as the objective measure of "efficiency"* for the purpose of your question. If you are, you need to edit the question to explicitly state this. And you need to remove your request for off-site resources. Even then, "how do I know when it will take less code to use vanilla R or some R package for some task" is still fairly broad, likely only answerable with "you have to write it out to see for yourself". – TylerH Jan 29 '20 at 16:06
  • @TylerH thank you. I noticed another problem with the question: it is not directly about code. As I understand it, the question therefore does not fit SO. –  Jan 29 '20 at 16:31
  • Base R code often seems more intuitive for people with experience in other programming languages. tidyverse is more "SQL-like". – David Feb 02 '21 at 08:54

1 Answer

If you have a clean, readable and functioning solution in base R that seems more appropriate, why would you go for an additional layer? Perhaps to keep the same interface (pipes) within a script, to improve readability? But as you argue, this is not always guaranteed with tidyverse compared to base R.

A main difference is:

Base R is highly focused on stability, whereas the tidyverse will not guarantee this. From their own documentation: "tidyverse will make breaking changes in the search for better interfaces" (https://tidyverse.tidyverse.org/articles/paper.html).

This makes base R in some cases a better partner for production environments, as you may find tidyverse functions being deprecated and changed over time. I myself prefer as few dependencies as possible in packages.

Arcoutte