
For example, in Pandas, you always refer to a column in a DataFrame by its name as a string:

import pandas as pd

df = pd.DataFrame(list(range(1, 10)), columns=["a"])
df["a"]

But in R, including some of its packages such as data.table and dplyr, you can refer to a column without quotes, like this:

library(data.table)
dt <- data.table(a = 1:10)
dt[, .(a)]

In my opinion, referring to column names unquoted is a disaster. The only benefit you get is that you don't need to type the quotes (""). But the downsides are unlimited:

1) Very often you will need to select columns programmatically. With unquoted column names, you need to distinguish variables in the "outer" (calling) context from column names in the "inner" (table) context.

col_name <- "a"
dt[,..col_name]

2) Even if you manage to select the columns specified in a vector of strings, it's very hard to do (complex) operations on them. As mentioned in this question, you need to do it this way:

diststr = "dist"
valstr = "val"

x[get(valstr) < 5, c(diststr) := 
get(diststr)*sum(get(diststr))]
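
For comparison, here is a minimal sketch of how both points might look in pandas, where the column names are already strings. The data values are made up for illustration (only the names `dist` and `val` come from the snippet above), and the sum is taken over the filtered rows only, which is my understanding of how data.table evaluates `j` when `i` is given:

```python
import pandas as pd

# Hypothetical data: only the column names "dist" and "val" come from the
# R snippet above; the values are invented for this sketch.
x = pd.DataFrame({"dist": [1.0, 2.0, 3.0, 4.0], "val": [3, 4, 5, 6]})

diststr = "dist"
valstr = "val"

# Point 1: a column name held in a variable needs no special prefix.
selected = x[[diststr]]

# Point 2: for rows where val < 5, replace dist with dist * sum(dist),
# where the sum covers the filtered rows only (assumed data.table semantics).
mask = x[valstr] < 5
x.loc[mask, diststr] = x.loc[mask, diststr] * x.loc[mask, diststr].sum()

print(x)
```

The same variables work for both selection and computation, with no need for `get()` or a `..` prefix.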

All in all, my feeling is that wrangling data in R is not nearly as straightforward or natural as it is in pandas. Could someone please explain whether there are any upsides to this?

Catiger3331
  • but from base R also see `with()`, `within()`, `subset()`, model formulas, ... pros and cons of *non-standard evaluation* are a huge can of worms, but I voted to close as opinion-based ... – Ben Bolker Nov 29 '18 at 20:50
  • This is opinion-based, but "the downsides are unlimited" is false. I'm not as familiar with `data.table`, but `dplyr` evaluation is completely unambiguous. And yes, it is to save typing. Typing `""` requires 2 to 4 keystrokes (2 if you use RStudio), and when you have to type so many variables, it adds up. – thc Nov 29 '18 at 20:55
  • Also in RStudio, unquoted variables allow for tab autocompletion of variable names, which may not have been possible with strings. – thc Nov 29 '18 at 20:56

1 Answer


In Pandas you can refer to suitably named columns without quotes, e.g.:

import pandas as pd

df = pd.DataFrame(dict(
  a=[1,2,3],
  b=[5,6,7],
))
print(df.a)

is valid and concise, and similar syntax (`df$a`) works in R.
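
"Suitably named" matters here. As a rough sketch (the column names `my col` and `count` are made up for illustration), attribute access only works when the name is a valid Python identifier and does not collide with an existing DataFrame attribute or method; otherwise you fall back to the string form:

```python
import pandas as pd

# Hypothetical columns chosen to show where attribute access breaks down.
df = pd.DataFrame({"a": [1, 2, 3], "my col": [4, 5, 6], "count": [7, 8, 9]})

# "a" is a valid identifier with no name clash, so df.a returns the column.
attr_ok = df.a.tolist()

# "count" clashes with the DataFrame.count method, so df.count is the
# method rather than the column.
shadowed = callable(df.count)

# String indexing works for every column, including names with spaces.
by_string = df["count"].tolist()
spaced = df["my col"].tolist()
```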

The choice depends on how much the code's author knows about the dataset and what is convenient at the time: for quick analyses unquoted access is great, for more repeatable workflows it can be awkward.

I also tend to use unquoted variable accessors a lot when working with databases, since column names are basically always valid identifiers:

df = pd.read_sql('select a, b from foo', dbcon)
df.a

or

df <- dbGetQuery(dbcon, 'select a, b from foo')
df$a

for Pandas and R respectively…
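
To try the Pandas variant without a real database, something like the following sketch should work. It assumes an in-memory SQLite database as a stand-in for `dbcon`, and the contents of table `foo` are invented for illustration:

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite database stands in for the real connection.
dbcon = sqlite3.connect(":memory:")
dbcon.execute("create table foo (a integer, b text)")
dbcon.executemany("insert into foo values (?, ?)", [(1, "x"), (2, "y")])

df = pd.read_sql("select a, b from foo", dbcon)

# Column names from the query are valid identifiers, so attribute access works.
a_values = df.a.tolist()
b_values = df.b.tolist()

dbcon.close()
```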

Each language/library provides the tools; it's up to you to use them appropriately!

Sam Mason