1

I have a simple dataset that looks like this:

enter image description here

I want to transform the price variables into numeric, which I understand requires first cleaning out the € symbol and changing the comma into a dot. I have done this manually for the price_new variable and it works as intended:

 sp_merrel_df_clean <- sp_merrel_df_raw %>%
    mutate(
      price_new = sub(" €", "", price_new),
      price_new = sub(",", ".", price_new),
      price_new = as.numeric(price_new)
    )

However, as I'm new to R, I wanted to turn this into a more general function for practice. So I used the exact same code as above, but replacing the specific inputs with generic placeholders - presumably it would work, since it's the exact same code...

  currency_to_numeric <- function(raw, clean, var) {
    clean <- raw %>%
      mutate(
        var = sub(" €", "", var),
        var = sub(",", ".", var),
        var = as.numeric(var))
  }

...but instead, when calling:

currency_to_numeric(raw = sp_merrel_df_raw, clean = sp_merrel_df_clean2, var = price_new)

I get this error:

Error: Problem with `mutate()` column `var`.
i `var = sub(" \200", "", var)`.
x object 'price_new' not found

Apologies for what is probably a very basic question, and thank you in advance!

R_Beginner98
  • 35
  • 1
  • 5
  • 1
    You should research programmatic use of `dplyr`; I'd start with https://dplyr.tidyverse.org/articles/programming.html#transforming-user-supplied-variables-1. – r2evans Jul 20 '21 at 19:05
  • 1
    Hello, and welcome! Your problem is the `var` parameter. **R passes by value**, so when your `currency_to_numeric()` function receives `price_new` as its `var`, it checks for the **value** of `price_new`. Since the variable `price_new` is decontextualized — and no `price_new` exists in the `.GlobalEnv`ironment — R cannot find it. Only within the context of `sp_merrel_df_clean`, and where `price_new` is a [masked variable](https://dplyr.tidyverse.org/articles/programming.html), can you toss around such variable names and expect R to recognize them as belonging to a "nearby" `data.frame`. – Greg Jul 20 '21 at 19:13
  • 1
    Also, since (again) R passes by value and **not by reference**, the line `clean <- raw %>%` will do **absolutely nothing**. It will neither take the original variable whose _reference_ is somehow passed as the parameter `clean`, nor will it somehow mutate that original variable. Instead you must simply put `raw %>%` in that line, and then overwrite the original variable with the **result** of `currency_to_numeric()`. – Greg Jul 20 '21 at 19:20
  • I think the problem here is passing arguments to functions that use dplyr nonstandard evaluation – Bill O'Brien Jul 20 '21 at 19:34
  • @Greg Is that what's causing the object not found error though? – Bill O'Brien Jul 20 '21 at 19:38
  • @BillO'Brien I think I've seen this exact situation before, and looking for a decontextualized ["data-variable"](https://dplyr.tidyverse.org/articles/programming.html#data--and-env-variables) is the root cause of that particular error: `object 'price_new' not found`. – Greg Jul 20 '21 at 19:41

2 Answers2

1

Below, I have included a very ergonomic Solution, which (almost) seamlessly mimics the familiar feel of the dplyr workflow. I have also spent some time in diagnosing and addressing conceptual pitfalls.

A common source of confusion is the fact that R passes by value and not by reference. Along with the nuances of programmatic dplyr, this fact is responsible for two conceptual errors in your code.

For the sake of convenience, I have reproduced your sp_merrel_df_raw here as a data.frame:

structure(list(product_name = c("Merrell Riverbed 3", "Sapatilhas Montanha Merrell", "Merrell Moab Adventure", "Merrell Moab 2 Vent"),
               price_new = c("59,99 €", "149,99 €", "99,99 €", "79,99 €"),
               price_old = c("69,99 €", NA, NA, "99,99 €"),
               date = c(210720, 210720, 210720, 210720)),
          row.names = c(NA, -4L),
          class = "data.frame")

You Cannot Modify "in Place"

Serendipitously, I clarified this exact confusion some time ago. Suppose you have a simple numeric variable a and a simple function doubler():

x <- 2

doubler <- function(num) {
  num <- 2 * num
}

Now simply running doubler(x) will do absolutely nothing aside from (invisibly) returning 4. All that happens is that the parameter num is passed the value 2, and then num is overwritten with 4 within the scope of the function. However, the original variable x remains untouched:

doubler(x)

x
# [1] 2

In order to modify x, we must overwrite it (<-) with the results of the function:

x <- doubler(x)

x
# [1] 4

Analogously, when you run your currency_to_numeric() function

currency_to_numeric(raw = sp_merrel_df_raw, clean = sp_merrel_df_clean2, var = price_new)

it will accept the value of sp_merrel_df_clean2, and assign that value to its clean parameter. Everything that happens afterward

    clean <- raw %>% # ...

affects only clean within the scope of the function. When all is said and done, sp_merrel_df_clean2 will never be affected.

Instead, something like this is required, to overwrite sp_merrel_df_clean2 with the new value:

currency_to_numeric <- function(raw, ...) {
  # ...
}

sp_merrel_df_clean2 <- currency_to_numeric(raw = sp_merrel_df_raw, ...)

Decontextualized Variables

As discussed in the dplyr documentation

env-variables are “programming” variables that live in an environment

whereas

data-variables are “statistical” variables that live in a data frame.

Now data-variables are "masked" in the context of certain functions, especially in dplyr. Such masking lets us refer to the sp_merrel_df_raw$price_new column as simply price_new when we perform (say) a mutate() on sp_merrel_df_raw:

sp_merrel_df_raw %>%
  mutate(
    price_new = sub(" €", "", price_new)
    # ...
  )

However, when you run your currency_to_numeric() function

currency_to_numeric(raw = sp_merrel_df_raw, clean = sp_merrel_df_clean2, var = price_new)

var does not become the sp_merrel_df_raw$price_new variable itself.

Rather, R looks for some env-variable named price_new, in surrounding environment (here .GlobalEnv), and attempts to assign its value to the var parameter. Naturally, since no such price_new variable exists in .GlobalEnv, there is no such value, so R throws an error as soon as it tries to use that value in mutate():

Error: Problem with `mutate()` column `var`.
i `var = sub(" \200", "", var)`.
x object 'price_new' not found

This error is comparable to what you would get if you called a function on any other variable that didn't exist:

doubler(num = nonexistent_variable)

# Error in doubler(num = nonexistent_variable) : 
#   object 'nonexistent_variable' not found

However, even if price_new were actually floating around in .GlobalEnv as a typical env-variable, you would still get an error. This is because passing the value of price_new to var is not the same as "pasting" the "name" price_new

mutate(
        price_new = sub(" €", "", price_new),
        price_new = sub(",", ".", price_new),
        price_new = as.numeric(price_new))

wherever the "name" var used to be.

mutate(
        var = sub(" €", "", var),
        var = sub(",", ".", var),
        var = as.numeric(var))

Solution

Here's a nifty reworking of currency_to_numeric () that closely imitates the typical functionality of dplyr:

currency_to_numeric <- function(raw, ...) {
  raw %>%
    mutate(
      across(c(...), ~ sub(" €", "", .x)),
      across(c(...), ~ sub(",", ".", .x)),
      across(c(...), ~ as.numeric(.x))
      )
}

As with virtually any R function, you must still assign the results to sp_merrel_df_clean2, but this solution will help you do so very cleanly

sp_merrel_df_clean2 <- sp_merrel_df_raw %>%
  currency_to_numeric(price_new)

with the following results for sp_merrel_df_clean2:

                 product_name price_new price_old   date
1          Merrell Riverbed 3     59.99   69,99 € 210720
2 Sapatilhas Montanha Merrell    149.99      <NA> 210720
3      Merrell Moab Adventure     99.99      <NA> 210720
4         Merrell Moab 2 Vent     79.99   99,99 € 210720

In fact, you can simultaneously target as many data-variables (like price_new and price_old) as you want

sp_merrel_df_clean2 <- sp_merrel_df_raw %>%
  currency_to_numeric(price_new, price_old)

and covert all your currency columns in one fell swoop!

                 product_name price_new price_old   date
1          Merrell Riverbed 3     59.99     69.99 210720
2 Sapatilhas Montanha Merrell    149.99        NA 210720
3      Merrell Moab Adventure     99.99        NA 210720
4         Merrell Moab 2 Vent     79.99     99.99 210720
Greg
  • 3,054
  • 6
  • 27
0

You can use {{}} ('curly-curly') syntax to pass in unquoted variables.

library(tidyverse)

currency_to_numeric <- function(df, col) {
  df %>% 
    mutate(
      price_new = sub(" €", "", {{col}}),
      price_new = sub(",", ".", {{col}}),
      price_new = as.numeric({{col}})
    )
}

currency_to_numeric(raw, price_new)

# A tibble: 4 x 4
  product_name price_new price_old date      
  <chr>            <dbl> <chr>     <date>    
1 a                 0.54 0,53 €    2021-07-19
2 b                 0.41 1,3 €     2021-07-18
3 c                 0.08 1,66 €    2021-07-17
4 d                 0.35 0,25 €    2021-07-16

Example data


product_name <- letters[1:4]
price_new <- paste(round(abs(rnorm(4)), 2), "€") %>% str_replace(., "\\.", ",")
price_old <- paste(round(abs(rnorm(4)), 2), "€") %>% str_replace(., "\\.", ",")
date <- lubridate::today() - 1:4

raw <- tibble(product_name, price_new, price_old, date)

If you want to update both price_ columns in the same command, you could rewrite currency_to_numeric to operate on a column, instead of an entire data frame. Like this:

currency_to_numeric2 <- function(x) {
  updated = sub(" €", "", x)
  updated = sub(",", ".", updated)
  updated = as.numeric(updated)
}

raw %>% mutate(across(starts_with("price"), currency_to_numeric2))

# A tibble: 4 x 4
  product_name price_new price_old date      
  <chr>            <dbl>     <dbl> <date>    
1 a                 0.54      0.53 2021-07-19
2 b                 0.41      1.3  2021-07-18
3 c                 0.08      1.66 2021-07-17
4 d                 0.35      0.25 2021-07-16
andrew_reece
  • 20,390
  • 3
  • 33
  • 58
  • The last solution with `currency_to_numeric2()` is nifty — especially in permitting `tidy-select`ion of multiple variables — and it was my first instinct before I generalized my own `currency_to_numeric()`. However, **that first solution simply doesn't fulfill OP's intent**. OP wants to find and transform the given [data-variable](https://dplyr.tidyverse.org/articles/programming.html#data--and-env-variables) `var` in `raw`, from `character` (with `,` and `€`) to `numeric`. Yet your `currency_to_numeric()` will **always _overwrite_ `price_new` _specifically_**, with the transformation of `var`. – Greg Jul 21 '21 at 03:33
  • Hey @Greg - thanks for the note. But not sure I understand; OP provided code and wrote "it works as intended", and that code writes over `price_new`. The first function reproduces the intended output. If you run the code you can also see that the original `raw` tibble is left unchanged. – andrew_reece Jul 21 '21 at 03:48
  • Well, the working code provided by OP is not a _function_, and certainly not a function that operates on a parameter `raw`. Rather, it is a (correct) `dplyr` _workflow_, which is written specifically to target the `price_new` variable. When OP generalizes with their `currency_to_numeric()` function, their intent is to take _any_ dataset as the `raw` parameter, transform _any_ `var` variable present in `raw`, and assign the result to `clean` by reference (mistakes were made). Point is: OP expects to pass (say) `price_old` to `currency_to_numeric()` as `var`, and have `price_old` be transformed. – Greg Jul 21 '21 at 04:23