I'd like to use the new native pipe, |>, with purrr::map_dfr(). (To make the example reproducible, I'm passing the datasets as strings instead of file paths, but that shouldn't make a difference.)

csvs <- c(
  "csv_a" = "a,b,c\n1,2,3\n4,5,6",
  "csv_b" = "a,b,c\n-1,-2,-3"
)
col_types <- readr::cols(.default = readr::col_character())

# Approach 1
csvs |> 
  purrr::map_dfr(
    .f = function(p) {
      readr::read_csv(
        file = I(p),
        col_types = col_types
      )
    }
  )

# Approach 2
library(magrittr)
csvs %>%
  purrr::map_dfr(
    .x = .,
    .f = ~readr::read_csv(
      file      = I(.),
      col_types = col_types
    )
  )

I have two questions, mostly to continue my understanding of the native pipe.

Question 1

How do I replace the explicit function(p) part with the new {\(x)...}() syntax? The attempt below throws "Error in standardise_path(file) : argument "p" is missing, with no default".

csvs |> 
  purrr::map_dfr(
    .f = 
      {\(p)
        readr::read_csv(
          file      = I(p),
          col_types = col_types
        )
      }()
  )

Question 2

Can I also mimic the magrittr approach (#2)? This somehow reads each row twice, including the header.

csvs |> 
  {\(p)
    purrr::map_dfr(
      .x = p,
      .f = ~readr::read_csv(
        file      = I(p),
        col_types = col_types
      )
    )
  }()

# Produces
# A tibble: 8 x 3
  a     b     c    
  <chr> <chr> <chr>
1 1     2     3    
2 4     5     6    
3 a     b     c    
4 -1    -2    -3   
5 1     2     3    
6 4     5     6    
7 a     b     c    
8 -1    -2    -3   

edit: In response to @MrFlick's comment, I've wrapped the argument to file with I() in case that becomes a requirement in future versions of readr (it seems to work fine now without it). If you're passing typical file paths (instead of literal strings), remove the call to I().

wibeasley
  • is `csvs |> purrr::map_dfr( readr::read_csv )` not sufficient? – Onyambu Aug 26 '21 at 00:11
  • Oops, I made it too minimal. I'm going to revise it with a 2nd argument to `read_csv()`. – wibeasley Aug 26 '21 at 00:19
  • You are already using `tidyverse` functions, why do you need the native pipeOP? – Onyambu Aug 26 '21 at 00:21
  • I may not understand your question about why. Are you asking why I'd use the native pipe (`|>`) when the tidyverse packages already load the magrittr pipe (`%>%`)? If so, it's because I want to learn how to do it, and I'm guessing this need will arise when I use non-tidyverse packages too. – wibeasley Aug 26 '21 at 00:29
  • No that is not my question. Why do you need to use `|>` pipe instead of using `%>%` pipe? – Onyambu Aug 26 '21 at 00:30
  • If at all you are going to use `|>` then avoid using `tidyverse` functions. Better use base R functions. ie `csvs |> lapply(\(x)read.csv(text=x))|> {\(x)do.call(rbind, x)}()` – Onyambu Aug 26 '21 at 00:31
  • I guess I disagree about the division. It looks like the native pipe is a good fit for tidyverse functions too. It should have [better debugging info](https://www.jumpingrivers.com/blog/new-features-r410-pipe-anonymous-functions/) and a [performance advantage in some scenarios](https://stackoverflow.com/questions/67633022/what-are-the-differences-between-rs-new-native-pipe-and-the-magrittr-pipe). And the native pipe was suggested by [two RStudio/tidyverse](https://developer.r-project.org/blosxom.cgi/R-devel/NEWS/2020/12/04) [developers](https://youtu.be/X_eDHNVceCU?t=4151). – wibeasley Aug 26 '21 at 00:45
  • What version of `readr` are you using? The latest version allows you to pass in multiple file names and have them combined already. You need to use `I()` if you want to pass literal data now. See [Reading multiple files at once](https://cran.r-project.org/web/packages/readr/news/news.html) under the 2.0 notes. – MrFlick Aug 26 '21 at 06:22
  • @MrFlick, you're right, that's an even better way. I posted this mostly to learn about the new pipes, but in this case it's nice to avoid them and have readr do it. If you post it as an answer, I'll happily upvote it. – wibeasley Aug 27 '21 at 01:59
  • @MrFlick, I just learned of one difference. When passing a vector of file paths to readr, all the incoming files need the same structure. However the `purrr::map_dfr()` is more flexible. I can pass it a `readr::cols_only()` object, and it doesn't care that the input files have different (discarded/ignored) columns. – wibeasley Sep 27 '21 at 19:31

2 Answers

Answer for Question 1 -

csvs |> 
  purrr::map_dfr(
    .f = \(k) {
      readr::read_csv(
        file      = k,
        col_types = col_types
      )
    }
  )

#      a     b     c
#  <chr> <chr> <chr>
#1     1     2     3
#2     4     5     6
#3    -1    -2    -3
wibeasley
Ronak Shah
  • Thanks, works well. (I included the effect of `col_types`, just to be consistent.) So `function(k)` is replaced by `\(k)`. Can you explain (a) why the `\(k)` is outside the curly brackets and (b) why the curly brackets aren't followed by an empty set of parentheses `()`? I'm delighted it works, but it's different from what I've learned so far about the native pipe. – wibeasley Aug 26 '21 at 02:24
  • The function that we are applying through the pipe is `map_dfr`, which already has `()`. Here we have replaced the anonymous function passed to `map_dfr` (`function(k)`) with `\(k)`, which does not require `()`. – Ronak Shah Aug 26 '21 at 02:36
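
The distinction discussed in these comments can be reproduced with base R alone. The sketch below is only an illustration: it uses `sapply()` as a stand-in for `purrr::map_dfr()`, and the names `nums`, `mapped`, `piped`, and `err` are made up for the example. A lambda passed as an argument is handed over uncalled, whereas a lambda on the right-hand side of `|>` must be written as a call, hence the trailing `()`. Adding `()` to a lambda used as an argument calls it immediately with no arguments, which is the "argument "p" is missing" error from Question 1.

```r
nums <- c(1, 2, 3)

# Lambda passed as an argument: sapply() receives the function itself
# and calls it once per element, so no trailing () is needed.
mapped <- nums |> sapply(\(k) k^2)

# Lambda on the right-hand side of |>: the pipe needs a call,
# so the lambda is wrapped and followed by ().
piped <- nums |> (\(p) p^2)()

# Lambda followed by () used as an argument: it is called with no
# arguments as soon as sapply() evaluates it, so p is missing.
err <- tryCatch(
  sapply(nums, (\(p) p^2)()),
  error = function(e) conditionMessage(e)
)
```

Here the parenthesised form `(\(p) ...)()` is used; the braces form `{\(p) ...}()` shown in the question plays the same role on the right-hand side of the pipe.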

Answer for Question 2: the inner function refers to p, which is the entire csvs vector, on every call. It therefore ignores the element it is mapping over and reads the whole vector each time, which is why every row appears twice. You can avoid that by using the .x pronoun:

csvs |> 
  {\(p)
    purrr::map_dfr(
      .x = p,
      .f = ~readr::read_csv(
        file      = I(.x),
        col_types = col_types
      )
    )
  }()
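
The scoping problem described above can be reproduced without readr. This is a minimal base R sketch, with `sapply()` standing in for `purrr::map_dfr()` and with made-up names (`vals`, `el`, `whole_each_time`, `per_element`): when the inner function refers to the outer argument `p` rather than its own argument, every call sees the whole vector.

```r
vals <- c(1, 2, 3)

# Inner function ignores its own argument `el` and uses the outer `p`,
# so each of the three calls works on the whole vector.
whole_each_time <- vals |> (\(p) sapply(p, \(el) sum(p)))()

# Inner function uses its own argument, so each call sees one element.
per_element <- vals |> (\(p) sapply(p, \(el) sum(el)))()
```

The first result repeats the total of the whole vector once per element, while the second returns the elements themselves.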

Stylistically, it might be nicer to avoid the formula mapper altogether, since you don't have any custom behavior in your function. Arguments supplied through the ... of purrr::map_dfr() are passed on to the function on each call.¹

csvs |> 
  {\(p) purrr::map_dfr(.x = p, .f = readr::read_csv, col_types = col_types)}()

Since you don't reuse the p argument, the anonymous function is also unnecessary:

csvs |> 
  purrr::map_dfr(.f = readr::read_csv, col_types = col_types)
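
As a rough base R analogy for the `...` forwarding used here (not purrr itself), `sapply()` also hands extra named arguments to the function it applies. The names `prices`, `via_dots`, and `via_lambda` are made up for this sketch.

```r
prices <- c(1.234, 5.678)

# Extra argument forwarded through `...` to round() on each call.
via_dots <- prices |> sapply(round, digits = 1)

# Equivalent version with an explicit lambda.
via_lambda <- prices |> sapply(\(x) round(x, digits = 1))
```

Both forms give the same result; the first simply avoids writing a wrapper when the wrapper would only pass arguments along.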

¹ @MrFlick is correct that I() should be used in principle if you're passing literal strings instead of file names; however, in your case you do not need it, because every string in the csvs vector contains a newline. See here for details. I leave it out of the last two examples to illustrate your alternatives.

Bob Zimmermann
  • One other style of defining your mapper is to use the functional composition tools in purrr. For example, you can generate a new function which reads the csv with characters using `read_csv_chr <- purrr::partial(readr::read_csv, col_types = col_types)`. You can then compose that together with the `I()` function to another function: `read_csv_chr_from_chr <- purrr::compose(read_csv_chr, I)`. The mapping is then simply `csvs |> purrr::map_dfr(read_csv_chr_from_chr)`, and the other functions can be reused. – Bob Zimmermann May 11 '22 at 16:49