I am trying to read a CSV file into R with read_csv() from {readr}. To demonstrate the issue, I first set the default of the guess_max argument to 5 (the original default is 1000):

library(readr)
formals(read_csv)$guess_max <- 5

and use a small literal dataset as an example:

csv <- I(
"ID, Col1, Col2, VarA, VarB, VarC
1, NA, NA, NA, NA, NA
2, NA, NA, NA, NA, NA
3, NA, NA, NA, NA, NA
4, NA, NA, NA, NA, NA
5, 0, 1, x, y, z
6, NA, NA, NA, NA, NA")

read_csv(csv)

# # A tibble: 6 × 6
#      ID  Col1    Col2    VarA   VarB   VarC 
#   <dbl>  <lgl>   <lgl>   <lgl>  <lgl>  <lgl>
# 1     1  NA      NA      NA     NA     NA   
# 2     2  NA      NA      NA     NA     NA   
# 3     3  NA      NA      NA     NA     NA   
# 4     4  NA      NA      NA     NA     NA   
# 5     5  FALSE*  TRUE*   NA*    NA*    NA*
# 6     6  NA      NA      NA     NA     NA

*: parsing issues occur


Because of guess_max, only the first 5 lines (the column names and IDs 1 to 4) are used to guess the column types. Since the values for IDs 1 to 4 are all missing, every column except ID is guessed as logical, and the values in row 5 are parsed incorrectly:

  • 0, 1 (integer) → FALSE, TRUE (logical)
  • 'x', 'y', 'z' (character) → NA (logical)
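One way to check that the misparse comes from type guessing rather than from the data itself is to skip guessing at read time. The following is only a minimal sketch: it reads every column as character with cols(.default = col_character()) and then lets type_convert() re-guess the types from the full columns. The object names raw and converted are introduced here for illustration.

```r
library(readr)

csv <- I(
"ID, Col1, Col2, VarA, VarB, VarC
1, NA, NA, NA, NA, NA
2, NA, NA, NA, NA, NA
3, NA, NA, NA, NA, NA
4, NA, NA, NA, NA, NA
5, 0, 1, x, y, z
6, NA, NA, NA, NA, NA")

# Read every column as character so that no type is guessed at read time
raw <- read_csv(csv, col_types = cols(.default = col_character()))

# Re-guess the types from the complete character columns
converted <- type_convert(raw)

# Inspect the resulting column types
str(converted)
```

With this approach the row-5 values should survive, but the guessed types are still readr's choice (numeric columns are guessed as double by default, not integer), so it does not replace an explicit col_types specification.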

In this case I have to set col_types manually:

read_csv(csv, col_types = cols(Col1 = col_integer(), Col2 = col_integer(),
                               VarA = col_character(), VarB = col_character(), VarC = col_character()))

# # A tibble: 6 × 6                                                                                                   
#      ID  Col1  Col2 VarA  VarB  VarC 
#   <dbl> <int> <int> <chr> <chr> <chr>
# 1     1    NA    NA NA    NA    NA   
# 2     2    NA    NA NA    NA    NA   
# 3     3    NA    NA NA    NA    NA   
# 4     4    NA    NA NA    NA    NA   
# 5     5     0     1 x     y     z    
# 6     6    NA    NA NA    NA    NA

Supplying the column types one by one is tedious when there are many more columns. If the names of the columns I want to specify follow some pattern, I would like to use a <tidy-select>-like syntax to specify one type across multiple columns, similar to across() in {dplyr}. In pseudocode:

read_csv(csv, col_types = cols(across(starts_with("Col"), col_integer()),
                               across(starts_with("Var"), col_character())))

Is this possible with readr itself or with an add-on package?

Thanks in advance!


Edits

I need to use the col_xxx() functions rather than their abbreviations ('i', 'c', etc.) so that the column specification covers more general cases, e.g.

cols(across(contains("Date"), col_date(format = "%m-%d-%Y")),
     across(Fct1:Fct9, col_factor(levels = custom_levels)))
Darren Tsai
user18894435
    [This](https://stackoverflow.com/questions/73642144/is-it-possible-to-use-tidyselect-helpers-with-the-cols-only-function) might be helpful – one Mar 08 '23 at 16:17
    @OP Could you point out why the linked answer fails to meet your needs? It seems to do exactly what you are asking for. Stating your expectations clearly will substantially improve your chance of getting a good answer – shs Mar 10 '23 at 16:48
    @peter861222 the link you found is helpful, but it does not use tidy-selection, and it uses if...else... to specify column types, so I would need to adjust many places whenever I read a new file. I want a more systematic and functional approach. – user18894435 Mar 14 '23 at 12:53

1 Answer


The read_delim() family already supports tidy-selection of columns through the col_select argument. You can take advantage of this argument to bring tidy-selection into the specification of column types. The following is a simple implementation. The key is to set n_max = 0L so that only the line of column names is read.
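As a minimal sketch of that idea on its own, the call below reads only the header of the question's example data and keeps the columns matched by a tidy-selection helper. The object name header_only is introduced here for illustration.

```r
library(readr)

csv <- I(
"ID, Col1, Col2, VarA, VarB, VarC
1, NA, NA, NA, NA, NA
2, NA, NA, NA, NA, NA
3, NA, NA, NA, NA, NA
4, NA, NA, NA, NA, NA
5, 0, 1, x, y, z
6, NA, NA, NA, NA, NA")

# n_max = 0L reads no data rows, so only the column names are parsed;
# col_select then keeps the columns matched by the tidy-selection
header_only <- read_csv(csv, n_max = 0L,
                        col_select = starts_with("Col"),
                        show_col_types = FALSE)

# The result is a zero-row tibble whose names are the selected columns
names(header_only)
nrow(header_only)
```

Only the names of this zero-row tibble matter: they become the names of the list of collectors that is passed to col_types.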

Version 1

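# Build a named list of collectors for the columns matched by a tidy-selection
col_across <- function(.cols, .fns, file) {
  # Read only the header (n_max = 0L), keeping the columns selected by .cols
  col_selected <- read_csv(file, n_max = 0L, col_select = {{.cols}}, show_col_types = FALSE)
  # Assign the same collector to every selected column
  lapply(col_selected, function(x) .fns)
}
# c() merges the named lists into a single column specification
df <- read_csv(csv, col_types = c(col_across(starts_with("Col"), col_integer(), csv),
                                  col_across(VarA:VarC, col_factor(c('x', 'y', 'z')), csv)))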

This approach is simple but has some drawbacks:

  1. The same file source (i.e. object csv) needs to be passed to every col_across().
  2. The read_delim() family includes several variants, e.g. read_csv(), read_csv2(), and read_tsv(). When calling df <- read_xxx(...), you have to make sure that col_across() uses the same read_xxx() to read the column names.
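Version 1 also covers the col_date() case mentioned in the question's edit. The sketch below repeats the Version 1 definition so that it runs on its own; the data csv2 and its column names StartDate and EndDate are made up for illustration.

```r
library(readr)

# Version 1 of col_across(), repeated here so the example is self-contained
col_across <- function(.cols, .fns, file) {
  col_selected <- read_csv(file, n_max = 0L, col_select = {{ .cols }},
                           show_col_types = FALSE)
  lapply(col_selected, function(x) .fns)
}

# Hypothetical data with dates stored as month-day-year
csv2 <- I("ID,StartDate,EndDate\n1,01-31-2023,02-28-2023\n2,NA,NA")

# A named list with one date collector per column whose name contains "Date"
types <- col_across(contains("Date"), col_date(format = "%m-%d-%Y"), csv2)

# Columns that are not named in the list (here ID) are still guessed
df2 <- read_csv(csv2, col_types = types)
str(df2)
```

Because col_across() returns a plain named list, several selections can be combined with c() before being passed to col_types, as in the Version 1 call above.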

Version 2

An improved version of col_across() automatically detects which read_xxx() is being called and retrieves the file source from the outer call.

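col_across <- function(.cols, .fns) {
  # sys.call(1L) is the outermost call on the stack, which is the
  # read_xxx() call when it is made at top level
  sc <- sys.call(1L)
  # Expand the call so that every argument is matched by name
  sc <- match.call(match.fun(sc[[1L]]), sc)
  # Keep only the function and its `file` argument
  read_call <- sc[c(1L, match("file", names(sc), 0L))]
  # Read just the header, restricted to the selected columns
  read_call$n_max <- 0L
  read_call$col_select <- substitute(.cols)
  read_call$show_col_types <- FALSE
  # Evaluate the header-only call and assign .fns to every selected column
  lapply(eval(read_call, parent.frame()), function(x) .fns)
}
df <- read_csv(csv, col_types = c(col_across(starts_with("Col"), col_integer()),
                                  col_across(VarA:VarC, col_factor(c('x', 'y', 'z')))))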

Note that this version of col_across() can only be used inside a call to a function of the read_delim() family, just as across() can only be used inside dplyr verbs such as mutate().


Examine the column specifications

spec(df)

# cols(
#   ID = col_double(),
#   Col1 = col_integer(),
#   Col2 = col_integer(),
#   VarA = col_factor(levels = c("x", "y", "z"), ordered = FALSE, include_na = FALSE),
#   VarB = col_factor(levels = c("x", "y", "z"), ordered = FALSE, include_na = FALSE),
#   VarC = col_factor(levels = c("x", "y", "z"), ordered = FALSE, include_na = FALSE)
# )
Darren Tsai