I am trying to use read_csv from {readr} to read a CSV file into R. To demonstrate my real issue, I first reset the argument guess_max to 5 (the default is 1000):
library(readr)
formals(read_csv)$guess_max <- 5
and take a small literal data set as an example:
csv <- I(
"ID, Col1, Col2, VarA, VarB, VarC
1, NA, NA, NA, NA, NA
2, NA, NA, NA, NA, NA
3, NA, NA, NA, NA, NA
4, NA, NA, NA, NA, NA
5, 0, 1, x, y, z
6, NA, NA, NA, NA, NA")
read_csv(csv)
# # A tibble: 6 × 6
# ID Col1 Col2 VarA VarB VarC
# <dbl> <lgl> <lgl> <lgl> <lgl> <lgl>
# 1 1 NA NA NA NA NA
# 2 2 NA NA NA NA NA
# 3 3 NA NA NA NA NA
# 4 4 NA NA NA NA NA
# 5 5 FALSE* TRUE* NA* NA* NA*
# 6 6 NA NA NA NA NA
*: parsing issues occur
Because of guess_max, only the first 5 lines (the column names and IDs 1 to 4) are used for guessing column types. Since the values for IDs 1 to 4 are all missing, every column except ID is guessed as logical, and the values in row 5 are parsed incorrectly:
0, 1 (integer) → FALSE, TRUE (logical)
'x', 'y', 'z' (character) → NA (logical)
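To check what readr actually guessed and which cells failed to parse, spec() and problems() can be called on the result. This is a minimal sketch, assuming readr 2.x; it passes guess_max = 5 directly instead of changing the formals, and the shortened csv literal is only for illustration:

```r
library(readr)

# Shortened version of the example data (illustrative only)
csv <- I("ID, Col1, VarA
1, NA, NA
2, NA, NA
3, NA, NA
4, NA, NA
5, 0, x
6, NA, NA")

df <- read_csv(csv, guess_max = 5, show_col_types = FALSE)

spec(df)      # column types readr settled on after guessing
problems(df)  # rows and columns where a value did not match the guessed type
```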
In this case I have to set col_types manually:
read_csv(csv, col_types = cols(Col1 = col_integer(), Col2 = col_integer(),
VarA = col_character(), VarB = col_character(), VarC = col_character()))
# # A tibble: 6 × 6
# ID Col1 Col2 VarA VarB VarC
# <dbl> <int> <int> <chr> <chr> <chr>
# 1 1 NA NA NA NA NA
# 2 2 NA NA NA NA NA
# 3 3 NA NA NA NA NA
# 4 4 NA NA NA NA NA
# 5 5 0 1 x y z
# 6 6 NA NA NA NA NA
Supplying the column types one by one is tedious when there are many more columns. If the names of the columns I want to specify follow a pattern, I would like to use a <tidy-select>-like syntax to assign a type to multiple columns at once, similar to across() in {dplyr}. In pseudocode:
read_csv(csv, col_types = cols(across(starts_with("Col"), col_integer()),
across(starts_with("Var"), col_character())))
Is this possible with readr itself or with an add-on package?
Thanks in advance!
Edits
I need to use the col_xxx() functions rather than their abbreviations ('i', 'c', etc.) to create the column specification, so that it works for more general cases, e.g.
cols(across(contains("Date"), col_date(format = "%m-%d-%Y")),
across(Fct1:Fct9, col_factor(levels = custom_levels)))
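For reference, the behaviour I am after can be approximated by reading only the header first and building the specification from the selected names. This is a rough sketch, not a readr feature: type_across() is a hypothetical helper I made up, it assumes readr 2.x and dplyr are installed, and it relies on read_csv(n_max = 0) returning a zero-row tibble that carries the column names:

```r
library(readr)
library(dplyr)

# Shortened version of the example data (illustrative only)
csv <- I("ID, Col1, Col2, VarA, VarB, VarC
1, NA, NA, NA, NA, NA
5, 0, 1, x, y, z")

# Zero-row tibble that only carries the column names
header <- read_csv(csv, n_max = 0, show_col_types = FALSE)

# Hypothetical helper (not part of readr): apply one collector to every
# column matched by a tidy-select expression
type_across <- function(header, selection, collector) {
  nms <- names(select(header, {{ selection }}))
  setNames(rep(list(collector), length(nms)), nms)
}

# Combine the per-pattern lists and hand them to cols() as named arguments
spec <- do.call(cols, c(
  type_across(header, starts_with("Col"), col_integer()),
  type_across(header, starts_with("Var"), col_character())
))

out <- read_csv(csv, col_types = spec)
out
```

Because the helper takes a collector object rather than an abbreviation, something like col_date(format = "%m-%d-%Y") or col_factor(levels = custom_levels) could be passed the same way. I would still prefer a built-in or packaged solution if one exists.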