
I have a simple csv file called "test.csv" with the following content:

colA,colB,colC
1,"x",12
2,"y",34
3,"z",56

Let's say I want to skip colA and read in only colB and colC. I want a general way to do this because I have lots of files to read in, and sometimes colA is called something else altogether, while colB and colC always have the same names.

According to the read_csv documentation, one way to accomplish this is to pass a named list for col_types and only name the columns you want to keep:

read_csv('test.csv', col_types = list(colB = col_character(), colC = col_numeric()))

By not mentioning colA it should get dropped from the output. However, the resulting data frame is:

Source: local data frame [3 x 3]

      colA colB colC
    1    1    x   12
    2    2    y   34
    3    3    z   56

Am I doing something wrong or is the read_csv documentation not correct? According to the help file:

If a list, it must contain one "collector" for each column. If you only want to read a subset of the columns, you can use a named list (where the names give the column names). If a column is not mentioned by name, it will not be included in the output.

Naftali
vergilcw
  • `data.table`'s `fread` has `drop` and `select` arguments for this purpose, for reference – MichaelChirico Jul 01 '15 at 02:17
  • @jaap, NOT a duplicate. This question is about readr::read_csv() and the other question is about utils::read.table(). – Angelo Oct 17 '17 at 14:13
  • @Angelo Yes it is. The linked question is about reading a limited number of columns. At the time that question was written, `readr::read_csv` didn't even exist. In the meantime it has been added as an answer (by me) to give alternative approaches to `read.table`/`read.csv`, and the question can therefore serve as a duplicate target. – Jaap Oct 17 '17 at 14:39
  • @Jaap, OK, but perhaps it's better to change the question and its tags when providing a canonical answer with a scope significantly greater than the original question? Or you could just answer the questions on their own terms: answer the old question in the context of utils::read.table() and this new one in the context of readr::read_*(). – Angelo Oct 17 '17 at 14:54
  • @Angelo In my opinion it is better to change the old question to broaden its scope (as I just did) because it is used as a canonical duplicate target regularly. – Jaap Oct 17 '17 at 15:08
  • @Jaap: the current accepted answer is out of date, since GitHub implies this bug was fixed in v 1.1.1 / May 2017. As such, my answer answers the canonical question. – smci May 16 '18 at 23:14
  • Also, did we really want to close on `readr` in favor of a generic question *"Only read limited number of columns in read.table/read.csv"*? `readr` is a different package. – smci May 16 '18 at 23:18
  • I'd also favor reopening this question. I was looking for a `readr` specific answer and this comes up as the top result for "read_csv ignore column r" in google. In the linked "duplicate", the readr solution is buried at the bottom of the second answer. – pgcudahy Jun 05 '19 at 10:56
  • There is now a `col_select` argument in `readr::read_csv` - see for example a newer answer in the duplicate question page https://stackoverflow.com/a/66344762/513463 – guyabel Jul 18 '23 at 06:47

2 Answers


There is an answer out there, I just didn't search hard enough: https://github.com/hadley/readr/issues/132

Apparently this was a documentation issue that has been corrected. This functionality may eventually get added but Hadley thought it was more useful to be able to just update one column type and not drop the others.

Update: The functionality has been added

The following code is from the readr documentation:

read_csv("iris.csv", col_types = cols_only( Species = col_factor(c("setosa", "versicolor", "virginica"))))

This reads only the Species column of the iris data set. With cols_only(), each column you want to keep must be given a column specification, e.g. col_factor(), col_double(), etc.
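Applied to the test.csv from the question, that looks roughly like the sketch below. It assumes a reasonably recent readr, uses col_double() for colC, and writes the sample data to a temporary file (the `path`, `kept` and `selected` names are mine, for illustration). The second call uses the `col_select` argument mentioned in the comments on the question; I'm assuming it needs readr 2.0 or later, hence the version guard.

```r
library(readr)

# Recreate the question's test.csv in a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c('colA,colB,colC', '1,"x",12', '2,"y",34', '3,"z",56'), path)

# cols_only() keeps just the columns named here. colA is never mentioned,
# so it does not matter what the first column is called in other files.
kept <- read_csv(
  path,
  col_types = cols_only(colB = col_character(), colC = col_double())
)

# Assumption: col_select is only available in readr >= 2.0.0.
# It picks columns by name and leaves the types to be guessed.
if (packageVersion("readr") >= "2.0.0") {
  selected <- read_csv(path, col_select = c(colB, colC), show_col_types = FALSE)
}
```

Either way colA never appears in the result, so the same call works for files where the first column has a different name.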

spies006
vergilcw
  • so the short correct current answer is: NO? – userJT Jul 08 '16 at 15:28
  • The answer is still "no" even after the readr 1.0 release. See https://github.com/hadley/readr/issues/194 – vergilcw Aug 16 '16 at 11:03
  • GitHub implies this bug was fixed in v 1.1.1 / May 2017. Can you confirm, and update your answer accordingly? – smci May 16 '18 at 23:13
  • An example that reads only the Species column in the iris data set `read_csv("iris.csv", col_types = cols_only( Species = col_factor(c("setosa", "versicolor", "virginica"))) )`. If you want to read only a specific column you must also pass the column specification i.e. `col_factor()`, `col_double`, etc... – spies006 Sep 13 '19 at 14:50

"According to the read_csv documentation, one way to accomplish this is to pass a named list for col_types and only name the columns you want to keep"

WRONG: read_csv('test.csv', col_types=list(colB='c', colC='c'))

No, the doc is misleading. You have to either specify that unnamed columns get dropped (class '_', i.e. col_skip()), or else explicitly specify their class as NULL:

read_csv('test.csv', col_types=list('*'='_', colB='c', colC='c'))

read_csv('test.csv', col_types=list('colA'='_', colB='c', colC='c'))
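For reference, here is a minimal sketch of two skip-based forms that I'm confident readr accepts: the compact string, where each character stands for one column by position and `_` means skip, and an explicit col_skip() for a column named in cols(). The temporary file and the `by_position` / `by_name` names are only for illustration.

```r
library(readr)

# Recreate the question's test.csv in a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c('colA,colB,colC', '1,"x",12', '2,"y",34', '3,"z",56'), path)

# Compact string: one character per column, in file order.
# "_" skips the first column whatever its name is; "c" reads as character.
by_position <- read_csv(path, col_types = "_cc")

# Named form: drop colA explicitly with col_skip()
by_name <- read_csv(
  path,
  col_types = cols(
    colA = col_skip(),
    colB = col_character(),
    colC = col_character()
  )
)
```

The positional form does not depend on what the first column is called, but it does assume the column order is the same in every file.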
smci