165

Can anyone please tell me how to read only the first 6 months (7 columns) for each year of the data below, for example by using read.table()?

Year   Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec   
2009   -41  -27  -25  -31  -31  -39  -25  -15  -30  -27  -21  -25
2010   -41  -27  -25  -31  -31  -39  -25  -15  -30  -27  -21  -25 
2011   -21  -27   -2   -6  -10  -32  -13  -12  -27  -30  -38  -29
smci
  • 32,567
  • 20
  • 113
  • 146
StarCub
  • 4,141
  • 7
  • 41
  • 58
  • 5
    It's duplicate of [Ways to read only select columns from a file into R?](http://stackoverflow.com/q/2193742/168747), Dirk mention about `NULL` as column class [in his answer](http://stackoverflow.com/questions/2193742/ways-to-read-only-select-columns-from-a-file-into-r-a-happy-medium-between-rea/2193782#2193782). – Marek Apr 26 '11 at 10:19
  • http://stats.stackexchange.com/questions/16796/reading-only-two-out-of-three-columns-with-read-csv – Ciro Santilli OurBigBook.com Oct 13 '15 at 07:24
  • I was not implying a better / worse relationship. Furthermore there are no cross site duplicates, the inconsistent stack exchange network allows them, unless you cross post yourself :-) – Ciro Santilli OurBigBook.com May 15 '18 at 20:13

5 Answers5

165

Say the data are in file data.txt, you can use the colClasses argument of read.table() to skip columns. Here the data in the first 7 columns are "integer" and we set the remaining 6 columns to "NULL" indicating they should be skipped

> read.table("data.txt", colClasses = c(rep("integer", 7), rep("NULL", 6)), 
+            header = TRUE)
  Year Jan Feb Mar Apr May Jun
1 2009 -41 -27 -25 -31 -31 -39
2 2010 -41 -27 -25 -31 -31 -39
3 2011 -21 -27  -2  -6 -10 -32

Change "integer" to one of the accepted types as detailed in ?read.table depending on the real type of data.

data.txt looks like this:

$ cat data.txt 
"Year" "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
2009 -41 -27 -25 -31 -31 -39 -25 -15 -30 -27 -21 -25
2010 -41 -27 -25 -31 -31 -39 -25 -15 -30 -27 -21 -25
2011 -21 -27 -2 -6 -10 -32 -13 -12 -27 -30 -38 -29

and was created by using

write.table(dat, file = "data.txt", row.names = FALSE)

where dat is

dat <- structure(list(Year = 2009:2011, Jan = c(-41L, -41L, -21L), Feb = c(-27L, 
-27L, -27L), Mar = c(-25L, -25L, -2L), Apr = c(-31L, -31L, -6L
), May = c(-31L, -31L, -10L), Jun = c(-39L, -39L, -32L), Jul = c(-25L, 
-25L, -13L), Aug = c(-15L, -15L, -12L), Sep = c(-30L, -30L, -27L
), Oct = c(-27L, -27L, -30L), Nov = c(-21L, -21L, -38L), Dec = c(-25L, 
-25L, -29L)), .Names = c("Year", "Jan", "Feb", "Mar", "Apr", 
"May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), class = "data.frame",
row.names = c(NA, -3L))

If the number of columns is not known beforehand, the utility function count.fields will read through the file and count the number of fields in each line.

## returns a vector equal to the number of lines in the file
count.fields("data.txt", sep = "\t")
## returns the maximum to set colClasses
max(count.fields("data.txt", sep = "\t"))
Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
  • 1
    @Benjamin Read the first couple of lines from the file using argument `nrows`. Then work out how many columns there are using `ncol()`, or however else you want to work out the number of columns to read/ignore. Then read the full file using this info. – Gavin Simpson Nov 29 '12 at 00:42
  • argh...thats ugly but I suppose it will have to do – bdeonovic Nov 29 '12 at 00:44
  • 1
    ?? If you don't know the number of columns how else are you going to determine it without reading a bit of it to deduce how many there are? – Gavin Simpson Nov 29 '12 at 00:45
  • 1
    @BlueMagister Thanks for the edit and the mentioning of the `count.fields()` which automates the process I suggested in the comments. – Gavin Simpson Sep 20 '13 at 19:03
  • Can this principle also applied on rows like some *rowClasses*? Etc to set `NULL` on blank row. – Léo Léopold Hertz 준영 May 13 '17 at 17:03
  • 1
    @LéoLéopoldHertz준영 No, and I'm not sure how such a thing would work for row *classes* as in a data frame, whilst each column may be of a different type, each row is, by definition and as a result, unconstrained. You will need to filter out blank rows etc upon import. – Gavin Simpson May 15 '17 at 18:53
  • Is it not possible to limit `count.fields()` to scan n rows?. Looks like it reads through the whole table, every row. – mindlessgreen Jan 22 '18 at 21:42
  • 1
    @rmf you can pass `count.fields()` a text connection, so, read some subset of the rows using `txt <- readLines(....)`, then create a connection to the read in lines `con <- textConnection(txt)`, then do `count.fields(txt)`. Be sure to use `skip` in `count.fields()` to skip the header row if there is one; you can't skip rows in the file using `readLines()`. – Gavin Simpson Jan 22 '18 at 21:57
  • I am trying to use read.table to only read columns whose name is also included in a vector ```labels <- c("a", "b","c"...)```. My issue is that I am reading more than one .txt file, and each file has *some* of the labels in the vector. would there be a way to use read.table and ```%in%``` to only read the labels that match with the ones in each .txt file?. – Pablo Herreros Cantis Apr 29 '20 at 18:55
119

To read a specific set of columns from a dataset you, there are several other options:

1) With freadfrom the data.table-package:

You can specify the desired columns with the select parameter from fread from the data.table package. You can specify the columns with a vector of column names or column numbers.

For the example dataset:

library(data.table)
dat <- fread("data.txt", select = c("Year","Jan","Feb","Mar","Apr","May","Jun"))
dat <- fread("data.txt", select = c(1:7))

Alternatively, you can use the drop parameter to indicate which columns should not be read:

dat <- fread("data.txt", drop = c("Jul","Aug","Sep","Oct","Nov","Dec"))
dat <- fread("data.txt", drop = c(8:13))

All result in:

> data
  Year Jan Feb Mar Apr May Jun
1 2009 -41 -27 -25 -31 -31 -39
2 2010 -41 -27 -25 -31 -31 -39
3 2011 -21 -27  -2  -6 -10 -32

UPDATE: When you don't want fread to return a data.table, use the data.table = FALSE-parameter, e.g.: fread("data.txt", select = c(1:7), data.table = FALSE)

2) With read.csv.sql from the sqldf-package:

Another alternative is the read.csv.sql function from the sqldf package:

library(sqldf)
dat <- read.csv.sql("data.txt",
                    sql = "select Year,Jan,Feb,Mar,Apr,May,Jun from file",
                    sep = "\t")

3) With the read_*-functions from the readr-package:

library(readr)
dat <- read_table("data.txt",
                  col_types = cols_only(Year = 'i', Jan = 'i', Feb = 'i', Mar = 'i',
                                        Apr = 'i', May = 'i', Jun = 'i'))
dat <- read_table("data.txt",
                  col_types = list(Jul = col_skip(), Aug = col_skip(), Sep = col_skip(),
                                   Oct = col_skip(), Nov = col_skip(), Dec = col_skip()))
dat <- read_table("data.txt", col_types = 'iiiiiii______')

From the documentation an explanation for the used characters with col_types:

each character represents one column: c = character, i = integer, n = number, d = double, l = logical, D = date, T = date time, t = time, ? = guess, or _/- to skip the column

Jaap
  • 81,064
  • 34
  • 182
  • 193
  • `fread` does not support compressed files, however. Large files are usually compressed. – CoderGuy123 Feb 28 '16 at 09:03
  • There is a [feature request](https://github.com/Rdatatable/data.table/issues/717) for enabling this in `fread`. Worth noticing is that `fread` will highly probably read the uncompressed file considerably faster than `read.table` will read the compressed file. [See here for an example](http://r.789695.n4.nabble.com/fread-on-gzipped-files-td4663116.html#a4663469). – Jaap Feb 28 '16 at 10:44
  • Some uncompressed files are too large. E.g. I'm working with 1000 Genomes files. They can be 60 GB uncompressed. – CoderGuy123 Feb 28 '16 at 10:45
  • 1
    As you probably know, R reads the data in memory. Whether you read the zipped file or the unzipped file doesn't make a difference on the size of the resulting data in memory. If you have 60GB in files, `read.table` won't save you. In that case, you might want to look at the `ff`-package. – Jaap Apr 20 '16 at 08:42
  • I am trying only to load a few lines a time. The problem is that `read.table()` spends all its time searching for the line to start reading, making it useless for reading large files in increments. They need to fix that function. I had to resort to a non-R solution [Plink](https://www.cog-genomics.org/plink2/) for this problem. Pity. – CoderGuy123 Apr 21 '16 at 05:12
  • FYI: the question is about reading a limited number of columns, not about reading large files. If you have a problem with reading large files, you can [post a new question](http://stackoverflow.com/questions/ask). – Jaap Apr 21 '16 at 06:26
  • 3
    @Deleet You could use `fread` to read large compressed files like this: `fread("gunzip -c data.txt.gz", drop = c(8:13))`. – arekolek Jun 23 '16 at 09:41
  • @Procrastinatus Maximus I would like to echo Deleet's comments w.r.t. to genomics data files. These are often stored compressed to save disk space, but also tend to compress well so that reading it off networked storage then decrypting is faster than reading uncompressed off disk. – pufferfish Oct 28 '16 at 17:40
  • Did `readr` ever add a feature to say (a) guess everything, but (b) only read in a list of column names? I don't want to say `"a" = col_guess(), "b" = col_guess(), ...` I just want to define `vars <- c("a", "b", ...)` and then call it as `col_types = cols_only(vars)` – Mark White Sep 30 '17 at 19:08
  • @MarkWhite see the last possibility of `readr`; based on that you could also use `col_types = '???????______'` which guess the first 7 columns and exclude the others; if you don't specify anything, `readr` will guess the column types for you (so yes, `readr` will guess everything when you don't specify something) – Jaap Sep 30 '17 at 21:32
  • @Jaap but that requires me to know the positions of the variables in the data, which I do not know until *after* I read the data in. I found this solution, which works well for me: https://stackoverflow.com/questions/43901143/readrread-csv-pass-vector-of-character-column-names-to-import – Mark White Sep 30 '17 at 21:35
  • @MarkWhite I think for that situation `fread` works much better; see the first part of my answer. – Jaap Sep 30 '17 at 21:39
  • @Jaap agreed. Yet that function parses NAs strangely when there is conflicts in guessing, so I like that solution I linked to most, imo. It should be a part of the function itself, probably – Mark White Sep 30 '17 at 21:45
  • @MarkWhite If you think `fread` parses NAs strangely when there are conflicts in guessing (which I've never encountered), please report it on [the Github-page](https://github.com/Rdatatable/data.table) (i.e. post a 'new issue'). – Jaap Sep 30 '17 at 21:49
  • @Jaap I am trying to use read.table to only read columns whose name is also included in a vector ```labels <- c("a", "b","c"...)```. My issue is that I am reading more than one .txt file, and each file has *some* of the labels in the vector. would there be a way to use ```read.table``` or ```fread``` and ```%in%``` to only read the labels that match with the ones in each .txt file?. – Pablo Herreros Cantis Apr 29 '20 at 19:00
  • @Jaap edit - I just tried it and it seems like there is no need to specify much - the function read all the columns matching the vector, and returned no error for those inexistent. Wonderful! – Pablo Herreros Cantis Apr 29 '20 at 19:08
9

You could also use JDBC to achieve this. Let's create a sample csv file.

write.table(x=mtcars, file="mtcars.csv", sep=",", row.names=F, col.names=T) # create example csv file

Download and save the the CSV JDBC driver from this link: http://sourceforge.net/projects/csvjdbc/files/latest/download

> library(RJDBC)

> path.to.jdbc.driver <- "jdbc//csvjdbc-1.0-18.jar"
> drv <- JDBC("org.relique.jdbc.csv.CsvDriver", path.to.jdbc.driver)
> conn <- dbConnect(drv, sprintf("jdbc:relique:csv:%s", getwd()))

> head(dbGetQuery(conn, "select * from mtcars"), 3)
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1   21   6  160 110  3.9  2.62 16.46  0  1    4    4
2   21   6  160 110  3.9 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85  2.32 18.61  1  1    4    1

> head(dbGetQuery(conn, "select mpg, gear from mtcars"), 3)
   MPG GEAR
1   21    4
2   21    4
3 22.8    4
Rahul Premraj
  • 1,595
  • 14
  • 13
5

The vroom package provides a 'tidy' method of selecting / dropping columns by name during import. Docs: https://www.tidyverse.org/blog/2019/05/vroom-1-0-0/#column-selection

Column selection (col_select)

The vroom argument 'col_select' makes selecting columns to keep (or omit) more straightforward. The interface for col_select is the same as dplyr::select().

Select columns by name
data <- vroom("flights.tsv", col_select = c(year, flight, tailnum))
#> Observations: 336,776
#> Variables: 3
#> chr [1]: tailnum
#> dbl [2]: year, flight
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
Drop columns by name
data <- vroom("flights.tsv", col_select = c(-dep_time, -air_time:-time_hour))
#> Observations: 336,776
#> Variables: 13
#> chr [4]: carrier, tailnum, origin, dest
#> dbl [9]: year, month, day, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr...
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
Use the selection helpers
data <- vroom("flights.tsv", col_select = ends_with("time"))
#> Observations: 336,776
#> Variables: 5
#> dbl [5]: dep_time, sched_dep_time, arr_time, sched_arr_time, air_time
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
Or rename columns by name
data <- vroom("flights.tsv", col_select = list(plane = tailnum, everything()))
#> Observations: 336,776
#> Variables: 19
#> chr  [ 4]: carrier, tailnum, origin, dest
#> dbl  [14]: year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr...
#> dttm [ 1]: time_hour
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
data
#> # A tibble: 336,776 x 19
#>    plane  year month   day dep_time sched_dep_time dep_delay arr_time
#>    <chr> <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
#>  1 N142…  2013     1     1      517            515         2      830
#>  2 N242…  2013     1     1      533            529         4      850
#>  3 N619…  2013     1     1      542            540         2      923
#>  4 N804…  2013     1     1      544            545        -1     1004
#>  5 N668…  2013     1     1      554            600        -6      812
#>  6 N394…  2013     1     1      554            558        -4      740
#>  7 N516…  2013     1     1      555            600        -5      913
#>  8 N829…  2013     1     1      557            600        -3      709
#>  9 N593…  2013     1     1      557            600        -3      838
#> 10 N3AL…  2013     1     1      558            600        -2      753
#> # … with 336,766 more rows, and 11 more variables: sched_arr_time <dbl>,
#> #   arr_delay <dbl>, carrier <chr>, flight <dbl>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
jared_mamrot
  • 22,354
  • 4
  • 21
  • 46
0

You do it like this:

df = read.table("file.txt", nrows=1, header=TRUE, sep="\t", stringsAsFactors=FALSE)
colClasses = as.list(apply(df, 2, class))
needCols = c("Year", "Jan", "Feb", "Mar", "Apr", "May", "Jun")
colClasses[!names(colClasses) %in% needCols] = list(NULL)
df = read.table("file.txt", header=TRUE, colClasses=colClasses, sep="\t", stringsAsFactors=FALSE)
tedtoal
  • 1,030
  • 1
  • 10
  • 22