3

I'm trying to load in a small number of fields from a tab-separated file with many more unused fields using fread in the data.table package.

To this end, I'm using the select option, which works great for reading in the columns.

However, when I don't specify the classes of the various fields, the automated selector doesn't work (most/all of the numeric variables end up being read as numerically tiny numbers like 1.896916e-316).

To fix this, my first instinct was to change the code from:

DT <- fread("data.txt", select = c ("V1", "V2", ..., "Vn"))

to

DT <- fread("data.txt", select = c("V1", "V2", ..., "Vn"),
            colClasses = c("numeric", ..., "character"))

i.e., to match the select character vector with a colClasses character vector of equal length, with (obviously) the type of the i-th selected field from select set equal to the i-th element of colClasses.

However, fread doesn't seem to like this--even when select is used, colClasses expects a character vector with as many fields as the WHOLE file:

Error in fread("data.txt", select = c("V1", "V2", ..., "Vn", : colClasses is unnamed and length 25 but there are 256 columns. See ?data.table for colClasses usage.

This could be fine if I only had to do this with one file--I'd simply pad out the rest of the character vector with "character" (or whatever type) because they're being tossed anyway.

However, I'm planning to repeat this process 13 times or so on files corresponding to other years--they have the same column names, but appear in perhaps different orders (and there are different numbers of columns from year to year), which ruins the loop-ability (as well as taking a lot more time).

The following worked, but hardly seems efficient (coding-wise):

DT <- fread("data.txt", select=c("V1", "V2", "V3"),
            colClasses = c(V1 = "factor", V2 = "character", V3 = "numeric"))

This is a pain because I'm taking 25 columns, so it's a huge block of code being taken up by specifying the column types. I can't take advantage of rep to save space, e.g.

colClasses = c(rep("character", times = 3), rep("numeric", times = 20))

Any suggestions for making this look/work better?

Here is a preview of the data for reference:

         LEAID FIPST                                                   NAME SCHLEV AGCHRT CCDNF GSLO   V33  TOTALREV  TFEDREV
    1: 0100002    01                                 ALABAMA YOUTH SERVICES      N      3     1   03     0        -2       -2
    2: 0100005    01                                       ALBERTVILLE CITY     03      3     1   PK  4143  38394000  6326000
    3: 0100006    01                                        MARSHALL COUNTY     03      3     1   PK  5916  58482000 11617000
    4: 0100007    01                                            HOOVER CITY     03      3     1   PK 13232 154703000 10184000
    5: 0100008    01                                           MADISON CITY     03      3     1   PK  8479  89773000  6648000
---                                                                                                                       
18293: 5680180    56                                NORTHEAST WYOMING BOCES     07      3     1    N    -2        -2       -2
18294: 5680250    56                                         REGION V BOCES     07      3     1    N    -2        -2       -2
18295: 5680251    56                  WYOMING DEPARTMENT OF FAMILY SERVICES     02      3     1   KG    82        -2       -2
18296: 5680252    56 YOUTH EMERGENCY SERVICES, INC. - ADMINISTRATION OFFICE      N      3     1   07    29        -1       -1
18297: 5680253    56                           WYOMING BEHAVIORAL INSTITUTE      N      N     1   01     0        -2       -2
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
  • I thought there was a mechanism for "NULL" arguments in colClasses, thereby obviating the need for select. – IRTFM Sep 05 '14 at 18:16

2 Answers2

3

Actually found a solution in a more careful reading of this illustration of the drop/select/colClasses options by Mr. Dowle:

DT <- fread("data.txt", select = c("V1", "V2", "V3"),
            colClasses = list(character = c("char_names"),
                              factor = c("factor_names"),
                              numeric = c("numeric_names")))

I didn't realize this before because there were some other problems with my fread attempts due to bad formatting of my .csv file.

Still, I am wont to call it a bug that the natural approach doesn't work:

DT <- fread("data.txt", select = c("V1", ..., "Vn"),
            colClasses = c("type1", ..., "typen"))
Community
  • 1
  • 1
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
1

Perhaps something along these lines:

 varnames <- readLines(file='filename.txt', n=1)
 valid <- c("LEAID", "FIPST", "NAME", "SCHLEV", "AGCHRT", "CCDNF", "GSLO", "V33", "TOTALREV", "TFEDREV")
 colC <- varnames %in% valid 
 colCchar <- colC
 colCchar[!colC] <-"NULL"
 colCchar[colC] <- c( rep("numeric", 2), rep("character",2),  
                      rep("numeric", 2), "character",
                      rep("numeric", 3) )
 dt<-fread("data.txt", colClasses=colCchar)

Obviously untested since the 200+ first line was not provided. It won't be stable to variation in order of variables in the targets, but your problem description did "leave something to be desired". I cannot quite figure out how the column names would be the same but would vary. You may need to use match to get the order of the desired variables.

IRTFM
  • 258,963
  • 21
  • 364
  • 487