I'm trying to load in a small number of fields from a tab-separated file with many more unused fields using fread
in the data.table
package.
To this end, I'm using the select
option, which works great for reading in the columns.
However, when I don't specify the classes of the various fields, the automated selector doesn't work (most/all of the numeric variables end up being read as numerically tiny numbers like 1.896916e-316
).
To fix this, my first instinct was to change the code from:
DT <- fread("data.txt", select = c ("V1", "V2", ..., "Vn"))
to
DT <- fread("data.txt", select = c("V1", "V2", ..., "Vn"),
colClasses = c("numeric", ..., "character"))
i.e., to match the select
character vector with a colClasses
character vector of equal length, with (obviously) the type of the i-th selected field from select set equal to the i-th element of colClasses
.
However, fread
doesn't seem to like this--even when select
is used, colClasses
expects a character vector with as many fields as the WHOLE file:
Error in
fread("data.txt", select = c("V1", "V2", ..., "Vn",
:colClasses
is unnamed and length 25 but there are 256 columns. See?data.table
forcolClasses
usage.
This could be fine if I only had to do this with one file--I'd simply pad out the rest of the character vector with "character"
(or whatever type) because they're being tossed anyway.
However, I'm planning to repeat this process 13 times or so on files corresponding to other years--they have the same column names, but appear in perhaps different orders (and there are different numbers of columns from year to year), which ruins the loop-ability (as well as taking a lot more time).
The following worked, but hardly seems efficient (coding-wise):
DT <- fread("data.txt", select=c("V1", "V2", "V3"),
colClasses = c(V1 = "factor", V2 = "character", V3 = "numeric"))
This is a pain because I'm taking 25 columns, so it's a huge block of code being taken up by specifying the column types. I can't take advantage of rep
to save space, e.g.
colClasses = c(rep("character", times = 3), rep("numeric", times = 20))
Any suggestions for making this look/work better?
Here is a preview of the data for reference:
LEAID FIPST NAME SCHLEV AGCHRT CCDNF GSLO V33 TOTALREV TFEDREV
1: 0100002 01 ALABAMA YOUTH SERVICES N 3 1 03 0 -2 -2
2: 0100005 01 ALBERTVILLE CITY 03 3 1 PK 4143 38394000 6326000
3: 0100006 01 MARSHALL COUNTY 03 3 1 PK 5916 58482000 11617000
4: 0100007 01 HOOVER CITY 03 3 1 PK 13232 154703000 10184000
5: 0100008 01 MADISON CITY 03 3 1 PK 8479 89773000 6648000
---
18293: 5680180 56 NORTHEAST WYOMING BOCES 07 3 1 N -2 -2 -2
18294: 5680250 56 REGION V BOCES 07 3 1 N -2 -2 -2
18295: 5680251 56 WYOMING DEPARTMENT OF FAMILY SERVICES 02 3 1 KG 82 -2 -2
18296: 5680252 56 YOUTH EMERGENCY SERVICES, INC. - ADMINISTRATION OFFICE N 3 1 07 29 -1 -1
18297: 5680253 56 WYOMING BEHAVIORAL INSTITUTE N N 1 01 0 -2 -2