I can't find what I need in other posts. Essentially:

  1. I need to reorder my data after reading it in as a data.table (I can't pass colClasses to fread because the columns in my file are out of order).
  2. I need to change the column classes to the ones listed below.

A lot of the other posts seem to be changing all of one type of class to another type of class:

Change the class of many columns in a data frame

Convert column classes in data.table

I believe my problem is different because there is no "change all factors to characters" etc. Each column has a specific class that I must change to ahead of time.

I have my column names in a vector called selectColumns that I pass to fread.

selectColumns <- c(giantListofColumnsGoesHere)
DT <- fread("DT.csv", select=selectColumns, na.strings=NAsList)

setcolorder(DT, selectColumns)
colClasses <- list('character','character','character','factor','numeric','character','numeric','integer','integer','integer','integer','numeric','numeric','factor','factor','factor','logical','integer','numeric','factor','integer','integer','integer','factor','factor','factor','factor','factor','integer','integer','factor','integer','factor','factor','integer','factor','numeric','factor','numeric','character','factor','factor','factor','factor','factor','factor','factor','factor','factor','factor','integer','factor','numeric','factor','factor','character','factor','factor','factor','integer','numeric','integer','integer','integer','integer','integer','factor','character','factor','factor','factor','factor','integer','factor','factor','character','integer','integer','integer','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical','logical')

#Now the part I can't figure out, I've tried:
lapply(DT, class) <- colClasses
#OR
attr(DT, class) <- colClasses
#Obviously attr(DT, class) just gives "data.table" "data.frame"

I think I need to get at the class attribute of each individual column rather than of the DT as a whole, but I'm not great with lists and can't figure out how. I'm sorry if this is too easy a question or has essentially been answered already, but I'm lost, and it seems like there is usually an easy way to do this.

I'm sorry I can't provide the data; it contains private information.

Thanks for any help everyone.

Factuary

1 Answer

If the OP forgot to use colClasses inside fread, or if there is some technical difficulty in using it, and the column classes of the data.table need to be changed afterwards, set is an option:

for(j in seq_along(selectColumns)){
     # look up the conversion function by name and replace the column by reference
     set(DT, i = NULL, j = selectColumns[j], value = get(colClasses[[j]])(DT[[selectColumns[j]]]))
 }

str(DT)
#Classes ‘data.table’ and 'data.frame':  5 obs. of  6 variables:
#$ V1: num  1 2 3 4 5
#$ V2: chr  "A" "B" "C" "D" ...
#$ V3: int  1 2 3 4 5
#$ V4: chr  "F" "G" "H" "I" ...
#$ V5: chr  "G" "H" "I" "J" ...
#$ V6: Factor w/ 5 levels "6","7","8","9",..: 1 2 3 4 5

Note that the initial classes of the "selectColumns" columns were:

str(DT)
#Classes ‘data.table’ and 'data.frame':  5 obs. of  6 variables:
#$ V1: int  1 2 3 4 5
#$ V2: chr  "A" "B" "C" "D" ...
#$ V3: num  1 2 3 4 5
#$ V4: chr  "F" "G" "H" "I" ...
#$ V5: chr  "G" "H" "I" "J" ...
#$ V6: int  6 7 8 9 10

data

 library(data.table)
 DT <- data.table(V1 = 1:5, V2 = LETTERS[1:5], V3 = as.numeric(1:5),
          V4 = LETTERS[6:10], V5 = LETTERS[7:11], V6 = 6:10)
 colClasses <- paste0("as.", c("numeric", "integer", "factor"))
 selectColumns <- c("V1", "V3", "V6")

NOTE: The "as." prefix was added to each entry of "colClasses" so that it names a conversion function (the OP's "colClasses" does not have it). To convert a 'factor' to 'numeric', do it in two steps: first convert to 'character', then to 'numeric' (based on @Frank's suggestion in the comments).
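
To make the note above concrete, here is a minimal sketch that starts from plain class names like the OP's (no "as." prefix) and routes factor columns through character before converting. The toy table and its column names are made up for illustration, and match.fun is used here in place of get; treat this as one possible way to do it, not the only one.

```r
library(data.table)

# Hypothetical toy data; column names and values are made up for illustration.
DT <- data.table(a = 1:3,
                 b = factor(c("10", "20", "10")),
                 d = c("x", "y", "x"))
selectColumns <- c("a", "b", "d")
# Plain class names, as in the question (no "as." prefix).
colClasses <- list("numeric", "numeric", "factor")

for (j in seq_along(selectColumns)) {
  col <- selectColumns[j]
  x <- DT[[col]]
  # A factor stores integer codes, so as.numeric() on it returns the codes.
  # Converting to character first keeps the labels.
  if (is.factor(x)) x <- as.character(x)
  # Build the converter name ("as.numeric", "as.factor", ...) and look it up.
  convert <- match.fun(paste0("as.", colClasses[[j]]))
  # Replace the whole column by reference.
  set(DT, j = col, value = convert(x))
}

str(DT)
```

Without the as.character step, column b would end up holding the level codes 1, 2, 1 rather than 10, 20, 10.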

akrun
  • You might want to emphasize the prefix you've added to colClasses (which the OP does not have) and to correct the separator (I think you forgot the dot after "as"). Maybe also worth warning about conversion of factors (since you often want to coerce to character before numeric or integer). – Frank Apr 20 '16 at 03:23
  • Thank you very much for your replies. I will try this solution at work tomorrow. Is there a course or a thorough book on learning how to become better with data.table and manipulating data with it? What about converting factor to logical? Is there any point in doing that or would factor work effectively the same way? Many of my columns (near the end) come through in my data as 'Y' or 'N', they are indicators, so TRUE or FALSE would be their real values. – Factuary Apr 20 '16 at 05:12
  • @user6020651 For converting factor to logical, it can be directly done by `==`. For example `factor(c("Yes", "No")) =="Yes" #[1] TRUE FALSE`. Regarding the `data.table` courses, you can check the courses offered by [datacamp](https://www.datacamp.com/courses) – akrun Apr 20 '16 at 05:16
  • Thanks for all the help @akrun. Just to follow up, I tried this: `for(j in seq_along(logicalColumns)){ set(DT, i=NULL, j=logicalColumns[j], value=factor(DT[[logicalColumns[j]]], c("Y","N"),c(TRUE,FALSE))) }` That converted all the Y's to TRUE and N's to FALSE, but they are still factors, so I tried to switch them to logical: `for(j in seq_along(logicalColumns)){ set(DT, i=NULL, j=logicalColumns[j], value=get(logicalColClasses[j])(DT[[logicalColumns[j]]])) }` but it gives me errors like: `Can't assign to column 'ABST_DSC_IND' (type 'factor') a value of type 'logical' (not character, factor, integer or numeric)` – Factuary Apr 22 '16 at 01:11
  • First this: `for(j in seq_along(logicalColumns)){ set(DT, i=NULL, j=logicalColumns[j], value=factor(DT[[logicalColumns[j]]], c("Y","N"),c(TRUE,FALSE))) }` It was factors though so I tried to change using this: `for(j in seq_along(logicalColumns)){ set(DT, i=NULL, j=logicalColumns[j], value=get(logicalColClasses[j])(DT[[logicalColumns[j]]])) } ` – Factuary Apr 22 '16 at 01:20
  • @Factuary If it is a `factor` column, convert that to `character` class first, and then do the assignment, i.e. `DT1 <- DT[, lapply(.SD, function(x) if(is.factor(x)) as.character(x) else x)]` – akrun Apr 22 '16 at 01:22
  • @akrun, so I decided to work through the data.table course you recommended last night. I had no idea you're a main contributor to the package and someone who developed that course! Thank you, it's fantastic! Is what you showed me there the computationally fastest way to do that? My data set will be roughly 16 GB and 22 million rows. – Factuary Apr 22 '16 at 15:41
  • @Factuary I am not a contributor to the package. It is Arun. The `set` method should be very fast. – akrun Apr 22 '16 at 15:42
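
Putting the comment thread together, the Y/N indicator conversion could be sketched as below. The table and column names are hypothetical. Following the advice above, each factor column is first converted to character and then compared with "Y" to get a logical column.

```r
library(data.table)

# Hypothetical indicator columns; the names are made up for illustration.
DT <- data.table(id = 1:3,
                 IND_A = factor(c("Y", "N", "Y")),
                 IND_B = factor(c("N", "N", "Y")))
logicalColumns <- c("IND_A", "IND_B")

for (col in logicalColumns) {
  # Step 1: factor -> character, as suggested in the comments.
  set(DT, j = col, value = as.character(DT[[col]]))
  # Step 2: comparing with "Y" gives TRUE/FALSE directly (NA stays NA).
  set(DT, j = col, value = DT[[col]] == "Y")
}

str(DT)
```

Both steps use set, so the columns are replaced by reference rather than copying the whole table.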