
I have been tearing my hair out over this for the last hour. The following code was working perfectly a couple of hours ago, and now I have no idea why it no longer does. I have searched for other questions about the "undefined columns selected" error, and I think I have accounted for everything in those answers. I am sure there is some tiny thing I have overlooked or accidentally left in, but I can't see it!

I have a data frame with both factor and numeric variables. I want to subset it so that I keep all of the factor variables and remove the numeric variables whose column mean is < 0.1.

I found the following code in another question on Stack Overflow. Slightly modified, it worked well on my test data (a smaller sub-dataset I am using for testing before trying the code on a big 3 GB object):

meanfunction01 <- function(x){
    if(is.numeric(x)){
        mean(x) > 0.1
    } else {
        TRUE
    }
}

#then apply function to data table
Zdata <- Data1[,sapply(Data1,  meanfunction01)]

I swear I was using this a few hours ago, but when I came back to it and tried to use it again, it had stopped working and now just returns the following error:

Error in `[.data.frame`(Data1, , sapply(Data1, meanfunction01)) : 
  undefined columns selected

I was trying to modify the function so that it would loop over multiple objects (I have 54 objects I want to apply it to, and didn't want to type them all manually), but I don't think I edited the original function, and now it has stopped working.

A brief str() of my data:

> str(Data1[1:10])
'data.frame':   11 obs. of  10 variables:
 $ Name               : Factor w/ 11688 levels "GTEX-1117F-0226-SM-5GZZ7",..: 8186 8242 8262 8270 8343 8388 8403 8621 8689 8709 ...
 $ SEX                : Factor w/ 2 levels "Female","Male": 1 2 2 1 1 2 2 1 2 1 ...
 $ AGE                : Factor w/ 6 levels "20-29","30-39",..: 4 4 1 3 3 1 3 3 3 2 ...
 $ CIRCUMSTANCES: Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Tissue.x           : Factor w/ 53 levels "Adipose_Subcutaneous",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ ENSG00000223972.4  : num  0 0.0701 0.0339 0.1149 0.0549 ...
 $ ENSG00000227232.4  : num  12.5 17.2 13.1 16 15.7 ...
 $ ENSG00000243485.2  : num  0.0717 0 0.1508 0 0.061 ...
 $ ENSG00000237613.2  : num  0 0.0654 0 0.0402 0.0768 ...
 $ ENSG00000268020.2  : num  0 0.0421 0.0611 0 0 ...
Phil D
  • It will be very difficult to guess what the problem might be without a working example that demonstrates the issue. You may find some helpful tips in https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Ista Nov 22 '18 at 16:17
  • Ok, in trying to head() my data in order to create an example dataset for you, I think I have narrowed down the problem. When I re-loaded the object into my environment, it seems to have categorised some of my columns as integer rather than numeric. When I run the code on the head([1:20]) subset it works fine, as the integer columns only start appearing later in the data, around column 10000 or so. Now I am trying to figure out how to recategorise these columns as numeric instead, which is a different problem entirely. Thanks anyway! – Phil D Nov 22 '18 at 16:52
  • Why just suspect that the data structure changes as you move along your multiple columns? Why not check how the columns are structured by running `head()` without restricting the range of columns? If it turns out that indeed some columns are integers rather than numeric then re-structure them using something along these lines: `Data1[,4:33] <- lapply(Data1[,4:33], as.numeric)` – Chris Ruehlemann Nov 22 '18 at 18:44
  • Well, mostly because using head() without restricting columns in a data frame with >50000 columns will be difficult to check manually! I have started restructuring using lapply though – Phil D Nov 26 '18 at 11:24
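
For a data frame too wide to inspect with head(), the column types can be tabulated instead of read off by eye. The following is only a minimal sketch on a made-up four-column data frame (`toy` and its column names are hypothetical, standing in for Data1). It also lists the columns that contain NA, since a single missing value makes mean() return NA.

```r
# Hypothetical toy data standing in for the real, very wide Data1.
toy <- data.frame(
  Name = factor(c("s1", "s2")),
  g1   = c(0.1, 0.2),   # double
  g2   = c(1L, 3L),     # integer
  g3   = c(0, NA)       # double with a missing value
)

# First class of every column, as a named character vector.
col_class <- vapply(toy, function(col) class(col)[1], character(1))

# How many columns there are of each class.
table(col_class)

# Names of the integer columns, however far along the data frame they sit.
int_cols <- names(col_class)[col_class == "integer"]

# Names of the columns containing at least one NA.
na_cols <- names(toy)[vapply(toy, anyNA, logical(1))]
```

On the real object, replacing `toy` with `Data1` should give the same kind of summary without printing 50000 columns.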

1 Answer


So if your only issue is changing the class of the integer variables in your data.frame, but you have many columns (>10000), you may want to consider converting your data.frame into a data.table. Your code would then look like this:

library(data.table)
Data1 <- data.table(Data1) # or, if your data is in a CSV file, use fread() instead of read.csv(); it returns a data.table directly

Then you just need to find the integer columns using this:

which(sapply(Data1,is.integer))

Putting it all together using the data.table commands:

Data1[,which(sapply(Data1,is.integer)):=lapply(.SD,as.numeric),.SDcols=which(sapply(Data1,is.integer))]

Note that you don't need to assign the result of the above line to anything: `:=` modifies a data.table by reference (in place) rather than copying it, which makes this kind of update much faster than the equivalent on a data.frame or tibble. Running the above line will therefore update your Data1 object efficiently. The classes of the other, non-integer columns (e.g., factors) will remain unchanged.
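
To make the in-place behaviour concrete, here is a small sketch on a made-up table (`toy`, `int_cols` and the column names are hypothetical, not part of the original data). It selects the integer columns by name and skips the update when there are none, which is an extra guard not present in the one-liner above.

```r
library(data.table)

# Hypothetical toy table: a factor, an integer and a double column.
toy <- data.table(
  Tissue = factor(c("a", "b")),
  counts = c(1L, 2L),
  expr   = c(0.5, 1.5)
)

# Names of the integer columns (is.integer() is FALSE for factors).
int_cols <- names(toy)[vapply(toy, is.integer, logical(1))]

# Convert in place; no assignment back to `toy` is needed.
if (length(int_cols) > 0) {
  toy[, (int_cols) := lapply(.SD, as.numeric), .SDcols = int_cols]
}

# Classes after the update.
classes <- vapply(toy, function(col) class(col)[1], character(1))
```

After this runs, `counts` should be numeric while `Tissue` is still a factor.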

Please update if you have further questions but this should answer your comment. Best of luck!

Jason Johnson
  • Thanks for the advice, but I figured out the solution (to this problem at least) myself in the end, and it wasn't the integers causing this problem (although they caused a different problem), it was simply the presence of NA values, which I didn't realise I could remove by using na.rm = TRUE inside the mean() function. Sorry for the very basic questions requiring very basic solutions, but I am still very new to coding and teaching myself! – Phil D Nov 26 '18 at 11:35
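
The fix described in the comment above can be sketched as follows. This is an illustration on made-up data, not the original poster's exact code: `toy`, `keep_column` and the `threshold` argument are hypothetical names. Without na.rm = TRUE, mean() returns NA for any column containing a missing value, so the logical index passed to `[` contains NA. The isTRUE() wrapper is an additional assumption here, so that a column consisting only of NA is dropped rather than producing another NA.

```r
# Hypothetical toy data: one factor column and three numeric columns,
# one of which contains an NA.
toy <- data.frame(
  Tissue = factor(c("a", "b", "c")),
  geneA  = c(0, 0.05, 0.1),     # mean 0.05, below the threshold
  geneB  = c(12.5, NA, 13.1),   # mean 12.8 once the NA is ignored
  geneC  = c(0.5, 0.2, 0.3)     # mean above the threshold
)

# Keep every non-numeric column; keep a numeric column only if its mean,
# ignoring NAs, is above the threshold.
keep_column <- function(x, threshold = 0.1) {
  if (is.numeric(x)) {
    isTRUE(mean(x, na.rm = TRUE) > threshold)
  } else {
    TRUE
  }
}

keep <- vapply(toy, keep_column, logical(1))

# drop = FALSE keeps the result a data frame even if one column survives.
filtered <- toy[, keep, drop = FALSE]
```

Because is.numeric() is TRUE for integer columns as well as doubles, the same function should handle both without any prior conversion.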