
I am new to R and am having issues trying to work with a large dataset. I have a variable called DifferenceMonths and I would like to create a subset of my large dataset with only observations where the variable DifferenceMonths is less than 3.

It is coded into R as a factor so I have tried multiple times to convert it to a numeric. It finally showed up as numeric in my Global Environment, but then I checked using str() and it still shows up as a factor variable.

Log:

DifferenceMonths<-as.numeric(levels(DifferenceMonths))[DifferenceMonths]

Warning message:
NAs introduced by coercion 

KRASDiff<-subset(KRASMCCDataset_final,DifferenceMonths<=2)

Warning message:
In Ops.factor(DifferenceMonths, 2) : ‘<=’ not meaningful for factors

str(KRASMCCDataset_final)

'data.frame':   7831 obs. of  25 variables:
 $ Age                : Factor w/ 69 levels "","21","24","25",..: 29 29 29 29 29 29 29 29 29 29 ...
 $ Alive.Dead         : Factor w/ 4 levels "","A","D","S": 2 2 2 2 2 2 2 2 2 2 ...
 $ Status             : Factor w/ 5 levels "","ambiguous",..: 4 4 5 5 4 5 5 5 4 5 ...
 $ DifferenceMonths   : Factor w/ 75 levels "","#NUM!","0",..: 14 14 14 14 14 14 14 14 14 14 ...

Thank you!

divibisan
  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. A `str()` isn't as helpful as a `dput()` for testing possible solutions. It also looks like you clearly have values that are not numeric in there like "#NUM!" – MrFlick Mar 28 '18 at 19:21
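To see which values trigger the "NAs introduced by coercion" warning, one option is to convert the levels once and list the ones that fail. This is only a sketch on a made-up factor `x` (an assumption, not the real data); the levels `""` and `"#NUM!"` are taken from the `str()` output in the question.

```r
# Toy factor standing in for DifferenceMonths (hypothetical values).
x <- factor(c("2", "", "#NUM!", "10", "2", "0"))

# Convert each distinct level once; levels that are not numbers become NA.
lev <- levels(x)
lev_num <- suppressWarnings(as.numeric(lev))

# Levels that cannot be read as numbers.
bad_levels <- lev[is.na(lev_num)]
print(bad_levels)

# How many observations sit in each problematic level.
bad_counts <- table(droplevels(x[x %in% bad_levels]))
print(bad_counts)
```

Rows in these levels will end up as `NA` after conversion, so it is worth deciding whether that is acceptable before subsetting.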

1 Answer


It's ugly, but you want:

as.numeric(as.character(DifferenceMonths))

The problem, which you may have discovered, is that calling as.numeric() directly on a factor returns the internal integer codes, not the values. The values themselves are stored as text in the levels. Running as.numeric(levels(DifferenceMonths)) does convert them, but it returns one number per level, in level order, rather than one number per observation. The way around this is to coerce to character first, which gets away from the internal integer codes altogether.
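A small illustration of the difference, using a made-up factor `f` (hypothetical values, not the asker's data):

```r
# Levels are sorted as text: "0", "10", "2".
f <- factor(c("10", "2", "0", "2"))

codes <- as.numeric(f)                  # internal integer codes
per_level <- as.numeric(levels(f))      # one value per level, in level order
values <- as.numeric(as.character(f))   # one value per observation

print(codes)      # 2 3 1 3
print(per_level)  # 0 10 2
print(values)     # 10 2 0 2
```

Only the last form gives back the numbers as they appear in the data.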

EDIT: I learned something today. See this answer

as.numeric(levels(DifferenceMonths))[DifferenceMonths]

This is the more efficient and preferred way, particularly if length(levels(DifferenceMonths)) is less than length(DifferenceMonths), because each level is converted only once.
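As a quick check that the two idioms agree, here is a sketch on a made-up factor `f` (hypothetical values) that includes an empty-string level like the one in the question:

```r
# Toy factor with repeated values and one empty string.
f <- factor(c("1", "3", "", "12", "3", "1"))

# Convert the levels once, then index the result by the factor's codes.
via_levels <- suppressWarnings(as.numeric(levels(f)))[f]

# Convert every observation to text, then to a number.
via_char <- suppressWarnings(as.numeric(as.character(f)))

print(via_levels)
print(identical(via_levels, via_char))
```

Both give the same vector, with `NA` where the original value was not a number.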

EDIT 2: On review after @MrFlick's comment, and some initial testing, x <- as.numeric(levels(x))[x] can behave strangely. Try assigning the result to a new variable name. Let me see if I can figure out how and when this behavior occurs.

De Novo
  • Your edit shows exactly what the OP is already doing. This doesn't seem to directly address the problem. – MrFlick Mar 28 '18 at 19:22
  • I ended up using as.numeric(as.character(DifferenceMonths)), but I also assigned it to a new variable: DiffMon<-as.numeric(as.character(DifferenceMonths)) and that seemed to work – kmardinian Mar 28 '18 at 19:26
  • Is there a way to then add this new variable into my existing dataset? So that I can then create a subset using it? – kmardinian Mar 28 '18 at 19:39
  • I ended up figuring it out and using this: KRASMCCDataset_final$Diff <- Diff. Thank you for all the help – kmardinian Mar 28 '18 at 19:43
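For later readers, a likely reason the original subset() call still warned: subset() looks up DifferenceMonths among the columns of the data frame first, so converting a separate variable of the same name leaves the column, which is still a factor, untouched. A minimal sketch of converting the column itself and then subsetting, assuming a toy data frame `df` with made-up values in place of KRASMCCDataset_final:

```r
# Toy stand-in for KRASMCCDataset_final (hypothetical values).
df <- data.frame(
  id = 1:6,
  DifferenceMonths = factor(c("0", "2", "#NUM!", "14", "", "1"))
)

# Convert the column inside the data frame; non-numeric entries become NA.
df$DifferenceMonths <- suppressWarnings(
  as.numeric(as.character(df$DifferenceMonths))
)

# subset() now compares numbers; rows where the condition is NA are dropped.
small <- subset(df, DifferenceMonths < 3)
print(small)
```

Writing the converted values to a new column instead, as in the last comment, works the same way and keeps the original factor for reference.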