I have already gone through related questions, such as "How to convert a factor to an integer\numeric without a loss of information?", but could not solve the problem.

I have a data frame

 SYMBOL             PVALUE1             PVALUE2
1   10-Mar   0.813027629406118    0.78820189558684
2   10-Sep 0.00167287722066533 0.00167287722066533
3   11-Mar    0.21179810441316   0.464576340307205
4   11-Sep 0.00221961024320294 0.00221961024320294
5   12-Sep   0.934667427815304   0.986884425214009
6   15-Sep 0.00167287722066533 0.00167287722066533
7    1-Dec   0.464576340307205  0.0911572830792113
8    1-Mar 0.00818426308604705  0.0252302356363697
9    1-Sep    0.60516237199519   0.570568468332992
10   2-Mar  0.0103975819620539 0.00382292568622066
11   2-Sep 0.00167287722066533 0.00167287722066533

When I run str():

str(df)
'data.frame':   20305 obs. of  3 variables:
 $ SYMBOL : Factor w/ 21050 levels "","10-Mar","10-Sep",..: 2 3 4 5 6 7 8 9 10 11 ...
 $ PVALUE1: Factor w/ 209 levels "0","0.000109570493049298",..: 169 22 110 24 181 22 139 39 149 44 ...
 $ PVALUE2: Factor w/ 216 levels "0","0.000109570493049298",..: 172 20 141 23 201 20 90 61 150 29 ...

When I check mode():

sapply(df,mode)
SYMBOL   PVALUE1   PVALUE2 
"numeric" "numeric" "numeric" 

When I try to assign values to the two numeric columns (2 and 3) based on the condition below:

df$Score <- rowSums(ifelse(df[,-1]==0, 0, 
                                       ifelse(df[, -1]<= 0.05, 2, ifelse(df[,-1]>= 0.065,-2,1))))

I get warning messages:
1: In Ops.factor(left, right) : ‘<=’ not meaningful for factors
2: In Ops.factor(left, right) : ‘<=’ not meaningful for factors
3: In Ops.factor(left, right) : ‘>=’ not meaningful for factors
4: In Ops.factor(left, right) : ‘>=’ not meaningful for factors

and the output looks like this:

SYMBOL             PVALUE1             PVALUE2       Score
1 10-Mar   0.813027629406118    0.78820189558684         NA
2 10-Sep 0.00167287722066533 0.00167287722066533         NA
3 11-Mar    0.21179810441316   0.464576340307205         NA
4 11-Sep 0.00221961024320294 0.00221961024320294         NA
5 12-Sep   0.934667427815304   0.986884425214009         NA
6 15-Sep 0.00167287722066533 0.00167287722066533         NA

If the factors are already numeric, why does the above code not work and give NA? How should I proceed?

Edit: dput() output:

structure(list(SYMBOL = structure(1:6, .Label = c("10-Mar", "10-Sep", 
"11-Mar", "11-Sep", "12-Sep", "15-Sep"), class = "factor"), PVALUE1 = structure(c(4L, 
1L, 3L, 2L, 5L, 1L), .Label = c("0.00167287722066533", "0.00221961024320294", 
"0.21179810441316", "0.813027629406118", "0.934667427815304"), class = "factor"), 
    PVALUE2 = structure(c(4L, 1L, 3L, 2L, 5L, 1L), .Label = c("0.00167287722066533", 
    "0.00221961024320294", "0.464576340307205", "0.78820189558684", 
    "0.986884425214009"), class = "factor")), .Names = c("SYMBOL", 
"PVALUE1", "PVALUE2"), row.names = c(NA, 6L), class = "data.frame")

I also tried this:

indx <- sapply(df, is.factor)
df[indx] <- lapply(df[indx], function(x) as.numeric(levels(x))[x])

indx returns

SYMBOL PVALUE1 PVALUE2 
  TRUE    TRUE    TRUE 

and the assignment gives:

Warning message:
In FUN(X[[3L]], ...) : NAs introduced by coercion
AwaitedOne
  • I tried `as.numeric(as.character())` and got `Warning message: NAs introduced by coercion [1] NA NA NA` – AwaitedOne May 08 '15 at 16:44
  • @ForrestR.Stevens your suggestion converts the two columns to integer codes, like `PVALUE1 PVALUE2 1 169 172 2 22 20 3 110 141 4 24 23 5 181 201 6 22 20` – AwaitedOne May 08 '15 at 16:50
  • @Gregor I don't know if I am missing your point. dput(head(df)) also displays a lot of data – AwaitedOne May 08 '15 at 17:28
  • @Gregor please check file here https://www.dropbox.com/s/swv5dej7u45wde9/df.csv?dl=0 – AwaitedOne May 08 '15 at 17:43
  • I read your csv with `read.csv`, all default options, and I got `SYMBOL` as factor and `PVALUE1` and `PVALUE2` as numeric. – Gregor Thomas May 08 '15 at 17:47
  • You could try `library(data.table); setDT(df)[, 2:3 := lapply(.SD, function(x) as.numeric(levels(x))[x]), .SDcols=2:3]` – akrun May 08 '15 at 18:09
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/77344/discussion-on-question-by-awaitedone-assining-values-to-numeric-factor-levels). – Taryn May 08 '15 at 18:10

2 Answers

Using your dput data, this works just fine:

df = structure(list(SYMBOL = structure(1:6, .Label = c("10-Mar", "10-Sep", 
"11-Mar", "11-Sep", "12-Sep", "15-Sep"), class = "factor"), PVALUE1 = structure(c(4L, 
1L, 3L, 2L, 5L, 1L), .Label = c("0.00167287722066533", "0.00221961024320294", 
"0.21179810441316", "0.813027629406118", "0.934667427815304"), class = "factor"), 
    PVALUE2 = structure(c(4L, 1L, 3L, 2L, 5L, 1L), .Label = c("0.00167287722066533", 
    "0.00221961024320294", "0.464576340307205", "0.78820189558684", 
    "0.986884425214009"), class = "factor")), .Names = c("SYMBOL", 
"PVALUE1", "PVALUE2"), row.names = c(NA, 6L), class = "data.frame")

df$PVALUE1 = as.numeric(as.character(df$PVALUE1))
df$PVALUE2 = as.numeric(as.character(df$PVALUE2))

df
#   SYMBOL     PVALUE1     PVALUE2
# 1 10-Mar 0.813027629 0.788201896
# 2 10-Sep 0.001672877 0.001672877
# 3 11-Mar 0.211798104 0.464576340
# 4 11-Sep 0.002219610 0.002219610
# 5 12-Sep 0.934667428 0.986884425
# 6 15-Sep 0.001672877 0.001672877

sapply(df, class)
#    SYMBOL   PVALUE1   PVALUE2 
#  "factor" "numeric" "numeric" 

If you have issues doing this to your whole data frame, it's possible you have some irregular rows. However, I also looked at the CSV you provided in the comments, and it looks just fine.
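
One way to look for such irregular rows is to check which entries fail to convert. This is only a sketch on a made-up factor; the vector `x` and the stray value "n/a" are assumptions for illustration, not taken from the asker's file.

```r
# Hypothetical column: numeric strings plus one irregular entry ("n/a")
x <- factor(c("0.81", "0.0017", "n/a", "0.21"))

# Convert via character; suppress the expected coercion warning
num <- suppressWarnings(as.numeric(as.character(x)))

# Entries that were not missing before but became NA after conversion
bad <- is.na(num) & !is.na(x)

bad_rows   <- which(bad)            # positions of the irregular entries
bad_values <- as.character(x[bad])  # the offending strings themselves
```

Inspecting `bad_values` on the real columns should show whether anything other than numbers made it into the data.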

Also note that this is one of several equivalent solutions in the duplicate question that you linked.

To convert all but the first column, you could do

# list-style indexing keeps a data frame even if only one column remains
df[-1] = lapply(df[-1], function(x) as.numeric(as.character(x)))

Note that you don't want to convert date columns or SYMBOL columns this way as they aren't numeric.

Similarly, to convert columns named, say, PVALUE1 through PVALUE47, you could construct the column names and then convert them:

col_to_convert = paste0("PVALUE", 1:47)
df[, col_to_convert] = lapply(df[, col_to_convert], function(x) as.numeric(as.character(x)))

In general, best practice is to not have these columns as factors in the first place. Whatever method you use to get this data into R probably has a way to specify column classes, e.g., colClasses in read.table, read.csv, etc.
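
As a minimal sketch of the colClasses approach, the snippet below reads a small inline CSV instead of a file. The text in `csv_text` and the name `df2` are assumptions for illustration; with a real file you would pass its path as the first argument instead of `text =`.

```r
# Two rows copied from the question, as inline CSV text for illustration
csv_text <- "SYMBOL,PVALUE1,PVALUE2
10-Mar,0.813027629406118,0.78820189558684
10-Sep,0.00167287722066533,0.00167287722066533"

# Declare the class of each column up front, in column order
df2 <- read.csv(text = csv_text,
                colClasses = c("character", "numeric", "numeric"))

col_classes <- sapply(df2, class)
col_classes
```

Declaring the classes this way means a non-numeric entry in a p-value column surfaces when the file is read, rather than silently turning the whole column into a factor.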

Gregor Thomas

An option using data.table

 library(data.table)
 setDT(df)[, 2:3 := lapply(.SD, function(x)
                    as.numeric(levels(x))[x]), .SDcols=2:3]

Or a slightly faster version would be to use set

 indx <- which(sapply(df, is.factor) & grepl('PVALUE', names(df)))
 setDT(df)

 for(j in indx){
   set(df, i=NULL, j=j, value= as.numeric(levels(df[[j]]))[df[[j]]])
 }

I guess you got the warning because the 'indx' you created also included the first column: it is a factor too, but its levels are not numbers. Converting non-numeric levels to numeric coerces those elements to NA.
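
The coercion described above can be reproduced on a small example, together with one possible way to pick only the factors whose levels are all numbers. The helper `is_numeric_factor` and the toy data frame `toy` are hypothetical names introduced here; they are not part of base R, data.table, or the question.

```r
# A factor with non-numeric levels converts to all NA
sym <- factor(c("10-Mar", "10-Sep", "11-Mar"))
converted <- suppressWarnings(as.numeric(levels(sym))[sym])

# Hypothetical helper: TRUE only for factors whose every level parses as a number
is_numeric_factor <- function(x) {
  is.factor(x) && !anyNA(suppressWarnings(as.numeric(levels(x))))
}

# Toy data frame standing in for the real one
toy <- data.frame(SYMBOL  = factor(c("10-Mar", "10-Sep")),
                  PVALUE1 = factor(c("0.81", "0.0017")))

# Convert only the columns that pass the check; SYMBOL is left alone
indx <- sapply(toy, is_numeric_factor)
toy[indx] <- lapply(toy[indx], function(x) as.numeric(levels(x))[x])
```

Note that an empty-string level such as "" also fails this check, so a column containing blanks would be skipped rather than converted.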

According to ?factor

To transform a factor ‘f’ to approximately its original numeric values, ‘as.numeric(levels(f))[f]’ is recommended and slightly more efficient than ‘as.numeric(as.character(f))’.
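
A short sketch on a made-up factor shows why this matters, and why sapply(df, mode) reporting "numeric" does not mean the column holds numbers: a factor stores integer codes, so its mode is "numeric" while its class is "factor". The vector `f` below is an assumption for illustration.

```r
# Made-up factor of numeric-looking strings
f <- factor(c("0.81", "0.0017", "0.21", "0.0017"))

f_mode  <- mode(f)    # "numeric": factors are stored as integer codes
f_class <- class(f)   # "factor"

codes       <- as.numeric(f)                # 3 1 2 1: the codes, not the values
via_levels  <- as.numeric(levels(f))[f]     # 0.81 0.0017 0.21 0.0017
via_charact <- as.numeric(as.character(f))  # same values as via_levels
```

Calling as.numeric() directly on the factor returns the level codes, which matches the integers such as 169 and 172 reported in the comments on the question.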

akrun
  • Thanks for your explanation. How do I set it for all the columns except the first? I think this is not the correct way: `setDT(df)[, -1 := lapply(.SD, function(x) as.numeric(levels(x))[x]), .SDcols= -1]` – AwaitedOne May 08 '15 at 18:28
  • You can use `2:ncol(df) :=` and `.SDcols= 2:ncol(df)` – akrun May 08 '15 at 18:29
  • Both of your methods work fine to change the factors into numeric. However, when I then try this code, `df$Score <- rowSums(ifelse(df[,-1]==0, 0, ifelse(df[, -1]<= 0.05, 2, ifelse(df[,-1]>= 0.065,-2,1)))) `, to assign a score for each entry of the numeric columns (leaving out the first), it gives an error: `Error in rowSums(ifelse(df[, -1] == 0, 0, ifelse(df[, -1] <= : 'x' must be an array of at least two dimensions` – AwaitedOne May 09 '15 at 04:29
  • Please do post that as a new question with some example and expected data – akrun May 09 '15 at 10:47
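
For reference, the scoring step from the question can be sketched in base R once the p-value columns are truly numeric. The example assumes a plain data.frame (not a data.table) with made-up values, and the names `p` and `score` are introduced here for illustration. Converting the numeric columns to a matrix first means ifelse() returns a matrix, which gives rowSums() the two-dimensional input it asks for in the error message above.

```r
# Made-up data with numeric p-value columns (an assumption for illustration)
df <- data.frame(SYMBOL  = c("10-Mar", "10-Sep", "11-Mar"),
                 PVALUE1 = c(0.813, 0.00167, 0),
                 PVALUE2 = c(0.788, 0.06, 0.0022),
                 stringsAsFactors = FALSE)

# Work on a numeric matrix so the result keeps two dimensions
p <- as.matrix(df[-1])

# Same thresholds as in the question: 0 -> 0, <= 0.05 -> 2, >= 0.065 -> -2, else 1
score <- ifelse(p == 0, 0,
         ifelse(p <= 0.05, 2,
         ifelse(p >= 0.065, -2, 1)))

df$Score <- rowSums(score)
df
```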