4

I have to import many datasets automatically. In each one the first column is a name (a character vector) and the second column is numeric, so I was calling read.table with colClasses = c("character", "numeric").

This works great if I have a dataframe saved in a df_file like this:

df<- data.frame(V1=c("s1","s2","s3","s4"), V2=c("1e-04","1e-04","1e-04","1e-04"))

read.table(df_file, header = FALSE,  comment.char="", colClasses = c("character", "numeric"), stringsAsFactors=FALSE)

The problem is that in some files the values in the second column are written as powers of ten, such as 10^(-4). In these cases the import fails because read.table does not recognise the column as numeric (or, if I don't specify colClasses, it imports the column as "character"). So my question is: how can I make a column import as numeric even when the values are written in this exponential form?

For example:

df<- data.frame(V1=c("s1","s2","s3","s4"), V2=c("10^(-4)","10^(-4)","10^(-4)","10^(-4)"))

I want all the exponential values to be imported as numeric, but even when I try to convert from character to numeric after the import, I get all NA: as.numeric(as.character(df$V2)) gives "Warning message: NAs introduced by coercion".

I have tried to use "real" or "complex" with colClasses too but it still imports the exponentials as character.
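A minimal reproduction of the behaviour described above, as a sketch: it uses inline text via the `text` argument of read.table in place of the real files, and the object names `ok` and `bad` are illustrative only.

```r
# Scientific notation in R's own format parses as numeric.
as.numeric("1e-04")

# A caret expression is not a numeric literal, so coercion gives NA
# (with the "NAs introduced by coercion" warning, suppressed here).
suppressWarnings(as.numeric("10^(-4)"))

# read.table shows the same split when it guesses the column type.
ok  <- read.table(text = "s1 1e-04\ns2 1e-04",
                  header = FALSE, stringsAsFactors = FALSE)
bad <- read.table(text = "s1 10^(-4)\ns2 10^(-4)",
                  header = FALSE, stringsAsFactors = FALSE)
class(ok$V2)   # numeric column
class(bad$V2)  # character column
```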

Please help, thank you!

user2337032
  • try something like: `as.numeric(gsub("0\\^", "e", gsub("[()]", "", df$V2)))` – Arun Jun 24 '13 at 11:35
  • possible duplicate of [Extract info inside all parenthesis in R (regex)](http://stackoverflow.com/questions/8613237/extract-info-inside-all-parenthesis-in-r-regex) – Roman Luštrik Jun 24 '13 at 11:37
  • Hi Arun, thank you. So I guess there isn't an "easy way out"? Meaning, if I want to do this do I have to first import all the datasets with the second column "as.character" and then do this only if the columns have an exponential term? – user2337032 Jun 24 '13 at 11:38
  • @user2337032, you can use `readLines` instead. I've posted an answer. – Arun Jun 24 '13 at 11:59
  • Thank you Arun your function works too! – user2337032 Jun 24 '13 at 11:59

3 Answers

6

I think the problem is that the form your exponentials are written in doesn't match the notation R parses as a number. If you read them in as character vectors, you can convert them, provided you know they are all written as 10^(...). Use gsub to strip out the "10^(" and the ")", leaving you with the "-4", convert that to numeric, then raise 10 to that power. It might not be the fastest way, but it works.

From your example:

 df<- data.frame(V1=c("s1","s2","s3","s4"), V2=c("10^(-4)","10^(-4)","10^(-4)","10^(-4)"))
 df$V2 <- 10^(as.numeric(gsub("10\\^\\(|\\)", "", df$V2)))
 df
#  V1    V2
#1 s1 1e-04
#2 s2 1e-04
#3 s3 1e-04
#4 s4 1e-04

What's happening in detail: gsub("10\\^\\(|\\)", "", df$V2) substitutes 10^( and ) with an empty string (you need to escape the caret and the parentheses), as.numeric() converts your "-4" string into the number -4, and then you just run 10^ on each element of the numeric vector you just made.
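The asker mentions checking for the "^" before converting. A possible way to fold that check into one step is sketched below, assuming a column may mix ordinary numeric strings with values written as 10^(k). The helper name `to_numeric_pow10` is hypothetical and not part of the answer above.

```r
# Hypothetical helper: convert strings that are either plain numerics
# ("1e-04", "0.5") or written as "10^(k)" into numbers.
to_numeric_pow10 <- function(x) {
  x <- as.character(x)                    # also copes with factor columns
  is_pow <- grepl("^10\\^\\(.*\\)$", x)   # entries of the form 10^(k)
  out <- numeric(length(x))
  # Strip "10^(" and ")" to get the exponent, then raise 10 to it.
  out[is_pow] <- 10^as.numeric(gsub("10\\^\\(|\\)", "", x[is_pow]))
  # Everything else is assumed to be a number R already understands.
  out[!is_pow] <- as.numeric(x[!is_pow])
  out
}

df <- data.frame(V1 = c("s1", "s2", "s3"),
                 V2 = c("10^(-4)", "1e-04", "10^(2)"),
                 stringsAsFactors = FALSE)
df$V2 <- to_numeric_pow10(df$V2)
```

Entries that match neither form would still come out as NA with a coercion warning, which is probably the behaviour you want for malformed input.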

NelsonGon
Bill Beesley
  • Thank you, this works. I just first imported as character and grepped to check if there was the "^", then used your code. Thank you! – user2337032 Jun 24 '13 at 11:58
6

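If you read in your data.frame with stringsAsFactors=FALSE, the column in question should come in as a character vector, in which case you can parse and evaluate each value as an R expression:

transform(df, V2=sapply(V2, function(x) eval(parse(text=x)), USE.NAMES=FALSE))

Evaluating element by element matters here: eval(parse(text=V2)) on the whole vector returns only the value of the last expression, so it gives the right answer only when every row holds the same value.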
Matthew Plourde
  • This is much simpler! Just one question, how would I use this if I had colnames (so header=TRUE) and wanted to substitute that second column? This gives me an error: df[,2] <- transform(df, V2=eval(parse(text=df[,2]))). Thank you!! – user2337032 Jun 24 '13 at 12:57
  • Warning message: In `[<-.data.frame`(`*tmp*`, , 2, value = list(V1 = c("s1", : provided 2 variables to replace 1 variables – user2337032 Jun 24 '13 at 13:11
  • can you post the output of `dput(head(df, 5))`? – Matthew Plourde Jun 24 '13 at 13:13
  • yes sure, > dput(head(df, 5)) structure(list(V1 = structure(1:4, .Label = c("s1", "s2", "s3", "s4"), class = "factor"), V2 = structure(c(1L, 1L, 1L, 1L), .Label = "10^(-4)", class = "factor")), .Names = c("V1", "V2"), row.names = c(NA, 4L), class = "data.frame") – user2337032 Jun 24 '13 at 13:20
  • Your V2 column is a factor. You'll either have to convert to character, or read it in again with `as.is=TRUE`. The reason you're getting the error is because you want `df <- transform(df, V2=eval(parse(text=as.character(V2))))`, not `df[, 2] <- ...` – Matthew Plourde Jun 24 '13 at 13:23
  • Sorry… This is a perfect, simple and elegant solution. I wish I could give you a check too. I used a combination of all these answers, i.e. I checked if there were these characters with readLines, then if there were I imported the data as characters and used your function to convert. Thank you to all!! – user2337032 Jun 24 '13 at 13:27
  • that's ok, maybe next time ;) – Matthew Plourde Jun 24 '13 at 13:30
3

You could use readLines to first load in the data and do all the operations required and then use read.table with textConnection as follows:

tt <- readLines("~/tmp.txt")
tt <- gsub("10\\^\\((.*)\\)$", "1e\\1", tt)
read.table(textConnection(tt), sep="\t", header=TRUE, stringsAsFactors=FALSE)
  V1    V2
1 s1 1e-04
2 s2 1e-04
3 s3 1e-04
4 s4 1e-04
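A self-contained variant of the same idea, as a sketch: it writes a small temporary file as a stand-in for "~/tmp.txt" (the sample rows are made up for illustration) and passes the cleaned lines through the `text` argument of read.table. Because gsub leaves lines without the pattern untouched, no separate check is needed for files that already use R's notation, and colClasses can then be applied as in the question.

```r
# Write a small tab-separated example file (stand-in for "~/tmp.txt").
path <- tempfile(fileext = ".txt")
writeLines(c("V1\tV2", "s1\t10^(-4)", "s2\t10^(-3)", "s3\t1e-02"), path)

tt <- readLines(path)
# Rewrite a trailing 10^(k) as 1ek; lines without the pattern are unchanged.
tt <- gsub("10\\^\\((.*)\\)$", "1e\\1", tt)

# colClasses now works because every value is a valid numeric literal.
df <- read.table(text = tt, sep = "\t", header = TRUE,
                 colClasses = c("character", "numeric"))
```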
Arun
  • This could be faster actually than reading all of the file first and it is better for me since my files are huge! Thank you! – user2337032 Jun 24 '13 at 12:03
  • Still, I think there is a problem with re-importing as a dataframe with separate columns: tt <- readLines("~/tmp.txt"); if(all(grep("10\\^\\((.*)\\)$", tt))) {tt <- gsub("10\\^\\((.*)\\)$", "1e\\1", tt)}; df <- read.table(textConnection(tt), sep="\t", header=TRUE, stringsAsFactors=FALSE) – user2337032 Jun 24 '13 at 12:36
  • that's because you are using `\(` instead of `\\(` and `\)` instead of `\\)`. You should escape twice. – Arun Jun 24 '13 at 12:43