
I have a huge dataframe (600,000 x 12,000) and I need to replace some values. I have tried as below, but it takes more than 3 hours:

mydata[mydata = “AA”] <- 0
mydata[mydata = “AB”] <- 1
mydata[mydata = “BA”] <- 1
mydata[mydata = “BB”] <- 2
mydata[mydata = “--”] <- 5

I also tried this, but it doesn't work:

mydata <- as.data.frame(apply(mydata, function(x){replace(x, x == "AA",0)}))
mydata <- as.data.frame(lapply(mydata, function(x){replace(x, x == "AB",1)}))
mydata <- as.data.frame(lapply(mydata, function(x){replace(x, x == "BA",1)}))
mydata <- as.data.frame(lapply(mydata, function(x){replace(x, x == "BB",2)}))
mydata <- as.data.frame(lapply(mydata, function(x){replace(x, x == "--",5)}))

Any help? Thanks.

Roland
PaulaF
  • Just out of curiosity, how much RAM do you have? I failed to create such data.frame on 32 Gb machine. – cyberj0g Jun 18 '15 at 06:29
  • @cyberJ0g You are right. I wasn't thinking about the size – akrun Jun 18 '15 at 06:33
  • Could it be that you are using data.table? For data.frame the equal signs would be wrong... – bdecaf Jun 18 '15 at 06:35
  • Please create a [reproducible example](http://stackoverflow.com/a/5963610/1412059) and show expected in- and output. Your code should throw an error. – Roland Jun 18 '15 at 07:06
  • PaulaF, you can still use R. Load/create the data in chunks if possible. For example, subset the DF into 10 6e4 x 1.2e4 chunks. – Jacob H Jun 19 '15 at 06:09
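
The chunking idea from the last comment could be sketched roughly as follows. This is only an illustration: the function name recode_in_row_chunks, the chunk_rows parameter and the tiny demo matrix are made up here, and in practice each chunk would be read from disk rather than sliced from an object that is already in memory.

```r
# Lookup table: names are the genotype strings, values are the new codes
codes <- c("AA" = 0L, "AB" = 1L, "BA" = 1L, "BB" = 2L, "--" = 5L)

# Hypothetical helper: recode a character matrix a block of rows at a time
recode_in_row_chunks <- function(m, codes, chunk_rows = 60000L) {
  # Preallocate an integer result, which is far smaller than a character matrix
  out <- matrix(NA_integer_, nrow = nrow(m), ncol = ncol(m),
                dimnames = dimnames(m))
  for (start in seq(1L, nrow(m), by = chunk_rows)) {
    rows <- start:min(start + chunk_rows - 1L, nrow(m))
    chunk <- as.vector(m[rows, , drop = FALSE])
    # Named-vector lookup; values not present in `codes` become NA
    out[rows, ] <- unname(codes[chunk])
  }
  out
}

# Tiny demo with a 3 x 2 matrix processed two rows at a time
m <- matrix(c("AA", "AB", "--", "BB", "BA", "AA"), nrow = 3)
recoded <- recode_in_row_chunks(m, codes, chunk_rows = 2L)
```

Storing the result as integers instead of strings is what keeps the memory footprint down; the chunk size only limits how much temporary data exists at once.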

2 Answers


To me it looks like you have factors here, so it might be better to rename the factor levels rather than replace individual values. I found this nice page where they give some examples of how you can do that. If you want to end up with a numeric column, you could do something like as.numeric(as.character(x)) after you have replaced the levels (calling as.numeric() directly on a factor returns the internal integer codes, not the labels).
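
A minimal sketch of the level-renaming approach, assuming each column is a factor with the five levels from the question. The vector x and the lookup table codes are made up for illustration.

```r
# Example factor column (illustrative data, not from the question)
x <- factor(c("AA", "AB", "BA", "BB", "--", "AA"))

# Lookup table: old level name -> new value
codes <- c("AA" = 0, "AB" = 1, "BA" = 1, "BB" = 2, "--" = 5)

# Rename the levels; assigning the same label to "AB" and "BA" merges them
levels(x) <- codes[levels(x)]

# Convert labels to numbers; as.numeric(x) alone would return internal codes
x_num <- as.numeric(as.character(x))
```

Only the handful of levels is touched, not every element, which is why this tends to be much cheaper than element-wise replacement on large columns.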

Sarina
  • Please avoid posting a bare link; include the relevant code in the answer itself. –  Jun 18 '15 at 06:31

As mentioned in the comments, the requested data.frame is too big to fit in the memory of a typical desktop machine, and perhaps R is not the right tool for this job.

In any case, for a data.frame 1000 times smaller than requested, here is one way to do it.

First simulate some data:

set.seed(10001)
mydata = as.data.frame(matrix(sample(c("AA", "AB", "BA", "BB", "--"), 7200, replace = TRUE),
                       nrow = 600, ncol = 12))

head(mydata)
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 BA AB AB BA BB BB BA AA BA  BA  AA  BA
2 BB AB AA BA AA AA BB AB --  --  AA  --
3 AB -- -- BB BB -- BA AA AB  BA  AA  AB
4 -- BB BA AB BB BA BA BB AA  --  BA  BA
5 BB AA BA BB -- BA AB BB AA  BB  BB  --
6 AB -- AA BB BB BA -- -- AB  --  AA  AB

Then transform each column of the data.frame using apply together with the mapvalues function from the plyr package:

library(plyr)

# Vectors of values to transform
from_this = c("AA", "AB", "BA", "BB", "--")
to_this = c(0, 1, 1, 2, 5)

# Apply mapvalues to each column of data.frame
## I'm assuming that you want the new values to be of numeric type
new_mydata = apply(mydata, 2, 
                   function(x) as.numeric(as.character(mapvalues(x, from_this, to_this))))

This gives a numeric matrix (apply returns a matrix, not a data.frame):

head(new_mydata)
     V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
[1,]  1  1  1  1  2  2  1  0  1   1   0   1
[2,]  2  1  0  1  0  0  2  1  5   5   0   5
[3,]  1  5  5  2  2  5  1  0  1   1   0   1
[4,]  5  2  1  1  2  1  1  2  0   5   1   1
[5,]  2  0  1  2  5  1  1  2  0   2   2   5
[6,]  1  5  0  2  2  1  5  5  1   5   0   1
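
For comparison, the same mapping can be done in base R without plyr by using match on the whole matrix at once. This is a sketch on a small made-up data.frame, not a benchmarked solution for the full 600,000 x 12,000 case.

```r
# Small illustrative data.frame (made up for this example)
mydata <- data.frame(V1 = c("AA", "AB", "--"),
                     V2 = c("BB", "BA", "AA"),
                     stringsAsFactors = FALSE)

from_this <- c("AA", "AB", "BA", "BB", "--")
to_this <- c(0L, 1L, 1L, 2L, 5L)

# Convert once to a character matrix, then look every cell up in one pass
m <- as.matrix(mydata)
recoded <- matrix(to_this[match(m, from_this)],
                  nrow = nrow(m), dimnames = dimnames(m))
```

Values that do not appear in from_this come out as NA, and the result is an integer matrix, which takes much less memory than the original strings.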
hugot