Replacing factor levels more efficiently in a huge file

Question

I have a file with 800000 rows and 13000 columns. The file looks like:

        ID1 ID2 ID3 ID4 ID5
SNP1    AA  AA  AB  AA  BB
SNP2    AB  AA  BB  AA  AA
SNP3    BB  BB  BB  AB  BB
SNP4    AA  AA  BB  BB  AA
SNP5    AA  AA  AA  AA  AA

I want to replace the letters by numbers (AA = 0, AB = 1 and BB = 2). What I have done is:

data[data=="AA"] = 0

It seems to work fine on a small example, but it doesn't seem to do the job on the big file: it has taken hours. Is there a more efficient way to do it? Thank you very much. Paula.

With data of this size, you may even have trouble importing it into R. Maybe using some command line tools like `sed` might be a lot quicker. Also, have you looked at Bioconductor if you are working with genome data? — thelatemail, Apr 08 '15 at 04:51
...or an editor like `EmEditor` if you're on Windows. It's not free but has a free trial period. It handles those situations very well. — Dominic Comtois, Apr 08 '15 at 05:58

score 2 · Answer 1 · answered Apr 08 '15 at 02:54

Perhaps try this:

Read in your data:

DF <- read.table(text = "ID1 ID2 ID3 ID4 ID5
SNP1    AA  AA  AB  AA  BB
SNP2    AB  AA  BB  AA  AA
SNP3    BB  BB  BB  AB  BB
SNP4    AA  AA  BB  BB  AA
SNP5    AA  AA  AA  AA  AA
", header = TRUE, sep = "", stringsAsFactors = FALSE) 

> str(DF)
'data.frame':   5 obs. of  5 variables:
 $ ID1: chr  "AA" "AB" "BB" "AA" ...
 $ ID2: chr  "AA" "AA" "BB" "AA" ...
 $ ID3: chr  "AB" "BB" "BB" "BB" ...
 $ ID4: chr  "AA" "AA" "AB" "BB" ...
 $ ID5: chr  "BB" "AA" "BB" "AA" ...

Create a lookup table:

tab <- c("AA" = 0, "AB" = 1  , "BB" = 2)
> tab
AA AB BB 
 0  1  2

Some subassignment magic:

> DF[] <- tab[as.matrix(DF)]
> DF
     ID1 ID2 ID3 ID4 ID5
SNP1   0   0   1   0   2
SNP2   1   0   2   0   0
SNP3   2   2   2   1   2
SNP4   0   0   2   2   0
SNP5   0   0   0   0   0
> str(DF)
'data.frame':   5 obs. of  5 variables:
 $ ID1: num  0 1 2 0 0
 $ ID2: num  0 0 2 0 0
 $ ID3: num  1 2 2 2 0
 $ ID4: num  0 0 1 2 0
 $ ID5: num  2 0 2 0 0
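A column-wise variant of the same lookup idea, as a minimal sketch. It assumes the data has been read into R as a data frame of character columns, and it uses `match()` against a vector of genotype codes rather than indexing a named vector with one large matrix. Working one column at a time keeps each intermediate vector at the length of a single column, which may avoid building one 10-billion-element matrix, although the data frame itself still has to fit in memory. The names `codes` and `recode_column` are illustrative and not part of the answer above.

```r
# Small stand-in for the real data (character columns assumed).
DF <- read.table(text = "ID1 ID2 ID3 ID4 ID5
SNP1    AA  AA  AB  AA  BB
SNP2    AB  AA  BB  AA  AA
SNP3    BB  BB  BB  AB  BB
SNP4    AA  AA  BB  BB  AA
SNP5    AA  AA  AA  AA  AA
", header = TRUE, sep = "", stringsAsFactors = FALSE)

# Position in this vector minus 1 gives the code: AA = 0, AB = 1, BB = 2.
codes <- c("AA", "AB", "BB")

# Hypothetical helper: recode one column; unknown genotypes become NA.
recode_column <- function(col) match(col, codes) - 1L

# DF[] <- ... keeps the data frame shape and the row/column names.
DF[] <- lapply(DF, recode_column)
str(DF)
```

Unlike the named-vector lookup, this yields integer rather than numeric columns, which take half the memory per entry.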

thank you very much. It works in a small data set, but when I apply to the whole file I get the error: Error: long vectors not supported yet: memory.c:1093 — PaulaF, Apr 08 '15 at 03:30
@PaulaF I just noticed that you're talking about 800,000 * 13,000 = 10,400,000,000 ≈ 10 billion entries. I think this sort of thing requires "big data" R tools such as those outlined at http://cran.r-project.org/web/views/HighPerformanceComputing.html. Also try googling your error and see what turns up. Perhaps someone else has a more memory-efficient solution. — Peter Diakumis, Apr 08 '15 at 03:51

score 2 · Accepted Answer · edited May 23 '17 at 11:50

The file is likely too large for R, unless you use scan, which overcomplicates things IMO. This is a job better handled with GNU utilities.

If you're on Windows, install MSYS:

http://www.mingw.org/wiki/Getting_Started

Then use sed as mentioned to replace text:

sed -e 's/\bAA\b/0/g' -e 's/\bBA\b/1/g' -e 's/\bAB\b/1/g' -e 's/\bBB\b/2/g' <filename> > <newfile>

Edit:

If you must use R, you will likely need to read the file line by line: it contains ~10 billion entries, and at 3 chars each that is roughly 30 GB of text, a very large dataset indeed!

See this SO thread on reading a file line by line:

reading a text file in R line by line

However, I suspect this will be very slow.
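As a rough sketch of what such a line-by-line approach could look like in base R: the file is streamed through an open connection in chunks, each chunk is recoded with `gsub()`, and the result is written straight to a new file, so only one chunk is in memory at a time. The function names `recode_lines` and `recode_file` and the `chunk_size` parameter are made up for this illustration, and the word-boundary patterns mirror the `sed` command above. It has not been timed on a file of this size.

```r
# Hypothetical helper: recode genotype tokens in a vector of text lines.
# \\b is a word boundary, so names such as "SNP1" or "ID2" are untouched.
recode_lines <- function(lines) {
  lines <- gsub("\\bAA\\b", "0", lines, perl = TRUE)
  lines <- gsub("\\b(AB|BA)\\b", "1", lines, perl = TRUE)
  gsub("\\bBB\\b", "2", lines, perl = TRUE)
}

# Hypothetical driver: stream infile to outfile, chunk_size lines at a time.
recode_file <- function(infile, outfile, chunk_size = 1000L) {
  con_in <- file(infile, open = "r")
  con_out <- file(outfile, open = "w")
  on.exit({
    close(con_in)
    close(con_out)
  })
  repeat {
    lines <- readLines(con_in, n = chunk_size)
    if (length(lines) == 0L) break  # end of file reached
    writeLines(recode_lines(lines), con_out)
  }
  invisible(outfile)
}

# Usage on a small temporary file.
infile <- tempfile()
outfile <- tempfile()
writeLines(c("ID1 ID2 ID3", "SNP1 AA AB BB", "SNP2 BA BB AA"), infile)
recode_file(infile, outfile, chunk_size = 2L)
readLines(outfile)
# "ID1 ID2 ID3" "SNP1 0 1 2" "SNP2 1 2 0"
```

With 13,000 columns a line is tens of kilobytes, so `chunk_size` trades memory for fewer read calls; the default here is an arbitrary starting point.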

Thank you very much @Vince. I am using Linux and the command sed worked perfectly. You have no idea how you helped me. Thanks again. — PaulaF, Apr 09 '15 at 23:38

score 1 · Answer 3 · answered Apr 08 '15 at 14:51

Assuming you have managed to read your file into R and that it is a data.frame with factor columns, you can use the fact that factors are stored internally as integer codes numbered from 1:

# stringsAsFactors = TRUE is needed from R 4.0.0 on, where the default is FALSE
DF <- read.table(text = "ID1 ID2 ID3 ID4 ID5
SNP1    AA  AA  AB  AA  BB
SNP2    AB  AA  BB  AA  AA
SNP3    BB  BB  BB  AB  BB
SNP4    AA  AB  BB  BB  AA
SNP5    AA  AA  AA  AA  AA
", header = TRUE, sep = "", stringsAsFactors = TRUE)

for (i in seq_along(DF)) {
  # check if the column levels are ordered correctly; if not
  # relevel the column
  if (!identical(levels(DF[[i]]), c("AA", "AB", "BB"))) {
    warning("Levels do not match in column ", i, ". Relevelling.")
    DF[[i]] <- factor(DF[[i]], levels=c("AA", "AB", "BB"))
  }
  # remove the class of the column: this basically makes an integer
  # column from the factor
  attr(DF[[i]], "class") <- NULL
  # subtract 1 so that the numbering starts from 0
  DF[[i]] <- DF[[i]] - 1
}

The code checks whether the levels are ordered correctly and relevels when necessary. Hopefully this doesn't happen too often, as it will slow things down.
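A small illustration of why the level check matters, written as a sketch on made-up vectors `x` and `y` rather than on the real data. The integer codes of a factor follow the order of its levels, so a column in which one genotype never occurs gets shifted codes unless the levels are set explicitly.

```r
# A column with the levels in the required order.
x <- factor(c("AA", "BB", "AB", "AA"), levels = c("AA", "AB", "BB"))
as.integer(x)       # internal codes: 1 3 2 1
as.integer(x) - 1L  # desired codes:  0 2 1 0

# A column in which "AB" never occurs.
y <- c("AA", "BB", "AA")

# Default levels are only the values present, so BB wrongly becomes 1.
shifted <- as.integer(factor(y)) - 1L                                # 0 1 0

# Fixing the levels explicitly restores the intended coding.
fixed <- as.integer(factor(y, levels = c("AA", "AB", "BB"))) - 1L    # 0 2 0
```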

It could be that your data does not fit into memory, which will cause the operating system (Windows, Linux, ...) to swap memory to disk. This will slow things down considerably. In that case you are probably better off using packages such as ff or bigmemory.
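A back-of-the-envelope estimate of the memory involved, as a sketch that uses only the dimensions given in the question and ignores all overhead (column names, attributes, temporary copies). The variable names are illustrative.

```r
n_rows <- 800000
n_cols <- 13000
n_entries <- n_rows * n_cols  # 1.04e10, held as a double

# More elements than fit in a 32-bit integer index, so a single vector
# holding all entries would have to be a "long vector", which not every
# R operation supports.
exceeds_int_index <- n_entries > .Machine$integer.max  # TRUE

# Approximate storage in GiB.
gib_integer <- n_entries * 4 / 2^30  # 4 bytes per integer entry
gib_double <- n_entries * 8 / 2^30   # 8 bytes per double entry
round(c(integer = gib_integer, double = gib_double), 1)
```

That comes to roughly 39 GiB for integer codes and about twice that for doubles, which is why the data is unlikely to fit in the RAM of a typical workstation.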
