Assuming you have managed to read your file into a data.frame with factor columns (since R 4.0.0, read.table() only creates factors when called with stringsAsFactors = TRUE), you can use the fact that factors are stored internally as integer codes numbered from 1:
DF <- read.table(text = "ID1 ID2 ID3 ID4 ID5
SNP1 AA AA AB AA BB
SNP2 AB AA BB AA AA
SNP3 BB BB BB AB BB
SNP4 AA AB BB BB AA
SNP5 AA AA AA AA AA
", header = TRUE, sep = "", stringsAsFactors = TRUE)
for (i in seq_along(DF)) {
  # check if the column levels are ordered correctly; if not,
  # relevel the column
  if (!identical(levels(DF[[i]]), c("AA", "AB", "BB"))) {
    warning("Levels do not match in column ", i, ". Relevelling.")
    DF[[i]] <- factor(DF[[i]], levels = c("AA", "AB", "BB"))
  }
  # as.integer() drops the factor class and levels, leaving the
  # underlying integer codes (1, 2, 3); subtract 1 to start from 0
  DF[[i]] <- as.integer(DF[[i]]) - 1L
}
The code checks whether the levels are numbered correctly and relevels when necessary. Hopefully this doesn't happen too often, as it will slow things down.
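The same recoding can also be written without an explicit loop. The following is only a sketch, assuming that "AA", "AB" and "BB" are the only genotype codes in the file; the name `genotypes` is introduced here for illustration. It always rebuilds the factor with explicit levels, so no separate check is needed, at the cost of relevelling every column.

```r
# Same example data as above
DF <- read.table(text = "ID1 ID2 ID3 ID4 ID5
SNP1 AA AA AB AA BB
SNP2 AB AA BB AA AA
SNP3 BB BB BB AB BB
SNP4 AA AB BB BB AA
SNP5 AA AA AA AA AA
", header = TRUE, sep = "", stringsAsFactors = TRUE)

# Assumed complete set of genotype codes, in the order they should be numbered
genotypes <- c("AA", "AB", "BB")

# factor() with explicit levels fixes the numbering regardless of which
# genotypes occur in a column; as.integer() extracts the codes (1, 2, 3)
# and subtracting 1 shifts them to 0, 1, 2.
# Assigning into DF[] keeps the data.frame structure and the row names.
DF[] <- lapply(DF, function(x) as.integer(factor(x, levels = genotypes)) - 1L)
```

Any value not listed in `genotypes` becomes NA, which makes unexpected codes easy to spot with `anyNA(DF)`.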
It could be that your file does not fit into memory, which will cause the operating system (Windows, Linux, ...) to swap to disk. This will slow things down considerably. In that case you are probably better off using packages such as ff or bigmemory.
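If installing an extra package is not an option, a different technique is to read and recode the file in chunks with base R, so that only a few lines are held in memory at a time. This is not how ff or bigmemory work; it is only a rough sketch. The file contents, `chunk_size` and the `codes` lookup vector are made up for illustration.

```r
# Hypothetical input file, written here only to make the sketch self-contained
path <- tempfile(fileext = ".txt")
writeLines(c("ID1 ID2 ID3",
             "SNP1 AA AB BB",
             "SNP2 BB AA AB",
             "SNP3 AB AB AA"), path)

# Lookup vector mapping each genotype to its numeric code (assumed complete)
codes <- c(AA = 0L, AB = 1L, BB = 2L)
chunk_size <- 2  # number of lines to process at a time (illustrative value)

con <- file(path, open = "r")
# The first line holds the column names
header <- strsplit(trimws(readLines(con, n = 1)), "[[:space:]]+")[[1]]

result <- list()
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break  # end of file reached
  fields <- strsplit(trimws(lines), "[[:space:]]+")
  # Drop the row name (first field) and look up the code of each genotype
  m <- do.call(rbind, lapply(fields, function(f) codes[f[-1]]))
  rownames(m) <- vapply(fields, `[`, character(1), 1)
  colnames(m) <- header
  # For a really large file, write or summarize each chunk here
  # instead of collecting everything in memory
  result[[length(result) + 1]] <- m
}
close(con)

out <- do.call(rbind, result)
```

Collecting the chunks in `result` is only done to keep the example short; with a file that does not fit into memory, each chunk would be appended to an output file or reduced to a summary before the next one is read.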