
I have an extremely large dataframe, and I need to replace different values in it. Over time I have written a couple of different ways to replace the values I need to change. Here is a subset of the data so you can see what I am talking about:

df <- structure(list(CHROM = c("chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1"), POS = c(619L, 
668L, 744L, 745L, 1064L, 1099L, 1121L, 1123L, 1126L, 1193L, 1208L, 
1214L, 1250L, 1265L, 1274L, 1277L, 1283L, 1307L, 1314L, 1325L
), `GEN[D86396].GT` = c("0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0"), `GEN[D86397].GT` = c("0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0"), `GEN[D00105].GT` = c("0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0"), `GEN[D00151].GT` = c("0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0"), `GEN[D00188].GT` = c("0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0"), `GEN[D00220].GT` = c("0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0"), `GEN[D00257].GT` = c("0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0"), `GEN[D00258].GT` = c("0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0"), `GEN[D00264].GT` = c("0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0"), `GEN[D00268].GT` = c("0/0", 
"0/0", "0/0", "0/0", "0/1", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/1", 
"0/0"), `GEN[D00269].GT` = c("0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0"), `GEN[D00270].GT` = c("0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0"), `GEN[D00271].GT` = c("0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0"), `GEN[D00276].GT` = c("0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0"), `GEN[D00280].GT` = c("0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0"), `GEN[D00282].GT` = c("0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/1", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0"), `GEN[D00285].GT` = c("0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0"), `GEN[D00315].GT` = c("0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0"), `GEN[D00316].GT` = c("0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0"), `GEN[D00319].GT` = c("0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0"), `GEN[D00339].GT` = c("0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", 
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0")), row.names = c(NA, 
20L), class = "data.frame")

So what I need to do is convert "0/0" into 0, "0/1" into 1, "1/1" into 2, and "./." into 0.01 (although I don't think there are any of those in this example).

In the past, I have used the following

replacement <- function(x) {
  x <- replace(x, which(x == './.'), 0.01)
  x <- replace(x, which(x == '0/0'), 0)
  x <- replace(x, which(x == '0/1'), 1)
  x <- replace(x, which(x == '1/1'), 2)
  x
}
df <- apply(df, 2, replacement)
df <- as.data.frame(df)

Which works, but it still takes hours to run. I have also used this:

df <- df %>% mutate_at(
  vars(- CHROM, - POS),
  funs(case_when(
    . == "0/0" ~ 0,
    . == "0/1" ~ 1,
    . == "1/1" ~ 2,
    . == "./." ~ 0.01
  ))
)

Which is also just okay. I realize that, with large datasets, some things will take a long time to run. I am just curious what the fastest method is for replacing values. I have seen a lot of other posts asking similar questions about NAs, but I haven't been able to find any related to mine. I think using data.table might be the fastest method? Or maybe converting the dataframe to a matrix? I am curious what you think.
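
Something like this is what I had in mind for the data.table route (untested on the full data; `lookup` and the column selection are just for illustration):

```r
library(data.table)

# Named-vector lookup: indexing by the genotype strings returns the codes
lookup <- c("0/0" = 0, "0/1" = 1, "1/1" = 2, "./." = 0.01)

dt <- as.data.table(df)
gt_cols <- setdiff(names(dt), c("CHROM", "POS"))

# Update the genotype columns by reference, one lookup per column
dt[, (gt_cols) := lapply(.SD, function(x) unname(lookup[x])), .SDcols = gt_cols]
```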

Thanks in advance!

  • Hours? Curious how big your dataframe is. That might help to identify the most appropriate solution. – Jon Spring Sep 12 '19 at 20:23
  • When it gets loaded into R, it is about 100 gigs worth of data. It's about 350 columns and millions of rows (not too sure about the exact number) – neuron Sep 12 '19 at 20:43
  • If this is taking long, you could probably do the replacement outside R: use Notepad++, perl, awk, or even sed to do a regex replacement, then read the file into R after it has been replaced. Or, if you can collapse the whole dataframe within R into a single string, e.g. `do.call(paste, c(df, collapse='\n'))`, check whether that is a manageable object, use regex to replace, then convert back to a dataframe. – Onyambu Sep 12 '19 at 20:46
  • Huh, this file is big. Use the command line. – Onyambu Sep 12 '19 at 20:50
  • Do you have any suggestions to run in the command window? – neuron Sep 12 '19 at 20:51
  • Are you using Windows? – Onyambu Sep 12 '19 at 20:52
  • mac OS or Linux – neuron Sep 12 '19 at 20:54
  • Do you have sed installed? Just type `sed` in the command window and see what happens, or even type `perl`. You could also look at [this link](https://stackoverflow.com/questions/6951687/find-and-replace-text-in-a-47gb-large-file), though DO NOT USE THE `-i` option, as it edits the file in place. Do this ONLY WHEN YOU ARE SURE ABOUT THE RESULTS, otherwise you will lose your original file. – Onyambu Sep 12 '19 at 20:56
  • I have sed and perl installed on the cluster I use and on my laptop (depending on how much memory is required) – neuron Sep 12 '19 at 20:59
  • Don't know what that means, but you can now look at the link and many more. Ensure your regex works on toy examples before using the `-i` option on the real file. – Onyambu Sep 12 '19 at 21:03
  • Sorry, I didn't see you edited your post. I will check it out! – neuron Sep 12 '19 at 21:05
  • FYI `data.table::fread` [can take strings with unix commands and read the output directly](https://github.com/Rdatatable/data.table/wiki/Convenience-features-of-fread), so once you have a working sed/awk/whatever command, you can put it in `fread` to read the result directly in the future without leaving R – IceCreamToucan Sep 12 '19 at 21:08
  • Oh.. very cool! Thanks for letting me know!! I will definitely try that. – neuron Sep 12 '19 at 21:13

2 Answers


A fast and easy solution is to use a lookup table:

lookup_table <- c("0/0" = 0, "0/1" = 1, "1/1" = 2, "./." = 0.01)
df[-(1:2)] <- lapply(df[-(1:2)], function(x) lookup_table[x])

Equivalent (might use less max memory):

for (j in 3:length(df)) df[[j]] <- lookup_table[df[[j]]]
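
Since all the genotype columns are character, the same lookup can also be applied to the whole block at once through a matrix (a sketch of the matrix idea the OP mentioned, not benchmarked here; `lookup_table` as defined above):

```r
# Flatten the genotype columns into one character matrix, index the
# lookup vector with it (vectorised over every cell), then reshape
# back into the data frame
m <- as.matrix(df[-(1:2)])
df[-(1:2)] <- matrix(lookup_table[m], nrow = nrow(m), dimnames = dimnames(m))
```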

Benchmark

N <- 100e3
M <- 340
df <- data.frame(CHROM = 1, POS = seq_len(N))
for (j in 3:M) df[[j]] <- sample(c("0/0", "0/1", "1/1", "./."), N, TRUE)

system.time({
  lookup_table <- c("0/0" = 0, "0/1" = 1, "1/1" = 2, "./." = 0.01)
  df2 <- df
  df2[-(1:2)] <- lapply(df2[-(1:2)], function(x) lookup_table[x])
})
# 1.5 sec

system.time({
  replacement <- function(x) {
    x = replace(x, which(x == './.'), 0.01)
    x = replace(x, which(x == '0/0'), 0)
    x = replace(x, which(x == '0/1'), 1)
    x = replace(x, which(x == '1/1'), 2)
  }
  df3 <- as.data.frame(apply(df, 2, replacement), stringsAsFactors = FALSE)
})
# 4.5 sec

library(dplyr)
system.time({
  df4 <- df %>% mutate_at(
    -(1:2),
    ~ case_when(
      . == "0/0" ~ 0,
      . == "0/1" ~ 1,
      . == "1/1" ~ 2,
      . == "./." ~ 0.01
    )
  )
})
# 5.2 sec
F. Privé
  • This works as well! I am going to do a performance test to see which method is the fastest! Thanks for posting your answer!! I appreciate the time you took to answer my question. – neuron Sep 13 '19 at 12:55

I looked at some sed commands and I figured I should post what I found just in case someone has a similar issue.

The sed command I found to work in the terminal is the following (this creates a new file, but you don't have to create one):

sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+\./\.+0.01+g' R.test.txt > R.test.edit.txt

or, equivalently, from within R:

system("sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+\\./\\.+0.01+g' R.test.txt > R.test.edit.txt")

You can also use the `data.table::fread` approach mentioned by IceCreamToucan:

df <- fread("sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+\\./\\.+0.01+g' /R/R.test.txt")

It is interesting to note that the typical form of a sed substitution is

sed 's/old text/new text/g' file > new.file

but since what I needed to replace already contained a forward slash /, I used the plus sign + as the delimiter instead (sed accepts almost any character there) so sed doesn't get confused. Note also that . matches any character in a sed regex, so the literal dots in ./. should be escaped as \./\. to avoid accidental matches.
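
A quick way to sanity-check the expressions before touching the real file is to pipe a one-line toy sample through them (input made up here; note the escaped dots in `\./\.`):

```shell
# prints the three codes, tab-separated: 0, 1, 0.01
printf '0/0\t0/1\t./.\n' |
  sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+\./\.+0.01+g'
```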

I am going to run a performance test using my two older methods (posted above), the new sed method, and the method F. Privé posted as an answer. I am going to make a smaller subset of the full dataset, because testing the four methods on the whole thing would take too long.

EDIT

So I tested the four different methods to see which one was fastest. I created a smaller file to test them on, with 1,000,000 rows and 340 columns.

METHOD 1

lookup_table <- c("0/0" = 0, "0/1" = 1, "1/1" = 2, "./." = 0.01)
df[-(1:2)] <- lapply(df[-(1:2)], function(x) lookup_table[x])

Runtime - 8 minutes

METHOD 2

replacement <- function(x) {
  x <- replace(x, which(x == './.'), 0.01)
  x <- replace(x, which(x == '0/0'), 0)
  x <- replace(x, which(x == '0/1'), 1)
  x <- replace(x, which(x == '1/1'), 2)
  x
}
df <- apply(df, 2, replacement)
df <- as.data.frame(df)

Runtime - 46 seconds

METHOD 3

df <- df %>% mutate_at(
  vars(- CHROM, - POS),
  funs(case_when(
    . == "0/0" ~ 0,
    . == "0/1" ~ 1,
    . == "1/1" ~ 2,
    . == "./." ~ 0.01
  ))
)

Runtime - 42 seconds

METHOD 4

df <- fread("sed -e 's+0/0+0+g' -e 's+0/1+1+g' -e 's+1/1+2+g' -e 's+\\./\\.+0.01+g' /R/R.test.txt")

Runtime - 2 min 34 seconds, which was surprising

Conclusion - I wasted my time

neuron
  • FYI the “> newname” at the end isn’t required for fread (unless you want to create a new txt file), it will read the output directly without the need to create a new file at the OS level. – IceCreamToucan Sep 13 '19 at 15:02
  • I am actually getting some weird errors using `fread`. I am going to make a different post about it because I think it separate from the original question. I will add a link for you if you are interested – neuron Sep 13 '19 at 18:01
  • Do you still get errors if you remove the "> newfilename" at the end? – IceCreamToucan Sep 13 '19 at 18:05
  • When I remove the `> newfilename`, the output goes to the terminal window. It doesn't alter the actual file itself. So after I run `fread("sed 's+0/0+0+g' R.test.txt")` and then do `head R.test.txt`, nothing ends up changing. However, since the output is in the terminal window, I can see it did change what I wanted, but it didn't alter the file... if that makes sense? – neuron Sep 13 '19 at 18:10
  • https://stackoverflow.com/questions/57928576/strange-error-from-data-tablefread-using-sed Here is the post I just made – neuron Sep 13 '19 at 18:19
  • @Brian There is no way method 1 is taking so much time as compared to methods 2 & 3. – F. Privé Sep 16 '19 at 20:47
  • @F.Privé I did think it was a bit high, but I just reran the script and the time for method 1 was 7 min 57 seconds. I don't know... I thought it was high too. – neuron Sep 17 '19 at 12:51
  • @F.Privé so strange that our results aren't matching up. The actual file has ~30 columns before the columns with the `./.`, `0/0`, `0/1`, and `1/1`. Do you think that has to do with the differences in time? – neuron Sep 18 '19 at 13:59