2

I have an already ordered data frame that looks like the following:

mydf <- data.frame(ID="A1", Level=c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), Taxonomy=c("D__Eukaryota","K__Chloroplastida",NA,"C__Mamiellophyceae",NA,NA,"G__Crustomastix","S__Crustomastix sp. MBIC10709"), Letter=c("D","K","P","C","O","F","G","S"))

  ID   Level                      Taxonomy Letter
1 A1  domain                  D__Eukaryota      D
2 A1 kingdom             K__Chloroplastida      K
3 A1  phylum                          <NA>      P
4 A1   class            C__Mamiellophyceae      C
5 A1   order                          <NA>      O
6 A1  family                          <NA>      F
7 A1   genus               G__Crustomastix      G
8 A1 species S__Crustomastix sp. MBIC10709      S

What I would like is to replace the NA values with the last non-NA value, adding the ALL the Letters "missed" at the beginning in a rolling fashion... See what I mean below.

The goal is to obtain a data frame like this:

  ID   Level                      Taxonomy Letter
1 A1  domain                  D__Eukaryota      D
2 A1 kingdom             K__Chloroplastida      K
3 A1  phylum          P__K__Chloroplastida      P
4 A1   class            C__Mamiellophyceae      C
5 A1   order         O__C__Mamiellophyceae      O
6 A1  family      F__O__C__Mamiellophyceae      F
7 A1   genus               G__Crustomastix      G
8 A1 species S__Crustomastix sp. MBIC10709      S

Notice the last 2 NAs, how the last one has to carry the value of the previous. See how the first one of the two starts with O__C and the last one with F__O__C.

So far, my best attempt is the following (thanks to Ajay Ohri):

library(zoo)
mydf <- data.frame(ID="A1", Level=c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), Taxonomy=c("D__Eukaryota","K__Chloroplastida",NA,"C__Mamiellophyceae",NA,NA,"G__Crustomastix","S__Crustomastix sp. MBIC10709"), Letter=c("D","K","P","C","O","F","G","S"))
mydf <- data.frame(lapply(mydf, as.character), stringsAsFactors=FALSE)
mydf$Letter2 <- ifelse(is.na(mydf$Taxonomy),paste(mydf$Letter,'__',sep=''),"")
mydf
mydf$Taxonomy <- paste(mydf$Letter2, na.locf(mydf$Taxonomy), sep='')
mydf

Notice how I still don't manage to do it in a rolling manner (I get F__C instead of F__O__C for the last NA). Any help? Thanks!

PS: let me know if it is still confusing, so I make another MWE with more NAs in a row, so it's more obvious what I need.

Uwe
  • 41,420
  • 11
  • 90
  • 134
DaniCee
  • 2,397
  • 6
  • 36
  • 59

5 Answers5

3

As the OP has mentioned that memory consumption is crucial, here is a data.table approach which uses the na.locf() function from the zoo package:

library(data.table)   # CRAN version 1.10.4 used
# coerce to data.table, convert factors to characters
DT <- data.table(mydf)[, lapply(.SD, as.character)]
# set marker for NA rows 
DT[, na := is.na(Taxonomy)][]
# fill NA by Last Observation Carried Forward
DT[, Taxonomy := zoo::na.locf(Taxonomy)][]
# create list of Letters and unique row count within each group of missing taxonomies
DT[(na), `:=`(tmp = .(Letter), rn = seq_len(.N)), by = .(ID, Taxonomy)][]
# replace incomplete taxonomies
DT[(na), Taxonomy := paste(c(rev(unlist(tmp)[1:rn]), Taxonomy), collapse = "__"), 
   by = .(ID, Taxonomy, rn)][]
# clean up
DT[, c("na", "tmp", "rn") := NULL][]
   ID   Level                      Taxonomy Letter
1: A1  domain                  D__Eukaryota      D
2: A1 kingdom             K__Chloroplastida      K
3: A1  phylum          P__K__Chloroplastida      P
4: A1   class            C__Mamiellophyceae      C
5: A1   order         O__C__Mamiellophyceae      O
6: A1  family      F__O__C__Mamiellophyceae      F
7: A1   genus               G__Crustomastix      G
8: A1 species S__Crustomastix sp. MBIC10709      S

I've refrained from chaining the expressions, so the code can be executed step by step.

Note that data.table is updating in place without copying the whole data set which saves memory as well as time.

Prerequisites and additional explanations

In response to this comment, the OP has confirmed that the starting data frame is ordered and non-redundant and that ID+Level should be the unique key of the data frame.

However, as the solution above depends on these assumptions it is worthwhile to add some checks:

# (1) ID + Level are unique keys: find duplicate Levels per ID
stopifnot(anyDuplicated(DT, by = c("ID", "Level")) == 0L)
# (2) rows missing: count rows per ID, there should be 8 Levels
DT[, .N, by = ID][, stopifnot(all(N == 8L))]
# (3) order, wrong Level names, and tests (1) and (2) as well
# create data.table with Level in proper order and a sequence number ln
levels <- data.table(
  ln = 1:8,
  Level = c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")
)
# left inner join, i.e., keep only rows with matching Level, keep order of DT
# then check for consecutively ascending level sequence numbers
levels[DT, on = "Level", nomatch = 0][, stopifnot(all(diff(ln) == 1L)), by = ID]

In addition, it has to be made sure that at least for the top Level "domain" the Taxonomy is specified. This can be doublechecked with:

# count number of rows with missing Taxonomy on top level "domain"
stopifnot(nrow(DT[Level == "domain" & is.na(Taxonomy)] == 0L))

The grouping logic by = .(ID, Taxonomy) is been used together with the selection on na, i.e. DT[(na), ..., in order to prepend the additional letters to Taxonomy where Taxonomywas originally missing. During development of the solution, I had introduced an additional helper column gn := rleid(ID, Taxonomy) which would cover duplicates as mentioned in this comment, Finally, I recognized that I can scrape this column because of the prerequisites.

Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • This can deal with my actual humongous data frame without R crashing, so I will accept this answer instead. Many thanks! – DaniCee Jun 28 '17 at 09:55
  • Well, I knew that `data.table` is efficient in terms of memory and speed, but I'm somewhat surprised because I had to add three additional temporary columns. BTW: How large is your production data set? – Uwe Jun 28 '17 at 10:21
  • 1
    It's 5,161,208 x 5. The Reduce method made R to terminate, but this one took relatively little time, no more than 5min or so... – DaniCee Jun 28 '17 at 10:26
  • I think in case of the same Taxonomy value in different rows the grouping logic (`by = .(ID, Taxonomy)]`) can lead to wrong results (mixing two different groups into one) with wrong letter prefixed. E. g. modify the data via `mydf <- rbind(mydf, mydf[2:3,])`. The OP assumes that the data is already ordered so this should never happen (but the OP did not specify the columns used for sorting) – R Yoda Jun 28 '17 at 19:43
  • 1
    ID+Level should be the unique key of the data frame, so I won't find the same Taxonomy in the starting data frame cause it is preceded by Letter, which is just the first letter of Level (unless Taxonomy is NA)... The order is defined by ID ascendantly and by Level following: "domain", "kingdom", "phylum", "class", "order", "family", "genus", "species". The starting data frame is ordered and non-redundant, like the one in the MWE. Should it be safe to use this solution then? – DaniCee Jun 29 '17 at 02:49
  • Thank you for clarification, DaniCee. I've updated my post to include tests of the prerequisistes and to answer the comment by @RYoda . – Uwe Jun 29 '17 at 07:04
2

One way to do this is to use Reduce with accumulate = TRUE argument. i.e.

ind <- is.na(mydf$Taxonomy)
mydf$Taxonomy <- zoo::na.locf(mydf$Taxonomy)
mydf$Taxonomy[ind] <- paste0(with(mydf, ave(Level, Taxonomy, FUN = function(i) 
     Reduce(paste, toupper(substr(rev(i), 1, 1)), accumulate = TRUE)))[ind], '_', 
                                                      sub('.*_', '', mydf$Taxonomy[ind]))

mydf$Taxonomy <- gsub(' ', '_', mydf$Taxonomy)


mydf
#  ID   Level                      Taxonomy Letter
#1 A1  domain                  D__Eukaryota      D
#2 A1 kingdom             K__Chloroplastida      K
#3 A1  phylum            P_K_Chloroplastida      P
#4 A1   class            C__Mamiellophyceae      C
#5 A1   order           F_O_Mamiellophyceae      O
#6 A1  family         F_O_C_Mamiellophyceae      F
#7 A1   genus               G__Crustomastix      G
#8 A1 species S__Crustomastix_sp._MBIC10709      S
Sotos
  • 51,121
  • 6
  • 32
  • 66
  • It seems to work but I don't really understand what it's doing... I'm having problems trying to modify it to my needs... how can I get "F__O__C" instead of "c o f"? – DaniCee Jun 28 '17 at 07:02
  • 1
    Perfect thanks! Note that Letter is already "toupper(substr(Level, 1, 1))", so you can just use Letter. – DaniCee Jun 28 '17 at 07:12
  • Everytime that I try the command on my real, much bigger, data frame, I get an "R Session Aborted. R encountered a fatal error. The session was terminated" error... Any clue? – DaniCee Jun 28 '17 at 08:50
  • No idea. Start on a fresh session maybe? – Sotos Jun 28 '17 at 08:52
  • Yes, I've tried a couple of times... It's probably running out of memory – DaniCee Jun 28 '17 at 08:54
  • `Reduce` works "row by row" IMHO (by applying a function to two rows/elements of a vector). I wouldn't expect fast or memory efficient processing that way. Try to use the R profile feature on a smaller subset of your data to examine the memory impact – R Yoda Jun 28 '17 at 09:52
  • 1
    I switched to the data.table solution, I will make it the accepted answer since it can deal with my actual data. Thanks everyone! – DaniCee Jun 28 '17 at 09:54
1

Step 1

I would first create a column with an ifelse

data$colnew=ifelse(is.na(data$Taxonomy),"missed","")

if you did not intend to paste the word missed you can skip this step

Step 2 Take last value

from Replacing NAs with latest non-NA value (see other approaches here)

use the na.locf() function from the zoo package to carry the last observation forward to replace your NA values

or new function

repeat_last = function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
    if (!forward) x = rev(x)           # reverse x twice if carrying backward
    ind = which(!is.na(x))             # get positions of nonmissing values
    if (is.na(x[1]) && !na.rm)         # if it begins with NA
        ind = c(1,ind)                 # add first pos
    rep_times = diff(                  # diffing the indices + length yields how often
        c(ind, length(x) + 1) )          # they need to be repeated
    if (maxgap < Inf) {
        exceed = rep_times - 1 > maxgap  # exceeding maxgap
        if (any(exceed)) {               # any exceed?
            ind = sort(c(ind[exceed] + 1, ind))      # add NA in gaps
            rep_times = diff(c(ind, length(x) + 1) ) # diff again
        }
    }
    x = rep(x[ind], times = rep_times) # repeat the values at these indices
    if (!forward) x = rev(x)           # second reversion
    x
}

function in also in formr package (Github only). https://github.com/rubenarslan/formr

Step3

Concatenate the two columns (newone) with df$Letter into a third using paste

Ajay Ohri
  • 3,382
  • 3
  • 30
  • 60
  • If you look at the expected output, I guess OP wants to concatenate the value(s) from the column `Letter`, in a rolling manner, and not the string "missed". Understandable confusion from the way the question was phrased though – Aramis7d Jun 28 '17 at 04:48
  • rephrased it accordingly. I think step 2 is the one he wants and its almost a duplicate except for the pasting part – Ajay Ohri Jun 28 '17 at 04:53
  • @Aramis7d is right, I want a rolling concatenation, notice the 2 consecutive NAs, how the first is O__C and the second is F__O__C. I'm going to try your answer and let you know – DaniCee Jun 28 '17 at 05:48
  • So it seems your solution gets me F__C instead of F__O__C for the last NA, it is actually that "rolling" check that I don't seem to figure out... Let me edit the question to reflect that – DaniCee Jun 28 '17 at 05:53
1

Since you mentioned memory and performance problems you switched to the accepted data.table solution.

I am adding another data.table variant which does not depend on other packages like zoo and may be fast enough if the Taxonomy column does not contain too long sequences of NAs since the longest sequence determines the number of repetitions of the while loop (e. g. two rep. in case of the example data):

library(data.table)

mydf <- data.frame(ID="A1", Level=c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), Taxonomy=c("D__Eukaryota","K__Chloroplastida",NA,"C__Mamiellophyceae",NA,NA,"G__Crustomastix","S__Crustomastix sp. MBIC10709"), Letter=c("D","K","P","C","O","F","G","S"))

setDT(mydf)

# Fill NA value in "Taxonomy" with the value of the prev. row until no NAs occur anymore
prev.number.NAs <- 0    # required to stop the loop if no more NA values can be carried forward
repeat {
  number.NAs <- sum(is.na(mydf$Taxonomy))
  if( number.NAs == 0 | number.NAs == prev.number.NAs) break;
  mydf[, filler := shift(Taxonomy), by = .(ID)]     # fill temporary working column with the value of the prev. row of the same group
  mydf[!is.na(filler) & is.na(Taxonomy), Taxonomy := paste0(Letter, "__", filler)]
  prev.number.NAs <- number.NAs
}


mydf[, filler := NULL]   # remove working column
mydf

Unfortunately the shift function of data.table does not offer a "last observation carry forward" parameter so that I had to use a while loop.

Update 1: As @UweBlock mentioned in his comment below I have replaced the while loop by a repeat loop to avoid an endless loop in case of an NA value in the column Taxonomy in the first row. THX for finding this!

Update 2: Carying forward the last observation is now only done within the same group of data (defined by the columns ID - as the OP indicated in a comment). Thx to @UweBlock for pointing out this issue!

R Yoda
  • 8,358
  • 2
  • 50
  • 87
  • 1
    It's hypothetical but watch out if Taxonomy is missing for any top level "domain". I guess this shouldn't happen in the OP's data but your code will carry over the last species to the next domain. And the 'while' loop won't terminate if Taxonomy is missing in the first rows of `mydf`. – Uwe Jun 29 '17 at 07:48
  • @UweBlock Your points are absolutely correct, thanks for pointing this out. Defining and assuring the preconditions is really important – R Yoda Jun 29 '17 at 14:40
0

An approach that fills NA values at the beginning with NAs, and also simplifies the logic to work with groups:

forward_fill <- function (x) {
  if (length(x) == 0) return (vector(mode(x), 0))      

  xt  <- tail(x, -1)  
  x0  <- c(x[1], xt[!is.na(xt)])
  id0 <- c(TRUE,    !is.na(xt))
  y   <- x0[cumsum(id0)]
  
  return (y)
}
Diego-MX
  • 2,279
  • 2
  • 20
  • 35