3

We need to fill in a classification data table. I tend to write for loops a little too much, so I'm trying to figure out how to do this with apply(). I'm scanning the last column to find a non-missing value, then filling in each column to its left with the value above it, but only along a diagonal. So if there are 3 columns, this would fill in the values for the last column. I'd repeat it for each 'higher taxonomic level', i.e. the next column to the left:

# fills in for Family-level taxonomy
for (i in seq_len(nrow(DataFrame))) {
  # skip rows that have no Family value
  if (is.na(DataFrame[[4]][i])) next
  # copy the higher ranks from the rows above, along the diagonal
  DataFrame[[3]][i] <- DataFrame[[3]][i-1]
  DataFrame[[2]][i] <- DataFrame[[2]][i-2]
  DataFrame[[1]][i] <- DataFrame[[1]][i-3]
}

# Repeat to fill in Order's higher taxonomy (Phylum and Class)
for (i in seq_len(nrow(DataFrame))) {
  # skip rows that have no Order value
  if (is.na(DataFrame[[3]][i])) next
  # one column fewer to fill, so the diagonal offsets are 1 and 2
  DataFrame[[2]][i] <- DataFrame[[2]][i-1]
  DataFrame[[1]][i] <- DataFrame[[1]][i-2]
}
# And again for each column to the left.

The data may look like:

Phylum     Class       Order        Family  
Annelida   
           Polychaeta  
                       Eunicida
                                    Oenoidae
                                    Onuphidae     
                       Oweniida
                                    Oweniidae

This will then repeat for each unique Family in that Order, each unique Order in a Class, and each unique Class in a Phylum. Essentially, we need to fill in the values to the left of each non-missing value, using the nearest non-missing value above it in each column. So the end result would be:

Phylum     Class       Order     Family
Annelida
Annelida   Polychaeta
Annelida   Polychaeta  Eunicida
Annelida   Polychaeta  Eunicida  Oenoidae
Annelida   Polychaeta  Eunicida  Onuphidae
Annelida   Polychaeta  Oweniida
Annelida   Polychaeta  Oweniida  Oweniidae

We can't just copy down the columns, since once we get to a new phylum, Class has to stay missing for one row, Order for two rows, and so on.
I guess the challenge is that I need the value of DataFrame[[ j ]][ i-n ] in whatever function I would pass to apply(). When apply() passes 'x' into the function, does it pass an object with attributes (like index/row name) or simply the value?

Or is this a wasted line of thought, and should I just do it with for loops and use Rcpp if I really need speed? This is done annually, and the data frame has ~8,000 rows and 13 columns we'd operate over. I don't think performance would be an issue... but we haven't tried yet. Not sure why.

Jaap
Kevin
  • You can "copy down the columns" within a group. In this case, you could group by `cumsum(!is.na(DF$Family))` or similar. Packages data.table and dplyr are well suited to modifying by group. If you provide a reproducible example (e.g., `dput` of your "the data may look like"), someone could illustrate how. Guidance here: http://stackoverflow.com/a/28481250/1191259 – Frank Oct 16 '15 at 19:39
  • Why not use a tree structure instead? If you have so many common entries, perhaps it's better to express it as a set and then generate the table values if needed. – CinchBlue Oct 16 '15 at 19:42
  • Why is row 6 in the *Family* column empty? – Jaap Oct 16 '15 at 20:50
  • Re: row 6 in the Family column: we have some values that stop at the Order name. It'll occur at the Class and Phylum level too (and all the way down to species). – Kevin Oct 19 '15 at 14:48
  • Ok, I wasn't sure. In that case @jeremycg's answer is the better one. – Jaap Oct 19 '15 at 18:16

3 Answers

2

Here's my method, as long as your data looks like I'm guessing:

library(tidyr)
library(dplyr)
data[data == ""] <- NA
data %>% fill(-Family) %>%
         filter(!is.na(Family)) 

output:

    Phylum      Class    Order    Family
1 Annelida Polychaeta Eunicida  Oenoidae
2 Annelida Polychaeta Eunicida Onuphidae
3 Annelida Polychaeta Oweniida Oweniidae

If you want the empty rows, you can try this, which allows for arbitrary nesting and unnesting:

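data %>% fill(-Family) %>%
  filter(!is.na(Family)) %>%
  # for each complete row, rebuild its prefixes (first 1, 2, 3 and 4 columns),
  # stack them with NA padding, then drop the duplicates
  do(plyr::rbind.fill(unlist(
    lapply(1:nrow(.), function(z) lapply(1:4, function(xx) .[z,][1:xx])),
    recursive = FALSE))) %>%
  distinct()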

     Phylum      Class    Order    Family
1  Annelida       <NA>     <NA>      <NA>
2  Annelida Polychaeta     <NA>      <NA>
3  Annelida Polychaeta Eunicida      <NA>
4  Annelida Polychaeta Eunicida  Oenoidae
5  Annelida Polychaeta Eunicida Onuphidae
6  Annelida Polychaeta Oweniida      <NA>
7  Annelida Polychaeta Oweniida Oweniidae
8  Annelida       blah     <NA>      <NA>
9  Annelida       blah     blah      <NA>
10 Annelida       blah     blah      blah

dput of data:

structure(list(Phylum = c("Annelida", NA, NA, NA, NA, NA, NA, 
NA, NA, NA), Class = c(NA, "Polychaeta", NA, NA, NA, NA, NA, 
"blah", NA, NA), Order = c(NA, NA, "Eunicida", NA, NA, "Oweniida", 
NA, NA, "blah", NA), Family = c(NA, NA, NA, "Oenoidae", "Onuphidae", 
NA, "Oweniidae", NA, NA, "blah")), .Names = c("Phylum", "Class", 
"Order", "Family"), row.names = c(NA, -10L), class = "data.frame")
jeremycg
  • Thanks Jeremycg!!! The empty rows are important, there are codes associated with those values. This seems to have worked great! – Kevin Oct 19 '15 at 16:30
  • Just as a heads up, this produces every 'empty' combination: it starts from the condensed output and regenerates the rows from there. If you only want the 'empty' ones that are in your data, add some more data and an explanation (in a new question, probably). Also, if you have some that finish at Order (and never make it to Family), it will skip these too. – jeremycg Oct 19 '15 at 17:02
1

As an alternative to the other solutions, you can also use the na.locf function from the zoo package which replaces NA-values with the last observation (locf = last observation carried forward).

# replace empty spaces with NA values
df[df == ""] <- NA

# use na.locf to replace the NA values    
library(zoo)
df <- na.locf(df)

This results in:

> df
    Phylum      Class    Order    Family
1 Annelida       <NA>     <NA>      <NA>
2 Annelida Polychaeta     <NA>      <NA>
3 Annelida Polychaeta Eunicida      <NA>
4 Annelida Polychaeta Eunicida  Oenoidae
5 Annelida Polychaeta Eunicida Onuphidae
6 Annelida Polychaeta Oweniida Onuphidae
7 Annelida Polychaeta Oweniida Oweniidae
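
Because na.locf() carries every column forward, row 6 above ends up with the Family of the row before it. If the gaps to the right of each row's own rank need to stay empty, one option is to fill down and then blank out the deeper columns again. What follows is only a sketch in base R, not something zoo provides: `fill_taxonomy`, `own_rank` and `taxa` are made-up names, and it assumes character (not factor) columns, at least one value in every row, and that rows never skip a rank (a new Order is always listed below its own Class row).

```r
# Hypothetical helper: fill higher ranks downward, keep gaps to the right.
fill_taxonomy <- function(df) {
  # Own rank of each row = position of its right-most non-missing column.
  own_rank <- apply(!is.na(df), 1, function(r) max(which(r)))
  for (j in seq_along(df)) {
    col <- df[[j]]
    seen <- cumsum(!is.na(col))                  # number of values seen so far
    filled <- c(NA, col[!is.na(col)])[seen + 1]  # last observation carried forward
    filled[j > own_rank] <- NA                   # restore gaps right of the row's rank
    df[[j]] <- filled
  }
  df
}

# Illustrative data shaped like the example in the question.
taxa <- data.frame(
  Phylum = c("Annelida", NA, NA, NA, NA, NA, NA),
  Class  = c(NA, "Polychaeta", NA, NA, NA, NA, NA),
  Order  = c(NA, NA, "Eunicida", NA, NA, "Oweniida", NA),
  Family = c(NA, NA, NA, "Oenoidae", "Onuphidae", NA, "Oweniidae"),
  stringsAsFactors = FALSE
)

result <- fill_taxonomy(taxa)
print(result)
```

With this input, row 6 should keep an empty Family while still picking up its Phylum and Class. If ranks can be skipped in the real data, the plain fill-down step would carry a stale parent forward, so that case would need extra handling.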

Used data:

df <- structure(list(Phylum = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "Annelida"), class = "factor"), 
                     Class = structure(c(1L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "Polychaeta"), class = "factor"), 
                     Order = structure(c(1L, 1L, 2L, 1L, 1L, 3L, 1L), .Label = c("", "Eunicida", "Oweniida"), class = "factor"), 
                     Family = structure(c(1L, 1L, 1L, 2L, 3L, 1L, 4L), .Label = c("", "Oenoidae", "Onuphidae", "Oweniidae"), class = "factor")), 
                .Names = c("Phylum", "Class", "Order", "Family"), class = "data.frame", row.names = c(NA, -7L))
Jaap
  • I like the simplicity here, but we need gaps at times... This is definitely going in the tool box. – Kevin Oct 19 '15 at 16:34
0

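Here's one way: apply() passes each row to the function as a plain vector, with no row index attached, so bind the row number on as an extra first column and read it back inside the function.

x <- matrix(rnorm(100), 10, 10)
x <- cbind(seq_len(nrow(x)), x)  # column 1 now holds the row index

output <- apply(x, 1, function(i) {
  rowID <- as.numeric(i[1])  # row index of the current row
  x_orig <- unlist(i[-1])    # the original values of that row
  ## ... do some more stuff
  return(...something...)
})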
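
To make the skeleton above concrete, here is a small runnable sketch under invented assumptions: the matrix `m`, the name `change_from_prev`, and the "difference from the previous row" calculation are purely illustrative and not taken from the question. It shows the bound row index being used to reach back to an earlier row, which is what the `[i-n]` lookups in the question need.

```r
m <- matrix(1:12, nrow = 4)          # toy data: 4 rows, 3 columns
m_id <- cbind(seq_len(nrow(m)), m)   # prepend the row index as column 1

change_from_prev <- apply(m_id, 1, function(row) {
  row_id <- row[1]                   # index carried in by the extra column
  values <- row[-1]                  # the row's original values
  if (row_id == 1) return(NA_real_)  # the first row has nothing above it
  sum(values - m[row_id - 1, ])      # compare with the previous row of m
})
print(change_from_prev)
```

If the index is all that is needed, iterating over `seq_len(nrow(m))` with `sapply()` and indexing `m` directly avoids the extra column.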
alexwhitworth