We need to fill in a classification data table. I tend to write for loops a little too much, so I'm trying to figure out how to do it with apply(). I'm scanning the last column to find a non-missing value, then filling in each column with the value above it, only on a diagonal. So if there are 3 columns, this would fill in the values for the last column. I'd repeat it for each 'higher taxonomic level', i.e. the next column to the left:
# Fills in the higher taxonomy for Family-level rows
for (i in seq_len(nrow(DataFrame))) {   # nrow(), not nrows(); seq_len() visits every row
  if (is.na(DataFrame[[4]][i])) next
  else {
    DataFrame[[3]][i] <- DataFrame[[3]][i-1]
    DataFrame[[2]][i] <- DataFrame[[2]][i-2]
    DataFrame[[1]][i] <- DataFrame[[1]][i-3]
  }
}
# Repeat to fill in Order's higher taxonomy (Phylum and Class)
for (i in seq_len(nrow(DataFrame))) {   # fills in for Order
  if (is.na(DataFrame[[3]][i])) next
  else {
    # Same diagonal as above: one row up per column to the left
    DataFrame[[2]][i] <- DataFrame[[2]][i-1]
    DataFrame[[1]][i] <- DataFrame[[1]][i-2]
  }
}
# And again for each column to the left.
The data may look like:
Phylum    Class       Order     Family
Annelida
          Polychaeta
                      Eunicida
                                Oenoidae
                                Onuphidae
                      Oweniida
                                Oweniidae
This will then repeat for each unique Family in that Order, each unique Order in a Class, and each unique Class in a Phylum. Essentially, we need to fill in the values to the left of each non-missing value, taking each one from the next non-missing value above it in the same column. So the end result would be:
Phylum    Class       Order     Family
Annelida
Annelida  Polychaeta
Annelida  Polychaeta  Eunicida
Annelida  Polychaeta  Eunicida  Oenoidae
Annelida  Polychaeta  Eunicida  Onuphidae
Annelida  Polychaeta  Oweniida
Annelida  Polychaeta  Oweniida  Oweniidae
We can't just copy down the columns, since once we get to a new phylum, the Class column has to keep one missing value, the Order column two missing values, and so on.
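One way around the copy-down problem might be to copy down anyway and then blank out whatever lands to the right of each row's own rank. The following is only a sketch in base R, assuming every row holds exactly one non-missing value and that each rank's parent is the most recent entry above it. The names `taxa`, `fill_down` and `fill_left` are made up for illustration.

```r
# Toy data in the layout shown above: one non-missing value per row.
taxa <- data.frame(
  Phylum = c("Annelida", NA, NA, NA, NA, NA, NA),
  Class  = c(NA, "Polychaeta", NA, NA, NA, NA, NA),
  Order  = c(NA, NA, "Eunicida", NA, NA, "Oweniida", NA),
  Family = c(NA, NA, NA, "Oenoidae", "Onuphidae", NA, "Oweniidae"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: carry the last non-missing value of a vector forward.
fill_down <- function(x) {
  # Index of the most recent non-missing element (0 before the first one).
  last_seen <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  c(NA, x)[last_seen + 1]
}

# Hypothetical helper: fill every column down, then restore the blanks
# to the right of each row's original (right-most) entry.
fill_left <- function(df) {
  depth <- apply(!is.na(df), 1, function(r) if (any(r)) max(which(r)) else 0L)
  filled <- as.data.frame(lapply(df, fill_down), stringsAsFactors = FALSE)
  for (j in seq_along(filled)) {
    filled[[j]][j > depth] <- NA
  }
  filled
}

result <- fill_left(taxa)
print(result)
```

This avoids the hard-coded i-1, i-2, i-3 offsets, so it should not depend on how many rows separate a value from its parent, but it has only been checked against the toy table here.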
I guess the challenge is that I need the value of DataFrame[[j]][i-n] in whatever function I pass to apply(). When apply() passes 'x' into the function, does it pass an object with attributes (like an index or row name), or simply the value?
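As far as I can tell, apply() converts the data frame to a matrix and hands each row to the function as a plain vector named by column, with no row index attached. A small check of that, plus the index-based alternative that does give access to row i - 1, is sketched below. The data frame `df` and the variables `seen` and `prev_a` are illustrative only.

```r
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Inside apply(), x is one row as a named vector; the names are the
# column names, and nothing in x says which row it came from.
seen <- apply(df, 1, function(x) paste(names(x), collapse = ","))

# Iterating over row indices instead lets the function look at other rows.
prev_a <- sapply(seq_len(nrow(df)), function(i) {
  if (i > 1) df$a[i - 1] else NA
})

print(seen)
print(prev_a)
```

So a function that needs DataFrame[[j]][i-n] probably has to be driven by the indices themselves (sapply() or a for loop over seq_len(nrow(...))) rather than by apply() over the rows.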
Or is this a wasted line of thought, and should I just do it with for loops and use Rcpp if I really need speed? This is done annually, and the data frame has ~8,000 rows and 13 columns we'd operate over. I don't think performance would be an issue, but we haven't tried yet. Not sure why.