Speedy replacement of NA by strings using Data.Table or Rcpp in R

Question

I have a big table: 10M rows by 33 columns, of which 28 columns have some NA values. These NA values need to be patched using locf(). I read a few threads (efficiently locf by groups in a single R data.table and na.locf and inverse.rle in Rcpp) on this topic. However, these threads are about replacing numeric vectors. I am not too familiar with Rcpp so I don't know how to change their code to cater to strings---my data are all strings.

Here are my sample data:

Input Data

Sample_File = structure(list(SO = c(112, 112, 112, 112, 113, 113, 113, 113), 
    Product.ID = c("AB123", "CD234", "DE345", "EF456", "FG456", 
    "GH567", "HI678", "IJ789"), Name = c(NA, NA, NA, "Human Being", 
    NA, "Lion", NA, "Bird"), Family = c(NA, NA, NA, "Homo Sapiens", 
    NA, NA, NA, "Passeridae"), SL1_Continent = c("Asia", NA, 
    "Asia", "Asia", NA, NA, NA, "Australia"), SL2_Country = c("China", 
    "China", NA, NA, NA, NA, NA, "Australia"), SL3_Direction = c("East", 
    NA, "East", "East", NA, NA, NA, "West"), Expiration_FY = c(2021, 
    NA, 2018, NA, 2012, 2012, NA, 2012), Flag = c("Y", NA, "N", 
    "N", NA, NA, NA, "TBD"), Insured = c("No", NA, NA, NA, NA, 
    NA, NA, "Yes"), Revenue = c(0, 478227.44, 0, 0, 0, 0, 125550.4, 
    44314.51), Quantity = c(1000, 100, 100, 4, 6, 6, 4, 6)), .Names = c("SO", 
"Product.ID", "Name", "Family", "SL1_Continent", "SL2_Country", 
"SL3_Direction", "Expiration_FY", "Flag", "Insured", "Revenue", 
"Quantity"), row.names = c(NA, 8L), class = "data.frame")

Here's my code using data.table:

data.table::setDT(Sample_File)
cols <- c("Name","Family","SL1_Continent","SL2_Country","SL3_Direction","Expiration_FY","Flag","Insured")
Sample_File[, (cols):=lapply(.SD, function(x){na.locf(x,fromLast = TRUE,na.rm=TRUE)}), by = SO, .SDcols = cols]

Expected Output:

Output = structure(list(SO = c(112, 112, 112, 112, 113, 113, 113, 113), 
    Product.ID = c("AB123", "CD234", "DE345", "EF456", "FG456", 
    "GH567", "HI678", "IJ789"), Name = c("Human Being", "Human Being", 
    "Human Being", "Human Being", "Lion", "Lion", "Bird", "Bird"
    ), Family = c("Homo Sapiens", "Homo Sapiens", "Homo Sapiens", 
    "Homo Sapiens", "Passeridae", "Passeridae", "Passeridae", 
    "Passeridae"), SL1_Continent = c("Asia", "Asia", "Asia", 
    "Asia", "Australia", "Australia", "Australia", "Australia"
    ), SL2_Country = c("China", "China", "China", "China", "Australia", 
    "Australia", "Australia", "Australia"), SL3_Direction = c("East", 
    "East", "East", "East", "West", "West", "West", "West"), 
    Expiration_FY = c(2021, 2018, 2018, 2021, 2012, 2012, 2012, 
    2012), Flag = c("Y", "N", "N", "N", "TBD", "TBD", "TBD", 
    "TBD"), Insured = c("No", "No", "No", "No", "Yes", "Yes", 
    "Yes", "Yes"), Revenue = c(0, 478227.44, 0, 0, 0, 0, 125550.4, 
    44314.51), Quantity = c(1000, 100, 100, 4, 6, 6, 4, 6)), .Names = c("SO", 
"Product.ID", "Name", "Family", "SL1_Continent", "SL2_Country", 
"SL3_Direction", "Expiration_FY", "Flag", "Insured", "Revenue", 
"Quantity"), row.names = c(NA, -8L), class = "data.frame")

While the above code takes fraction of second to execute, it takes ~10 minutes to process one column in my original data-set, which translates to ~280 minutes to process 28 columns even with data.table.

I am assuming that I am not really utilizing the power of data.table above. I am not really sure. I'd sincerely appreciate any help to speed up na.locf() function.

Is there any more efficient method to replace NA above?

score 6 · Answer 1 · answered Mar 05 '17 at 12:02

I simplified the problem for the purpose of this example, but I guess it is easy enough to generalize. The code below defines the locppf function in Rcpp using C++11 syntax:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::plugins(cpp11)]]

using Map = std::unordered_map<double, int> ;
using Pair = Map::value_type ;

// [[Rcpp::export]]
CharacterVector locppf(NumericVector g, CharacterVector s) {
  auto n = g.size() ;
  CharacterVector out = clone(s) ;

  Map map ;
  for(int i=n-1; i>=0; i--){
    double value = g[i] ;
    auto it = map.find( value ) ;

    if( it == map.end() ){
      map.insert( Pair(value, i) ) ;
    } else {
      // if the current value is NA, replace it with the data at correct idx  
      auto current = s[i] ;
      if( CharacterVector::is_na( current ) ){
        out[i] = s[ it->second ] ;
      } else {
        it->second = i ;
      } 
    }
  }
  return out ;
}

The idea is to define a map to track the index of the last time we've seen something that was not NA in the group.I'm using std::unordered_map<double, int> as the map because your example also used a numeric vector for the group.

Let's break the relevant nuggets:

if( it == map.end() ){
  map.insert( Pair(value, i) ) ;
}

Here we check if the map has already seen the current value, and if not we retain the current index.

      auto current = s[i] ;
      if( CharacterVector::is_na( current ) ){
        out[i] = s[ it->second ] ;
      } else {
        it->second = i ;
      }

Here we check if the current value is NA with CharacterVector::is_na.

If it is we fill the result vector with the value that is in the index we retained earlier.

If not, we change the index that is remembered by the map for this group.

Now let's give ourselves some data:

library("zoo")
library("dplyr")
library("data.table")

with_holes <- function(x, p = .2){
  n <- length(x)
  x[ sample(n, n*p) ] <- NA
  x
}

n <- 1e6
x <- sample( as.numeric(1:100), n, replace= TRUE )
y <- with_holes( sample( letters, n, replace = TRUE) )
d <- data_frame( x = x, y = y )

And measure timing with various options:

Using dplyr syntax with group_by, mutate and na.locf

> system.time( d %>% group_by(x) %>% mutate( y = na.locf(y, fromLast = TRUE, na.rm = FALSE) ) )
  user  system elapsed 
  0.173   0.023   0.198

Using data.table syntax with na.locf. I don't guarantee this is the best data.table way to do this though.

> d2 <- as.data.table(d)
> system.time( d2[ , y := na.locf(y, fromLast = TRUE, na.rm = FALSE) , x ]  )
  user  system elapsed 
  0.159   0.030   0.188

With out custom locppf function:

> system.time( locppf(d$x, d$y) )
  user  system elapsed 
  0.028   0.001   0.028

Nice answer. Too bad we can't really microbenchmark it because of the reference semantics. — Dirk Eddelbuettel, Mar 05 '17 at 13:33
Found a way by making copies first. Then your Rcpp / C++11 solution is 3 to 4 times faster than data.table, and five times faster than dplyr. — Dirk Eddelbuettel, Mar 05 '17 at 14:04
@DirkEddelbuettel there is nothing here really from neither data.table or dplyr IIRC. It's just Rcpp na.locf vs. zoo::na.locf. I remember seeing a data.table version of na.locf using rolling joins written by eddie. I think [this one](http://stackoverflow.com/a/26181795/3001626) (untested). — David Arenburg, Mar 05 '17 at 14:23
Yes, I was about to make that comment to the OP's question -- just grouping in data.table and dispatch to `na.locf()` which is not really slow. That is why I wanted to see some benchmarking. Looks like Romain got a 5x which may well be worth it in some cases and may help OP. In general, of course, you are correct that there is not that much of a question here. — Dirk Eddelbuettel, Mar 05 '17 at 14:29
@Romain - Thank you for your effort and for guiding me. To be honest, this was my first `Rcpp` program, and it's been only 4 days since I have started using `data.table`. So, very respectfully, do you think you could help me understand how I can use your function to loop through my dataset? I was able to compile your code and test using a dummy object with an index (Numeric vector) and one character vector. I'd appreciate any help and the learning from this exercise. I searched on SO, but couldn't find how I can use `data.table` to loop through all the columns in my dataset. Thanks in advance. — watchtower, Mar 05 '17 at 22:42

Speedy replacement of NA by strings using Data.Table or Rcpp in R

1 Answers1