
I have a task where I need to process a large matrix (millions of rows, hundreds of columns) of character strings. Each row operation is independent. As such, I would like to exploit some parallel computing to increase the speed of the overall project.

If I build myWorker for numeric matrices, as follows, the code compiles without errors:

// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
#include <Rcpp.h>
#include <string>

struct myWorker : public RcppParallel::Worker
{
  // input
  const RcppParallel::RMatrix<double> input;
  int version;

  // output
  RcppParallel::RMatrix<double> outmat;

  // initialization
  myWorker(const Rcpp::NumericMatrix input, int version, Rcpp::NumericMatrix outmat) 
    : input(input), version(version), outmat(outmat) {}

  // the operator
  void operator()(std::size_t begin, std::size_t end) {
    // do stuff
  }
};

However, when I change the input matrix and the constructor to use Rcpp::CharacterMatrix, I get compile errors:

In instantiation of ‘RcppParallel::RMatrix<T>::RMatrix(const Source&) [with
Source = Rcpp::Matrix<16>; T = <typehere>]

R/x86_64-pc-linux-gnu-library/3.3/RcppParallel/include/RcppParallel/RMatrix.h:198:28:
error: cannot convert ‘Rcpp::Matrix<16>::iterator {aka
Rcpp::internal::Proxy_Iterator<Rcpp::internal::string_proxy<16> >}’ to
‘std::basic_string<char>*’ in initialization
         ncol_(source.ncol())

Combinations I've tried, with the constructor taking const Rcpp::CharacterMatrix input:

const RcppParallel::RMatrix<std::string> input;
const RcppParallel::RMatrix<char> input;
const RcppParallel::RMatrix<char*> input;
const RcppParallel::RMatrix<char**> input;
const RcppParallel::RMatrix<char32_t> input;

The pointers were a bad idea. The other options all lead to the same error shown above.

A very similar question was asked here.

Is there a simple way to wrap an Rcpp::CharacterMatrix with RcppParallel::RMatrix for thread-safe work with a character matrix?

EDIT

More details on the task:

The input matrix consists of ICD-9-CM or ICD-10-CM codes which need to be compared to sets of codes to determine classifications. There are millions of rows, hundreds of columns, and about a dozen classifications.

A small example in pure R would be:

classification_1 <-
  c("99680", "99688", "99689", "V421", "V422", "V426", "V5391", "4697", "5051",
    "5059", "5280", "5282", "4103", "0091", "0092", "0093")
classification_2 <-
  c("14", "15", "16", "17", "18", "19", "20", "23", "V4281", "V4282", "0010", "9925")

icd_codes <- 
  structure(c("5282", "3320", "4100", "0234", "V426", "3895", "3592", 
              "5651", "0397", "V5302", "5675", "0092", "V461", "4697", "5571", 
              "3776", "9964", "9702", "3583", "8607", "99661", "3767", "3129", 
              "3182", "5503", "5285", "4641", "6861", "3351", "2751", "76511", 
              "V446", "34581", "7472", "5190", "9723", "28801", "0010", "8103", 
              "4270", "9962", "4211", "4242", "34511", "3352", "0372", "76492", 
              "5675", "284", "4281", "3314", "0681", "3781", "0152", "3760", 
              "3763", "5597", "4399", "V5351", "8108", "3994", "4581", "V460", 
              "5533", "8137", "99663", "4210", "741", "5722", "8949", "76412", 
              "5569", "5674", "99667", "7707", "3753", "8606", "V553", "5051", 
              "2884", "5059", "7711", "8136", "5673", "7373", "2821", "5993", 
              "3776", "2822", "4274", "3789", "0371", "3591", "76523", "5722", 
              "V56", "V445", "2359", "4243", "99683"), .Dim = c(5L, 20L))

apply(icd_codes, 1,
      function(x) {
        c(class1 = as.integer(any(x %in% classification_1)),
          class2 = as.integer(any(x %in% classification_2)))
      }) 

Each row of the icd_codes object could be evaluated in parallel. Since I already have a working single-threaded C++ version of the above, I was hoping to use RcppParallel to improve the overall speed of the work and, critically, to do so in a way that is as close to OS-independent as possible. The group I'm working with consists of Windows, OSX, and Linux users.
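To make the row-wise task concrete, here is a minimal sketch of the per-range work in plain C++, assuming the strings have already been copied out of R into a column-major std::vector<std::string>. The names RowClassifier and CodeSet are hypothetical, and no threads are started here; the operator() is only shaped like the one in myWorker so that disjoint row ranges could be handed to it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

using CodeSet = std::unordered_set<std::string>;

// Hypothetical worker, shaped like myWorker above but holding only
// standard-library data, so operator() never touches R's API.
struct RowClassifier {
  const std::vector<std::string>& codes;  // column-major, like an R matrix
  std::size_t nrow;
  std::size_t ncol;
  const CodeSet& classification;          // codes that define one class
  std::vector<int>& flags;                // one output slot per row

  RowClassifier(const std::vector<std::string>& codes_, std::size_t nrow_,
                std::size_t ncol_, const CodeSet& classification_,
                std::vector<int>& flags_)
      : codes(codes_), nrow(nrow_), ncol(ncol_),
        classification(classification_), flags(flags_) {
    assert(codes.size() == nrow * ncol);
    assert(flags.size() == nrow);
  }

  // Process rows [begin, end). Each row writes only flags[i], so calls on
  // disjoint ranges do not write to the same element.
  void operator()(std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
      for (std::size_t j = 0; j < ncol; ++j) {
        // Element (i, j) of a column-major matrix lives at i + j * nrow.
        if (classification.count(codes[i + j * nrow]) > 0) {
          flags[i] = 1;
          break;  // same result as any(x %in% classification)
        }
      }
    }
  }
};
```

To use this with RcppParallel, one would presumably have the struct inherit from RcppParallel::Worker, copy the strings out of the Rcpp::CharacterMatrix on the main thread, and pass the worker to RcppParallel::parallelFor. That wiring is an assumption and is not shown or tested here.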

Peter

1 Answer


For an extremely fast Rcpp-based matrix algebra solution to the co-morbidity classification problem, see my package icd, particularly the PDF article sent to JSS on the methodology.

It'll never be fast with string processing, as you'll quickly find out when profiling, no matter how much higher-level optimization you do.
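One way to keep strings out of the hot loop is to map every distinct code to an integer once and then classify using integer lookups only. The sketch below illustrates that idea under stated assumptions. It is not the method used by the icd package, and the names CodeDictionary, encode, membership, and classify are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical dictionary assigning each distinct code a dense integer id.
struct CodeDictionary {
  std::unordered_map<std::string, int> ids;

  int id_of(const std::string& code) {
    auto it = ids.find(code);
    if (it != ids.end()) return it->second;
    const int id = static_cast<int>(ids.size());
    ids.emplace(code, id);
    return id;
  }
};

// Encode the whole (column-major) matrix once; this is the only pass that
// hashes strings.
std::vector<int> encode(const std::vector<std::string>& codes,
                        CodeDictionary& dict) {
  std::vector<int> out;
  out.reserve(codes.size());
  for (const auto& code : codes) out.push_back(dict.id_of(code));
  return out;
}

// member[id] == 1 when that id belongs to the classification. Codes in the
// classification that never occur in the data are skipped, since they can
// never match a row.
std::vector<char> membership(const std::vector<std::string>& classification,
                             const CodeDictionary& dict) {
  std::vector<char> member(dict.ids.size(), 0);
  for (const auto& code : classification) {
    auto it = dict.ids.find(code);
    if (it != dict.ids.end()) member[it->second] = 1;
  }
  return member;
}

// Flag each row that contains at least one member id; integer lookups only.
std::vector<int> classify(const std::vector<int>& encoded, std::size_t nrow,
                          std::size_t ncol, const std::vector<char>& member) {
  assert(encoded.size() == nrow * ncol);
  std::vector<int> flags(nrow, 0);
  for (std::size_t i = 0; i < nrow; ++i) {
    for (std::size_t j = 0; j < ncol; ++j) {
      if (member[encoded[i + j * nrow]] != 0) {
        flags[i] = 1;
        break;
      }
    }
  }
  return flags;
}
```

With about a dozen classifications, the encoding pass is shared and only the small membership table is rebuilt per classification. Whether this helps in practice is something profiling would have to confirm.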

Jack Wasey