I am using the R packages 'tidyverse' and 'Rcpp' to call a C++ function inside 'mutate' on a tibble object.

I get the following error:

Error in mutate_impl(.data, dots) : 
  Evaluation error: GC encountered a node (0x112b8d800) with an unknown SEXP type: FREESXP at memory.c:1013. 

I tried to run it under valgrind, but valgrind itself fails with an error before executing anything, and I have not been able to fix that on my computer. So I would like to ask whether other people get the same error and might have a solution to it.

Here is example code to run:

# load necessary packages
library( tidyverse )
library( Rcpp )

# define C++ function inline
cppFunction( '
             IntegerVector lee_ready_vector( NumericVector & price, NumericVector &bidprice, 
                                NumericVector &askprice ) {
               const int nrows = price.length();
               IntegerVector indicator( nrows );
               if ( nrows < 3 ) {
                 return indicator;
               }
               if ( nrows != bidprice.length() || nrows != askprice.length() ) {
                 throw std::invalid_argument( "Arguments differ in lengths" );
             }

             NumericVector midprice = ( askprice + bidprice ) / 2.0;

             try {
               for( int i = 2; i <= nrows; ++i ) {
                 if ( price[i] == askprice[i] ) {
                   indicator[i] = 1;
                 } else if ( price[i] == bidprice[i] ) {
                   indicator[i] = -1;
                 } else {
                   if ( price[i] > midprice[i] ) {
                     indicator[i] = 1;
                   } else if ( price[i] < midprice[i] ) {
                     indicator[i] = -1;
                   } else { 
                   /* price == midprice */
                       if ( price[i] > price[i-1] ) {
                         indicator[i] = 1;
                       } else if ( price[i] < price[i-1] ) {
                         indicator[i] = -1;
                       } else {
                         if ( price[i] > price[i-2] ) {
                           indicator[i] = 1;
                       } else {
                           indicator[i] = -1;
                       }
                     }
                   }
                 }
               }
             } catch ( std::exception &ex ) {
               forward_exception_to_r( ex );
             } catch (...) {
               ::Rf_error( "c++ exception (unknown reason)" );
             }
             return indicator;
             }')
# define function for random dates inline
latemail <- function( N, st="2012/01/01", et="2012/03/31" ) {
  st <- as.POSIXct( as.Date( st ) )
  et <- as.POSIXct( as.Date( et ) )
  dt <- as.numeric( difftime( et,st,unit="sec" ) )
  ev <- sort(runif( N, 0, dt ) )
  rt <- st + ev
  sort( as.Date( rt ) )
}

# set random seed 
set.seed( 12345 )

# start test loop
# try 100 times to crash the session
# repeat this whole loop several times, if necessary
for ( i in 1:100 ) {
  # 500,000 observation altogether
  N <- 500000
  dates <- latemail( N )
  mid <- sample(seq(from=8.7, to=9.1, by = 0.01), N, TRUE)
  # bid and ask series lie around the mid series
  bid <- mid - .1
  ask <- mid + .1
  # p is either equal to bid or ask or lies in the middle
  p <- rep( 0, N )
  for(i in 1:2000) {
    p[i] <- sample( c(mid[i], bid[i], ask[i]), 1 )
  }
  # create the dataset
  df <- tibble( dates, p, bid, ask )

  # execute the C++ function on grouped data
  df %>% group_by( dates ) %>% 
    mutate( ind = lee_ready_vector( p, bid, ask ) ) %>% 
    ungroup()
}  

Is anybody able to reproduce the error? Can anyone suggest a solution?

Simon Z.

2 Answers

There is a lot going on in your code, and the example is not reproducible, which is always a drag. But let's start somewhere:

  1. Your loop index in C++ is for( int i = 2; i <= nrows; ++i ) which is very likely wrong. Indices in C and C++ run from 0 to n-1, so you probably want for( int i = 1; i < nrows; ++i ) which lets you lag once.

  2. Your use of inline and cppFunction is outdated. Use Rcpp Attributes instead. Read a recent intro such as the intro vignette from our recent TAS paper. That also frees you from doing the try/catch at the end.

  3. Your time conversion is too complicated. Just use anytime::anytime() on the input to get POSIXct.

  4. Your lack of indentation does not help. I would write the core part in a proper editor for C++ and maybe include the R snippet after /*** R or have a separate R file.

  5. Lee and Ready is nice but not all that predictive.
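Point 1 can be checked outside R. The following is a minimal sketch, not code from the original post: a hypothetical helper, accesses_in_bounds, replays the index pattern of the posted loop (price[i], price[i-1], price[i-2]) on a plain std::vector through at(), which throws std::out_of_range instead of silently touching memory the vector does not own. Because the posted loop body also reads price[i-2], the sketch assumes the loop has to start at 2 rather than 1.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of the original post or of Rcpp).
// Replays the index pattern of the posted loop for i = first..last (inclusive)
// and reports whether every access stays inside the vector.
bool accesses_in_bounds(const std::vector<double>& price, long first, long last) {
    assert(first <= last);  // precondition: a non-empty index range
    try {
        for (long i = first; i <= last; ++i) {
            // at() is bounds-checked; a negative index wraps to a huge
            // unsigned value and is rejected as well.
            (void)price.at(static_cast<std::size_t>(i));
            (void)price.at(static_cast<std::size_t>(i - 1));
            (void)price.at(static_cast<std::size_t>(i - 2));
        }
    } catch (const std::out_of_range&) {
        return false;  // some index fell outside 0..size-1
    }
    return true;
}
```

For a vector of length 5, the ranges 2..5 (the posted i <= nrows) and 1..4 are both reported as out of bounds, while 2..4 (that is, i < nrows starting at 2) passes.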

Dirk Eddelbuettel
  • Thanks for coming back to this issue so quickly, Dirk. I am now working through the list step by step. First, the fact that the example is not reproducible makes me wonder whether something is wrong with my system, because I can always crash my session with it. I am now checking the rest of the points. – Simon Z. May 27 '18 at 17:58
  • Please see https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Dirk Eddelbuettel May 27 '18 at 18:07
  • I am looking for a way to simplify my example, and I can tell that I am not able to reproduce the error without using the C++ function in dplyr::mutate together with grouping by a Date or POSIXct variable, as shown in the example above. – Simon Z. May 27 '18 at 20:31
  • There is no part of "minimally reproducible" that prohibits, say, three groups with four observations each. You need to understand that _we cannot help you if we cannot trigger the error_. No more, no less. – Dirk Eddelbuettel May 27 '18 at 20:40

Since my last post here, I have tried the tips Dirk gave above. Isolating the error to specific rows of data turned out to be quite difficult: because of the double grouping of this large dataset and the dependence between rows in the algorithm, I spent a lot of time testing without success and still had a lot of work ahead of me. At some point I turned to Dirk's first tip, namely

Your loop index in C++ is for( int i = 2; i <= nrows; ++i ) which is very likely wrong. Indices in C and C++ run from 0 to n-1, so you probably want for( int i = 1; i < nrows; ++i ) which lets you lag once.

So I recoded my loop as for( int i = 0; i < nrows - 2; ++i ), adjusted the indices of the variables inside the loop accordingly, and the error is gone. It seems that for some rows, when the last iterations of the loop were reached, an out-of-bounds index was used. From now on I will always start my loops at 0. Even though no concrete solution could be given, this tip helped me a lot. Thanks again.
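The recoded loop is not shown in the post, so the following is only a sketch of what it might look like. It assumes that the row being classified is i + 2, and it uses std::vector in place of Rcpp's NumericVector and IntegerVector so that it compiles without R; the name lee_ready_sketch is made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Sketch of the classification with a zero-based loop counter.
// Rows 0 and 1 keep the default 0 because they lack two predecessors.
std::vector<int> lee_ready_sketch(const std::vector<double>& price,
                                  const std::vector<double>& bid,
                                  const std::vector<double>& ask) {
    const std::size_t n = price.size();
    if (bid.size() != n || ask.size() != n) {
        throw std::invalid_argument("Arguments differ in lengths");
    }
    std::vector<int> indicator(n, 0);
    if (n < 3) {
        return indicator;
    }
    for (std::size_t i = 0; i < n - 2; ++i) {
        const std::size_t k = i + 2;  // assumed mapping: row being classified
        assert(k < n);                // the bound the posted loop violated
        const double mid = (ask[k] + bid[k]) / 2.0;
        // Exact comparisons on doubles mirror the posted code.
        if (price[k] == ask[k]) {
            indicator[k] = 1;
        } else if (price[k] == bid[k]) {
            indicator[k] = -1;
        } else if (price[k] > mid) {
            indicator[k] = 1;
        } else if (price[k] < mid) {
            indicator[k] = -1;
        } else if (price[k] > price[k - 1]) {
            indicator[k] = 1;   // at the midpoint: compare with previous price
        } else if (price[k] < price[k - 1]) {
            indicator[k] = -1;
        } else {
            indicator[k] = (price[k] > price[k - 2]) ? 1 : -1;  // second lag
        }
    }
    return indicator;
}
```

With this bound the largest index touched is n - 1 and the smallest is 0, so neither the reads at k - 2 nor the write to indicator[k] can leave the vectors.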

To point 2: In my package I actually use Attributes; here I wanted to give readers the possibility to just run the script in the console. For the future: how should cpp files be handled here? Just post the code and the file names?

Point 3: This is an interesting package. I used it here and there while searching for the error with sample data, but I had not heard of it before. Thanks for mentioning it.

Point 4: I edited this above. My apologies.

Regarding 5. Lee & Ready: in research this is still the most widely accepted algorithm for identifying trade direction, and since older papers used it, comparisons with the older literature use the same algorithm. As far as I know, you have been working in quantitative finance for a very long time; what alternative would you suggest, Dirk?

Simon Z.