
I'm converting a local R script to make use of the RevoScaleR functions in the Revolution R (aka Microsoft R Client/Server) package, so that it scales better with large amounts of data.

The goal is to create a new column that numbers the rows per group. Using data.table this would be achieved using the following code:

library(data.table)
eventlog[,ActivityNumber := seq(from=1, to=.N, by=1), by=Case.ID]

For illustration purposes, the output is something like this:

    Case.ID    ActivityNumber
1       A              1
2       A              2
3       B              1
4       C              1
5       C              2
6       C              3

After some research into doing this with the rx functions, I found the package dplyrXdf, which is basically a wrapper for using dplyr functions on Xdf-stored data while still benefiting from the optimized RevoScaleR functions (see http://blog.revolutionanalytics.com/2015/10/using-the-dplyrxdf-package.html).

In my case, this would lead to the following:

result <- eventlog %>%
  group_by(Case.ID) %>%
  mutate(ActivityNumber = seq_len(n()))

However, this leads to the following error:

ERROR: Attempting to add a variable without a name to an analysis.
Caught exception in file: CxAnalysis.cpp, line: 3756. ThreadID: 1248 Rethrowing.
Caught exception in file: CxAnalysis.cpp, line: 5249. ThreadID: 1248 Rethrowing.
Error in doTryCatch(return(expr), name, parentenv, handler) : 
  Error in executing R code: ERROR: Attempting to add a variable without a name to an analysis.

Any ideas on how to solve this error? Or other (better?) approaches to get the requested result?

Tim C.

3 Answers


Thanks to @Matt-parker for pointing me to this question.

Note that n() is not a regular R function, although it looks like one. It needs to be implemented specially for each data source, and maybe also separately for each of mutate, summarise and filter.

Right now, the only use of n() that is supported for xdf files is within summarise(), to count the number of rows. Implementing it for the other verbs is actually nontrivial.
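
For reference, the supported pattern is the grouped summarise() used in the exploratory code further down. A minimal sketch, assuming eventlog is an xdf data source as in the question (the column name nrows is just illustrative):

library(dplyrXdf)

# Counting rows per group with summarise() works on xdf data;
# row numbering with mutate() does not
counts <- eventlog %>%
  group_by(Case.ID) %>%
  summarise(nrows = n())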

In particular, there is a problem with Matt's use of seq_along to implement n's functionality. Remember that xdf files are block-structured: each chunk of rows is read in and processed independently of other chunks. This means that the sequence generated is for that chunk of rows only, and not for all the rows in a group. If a group spans more than one chunk, the sequence numbers will restart in the middle.
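
To make the restart concrete, here is a small plain-R simulation of chunk-wise processing (the data and the chunk split are made up for illustration; the real chunking happens inside RevoScaleR):

# Suppose group "C" spans two chunks of the xdf file
chunk1 <- data.frame(Case.ID = c("B", "C", "C"))
chunk2 <- data.frame(Case.ID = c("C", "C"))

# Each chunk is processed independently, so the within-group sequence restarts
lapply(list(chunk1, chunk2), function(chunk)
    ave(seq_len(nrow(chunk)), chunk$Case.ID, FUN = seq_along))
# chunk 1 numbers group C as 1, 2; chunk 2 numbers it 1, 2 again
# instead of continuing with 3, 4; hence the need for a running count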

The way to get correct sequence numbers is to keep a running count of how many rows you've read in for that group, and update it each time a chunk is processed. You can do this with a transformFunc, which you pass to transmute via the .rxArgs argument:

ev <- eventlog %>% group_by(Case.ID) %>% transmute(.rxArgs = list(
    transformFunc = function(varList) {
        n <- .n + seq_along(varList[[1]])
        if(!.rxIsTestChunk)  # needed because rxDataStep does a test run on the first 10 rows
            .n <<- n[length(n)]
        list(n=n)
    },
    transformObjects = list(.n = 0)))
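
To check the result, you can read the transformed data back into memory with rxDataStep, the same way the exploratory code below does (a sketch; ev is the xdf data source returned by the pipeline above):

# Materialise the xdf result as a data frame to inspect the per-group numbering
rxDataStep(ev)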

This should work with the local, localpar and foreach compute contexts. It may not work (or at least won't give a reproducible result) with any context where you can't guarantee that rxDataStep will process the rows in a deterministic order, such as MapReduce, Spark, Teradata or similar.

Hong Ooi

I'm not sure why this works, but try using seq_along(Case.ID) instead of seq_len(n()):

result <- eventlog %>%
  group_by(Case.ID) %>%
  mutate(ActivityNumber = seq_along(Case.ID))

It seems to be some problem with n(). Here's my exploratory code, in case anyone else wants to experiment:

options(stringsAsFactors = FALSE)

library(dplyrXdf)

# Set up some test data
eventlog_df <- data.frame(Case.ID = c("A", "A", "A", "A", "A", "B", "C", "C", "C"))

# Add a variable for artificially splitting the XDF into small chunks
eventlog_df$Chunk.ID <- factor((seq_len(nrow(eventlog_df)) + 2) %/% 3)

# Check the results
eventlog_df


# Now read it into an XDF file. I'm going to read just three rows in at a time
# so that the XDF file has several chunks, so we can be confident this works
# across chunks

eventlog <- tempfile(fileext = ".xdf")

for(i in 1:3) {
    rxImport(inData = eventlog_df[eventlog_df$Chunk.ID %in% i, ],
             outFile = eventlog,
             colInfo = list(Case.ID = list(type = "factor", 
                                           levels = c("A", "B", "C"))),
             append = file.exists(eventlog))
}

# Convert to a proper data source
eventlog <- RxXdfData(eventlog)

rxGetInfo(eventlog, getVarInfo = TRUE, numRows = 10)


# Now to dplyr. First, let's make sure it can count up the records
# in each group without any trouble.
result <- eventlog %>%
  group_by(Case.ID) %>%
  summarise(ActivityNumber = n())

# It can:
rxDataStep(result)


# Now if we switch to mutate, does n() still work?
result <- eventlog %>%
  group_by(Case.ID) %>%
  mutate(ActivityNumber = n())

# No - and it seems to be complaining about missing variables. So what if
# we try to refer to a variable we *know* exists?
result <- eventlog %>%
  group_by(Case.ID) %>%
  mutate(ActivityNumber = seq_along(Case.ID))

# It works
rxDataStep(result)
Matt Parker
  • It works indeed. Strange thing is that it sometimes fails with larger datasets (giving the exact same error as n()). I do not know if dplyrXdf is to blame, or something else... – Tim C. Aug 24 '16 at 09:39

dplyr and dplyrXdf have a tally method that counts items per group:

result <- eventlog %>%
  group_by(Case.ID) %>%
  tally()

If you want to do more than just tabulate the records per group, you can use summarize (since you didn't show your data, I'm using a hypothetical column called delay, which I'm assuming is numeric for illustrative purposes):

result <- eventlog %>%
  group_by(Case.ID) %>%
  summarize(counts = n(),
            ave_delay = mean(delay))

You could do the above with regular RevoScaleR functions:

rxCrossTabs(~ Case.ID, data = eventlog)

and for the second example:

rxCube(delay ~ Case.ID, data = eventlog)
alizaidi
  • Thanks for looking into this, but this does not answer my question. I do not want to get the total number of records per group, I want to number the records for each group. See the desired output from the data.table. The group_by is only used to assess where the numbering should start again from 0. – Tim C. Aug 19 '16 at 07:11