0

I need to sequently analyze a dataset while using subresults of the operations before.

As I am known to R I decided to work with this and one of the solution I tried is using an for loop.

The dataset which I loop through has around 8 million rows with 4 columns.

I use a data.table and the variables are of type character eg. "XXXXXXXXX"

I tried to loop through but it takes approx 0,7 second per cycle from which the "<-" operation takes half a second.

Can anybody recommend a better technique. Potentially rcpp, apply or whatever?

Thx for your support,

Holger

'%!in%' <- function(x,y)!('%in%'(x,y))
library('data.table')    


dt_loop <- data.table(
              paste0("XXXXXXXXXX", 1:80000000),
              paste0("YXXXXXXXXX", 1:80000000),
              paste0("ZXXXXXXXXX", 1:80000000),
              paste0("AXXXXXXXXX", 1:80000000)
      )

    colnames(dt_loop)[colnames(dt_loop)=="V1"] <- "m"
    colnames(dt_loop)[colnames(dt_loop)=="V2"] <- "c"
    colnames(dt_loop)[colnames(dt_loop)=="V3"] <- "ma"
    colnames(dt_loop)[colnames(dt_loop)=="V4"] <- "unused"


    for(i in 1:nrow(dt_loop)){
      m <- dt_loop$m[i]
      c <- dt_loop$m[i]

      if(m %!in% dt_loop$ma[1:i] & c %!in% dt_loop$ma[1:i]){
        dt_loop$ma[i] <- m
      } else { 
        if(m %in% dt_loop$ma[1:i]){
          dt_loop$ma[i] <- m
        } else {
          dt_loop$ma[i] <- c
        }
      } 
    }
Holger
  • 31
  • 2
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. We don't need all 8 million rows, but we do need something to test with and compare results. It would also be a good idea to describe what you are trying to do in words; there may be better strategies. – MrFlick May 17 '19 at 15:32
  • 1
    This code seems to start with a dataset where none of the values across four columns are repeated. Then, for each row, we look to see if `m` hasn't occurred up to that point in `ma` (it hasn't) and if `c` hasn't occurred up to that point in `ma` (it hasn't). Given that, replace `ma` with `m` in that row, and repeat. I expect this will result in `ma` being set to `m` for all rows. What am I missing? – Jon Spring May 18 '19 at 05:02

1 Answers1

0

This is a self-join Cartesian product solution. I modified your code to get somewhat meaningful results. I also think that if you have 8 million rows, you're going to have performance troubles when the nth loop depends on the nth before hand.

Changes in data structure:

  1. Used sample to get some repeats in the data.table
  2. Simplified the column names to data.table function setnames()
  3. Added an ID field
  4. Removed unused column.
'%!in%' <- function(x,y)!('%in%'(x,y))
library('data.table')    

# Generate Data -----------------------------------------------------------

set.seed(1)
n_rows <- 10
dt_loop <- data.table(
  sample(paste0("X", 1:n_rows), n_rows, replace = T),
  sample(paste0("Y", 1:n_rows), n_rows, replace = T),
  sample(paste0("X", 1:n_rows), n_rows, replace = T)
)

setnames(dt_loop, c('m', 'c', 'ma'))
dt_loop[, ID := .I]

I made significant changes to your loop.

  1. Assigned c <- dt_loop$c[i] as I don't know what using m there did.
  2. Removed the first if statement because of the new assignment of c.
# Original loop with Minor Mod --------------------------------------------

for(i in 1:nrow(dt_loop)){
  m <- dt_loop$m[i]
  c <- dt_loop$c[i] #changed to c instead of m

#Removed first ifelse condition
  #as it didn't make sense as originally constructed

  # if(m %!in% dt_loop$ma[1:i] & c %!in% dt_loop$ma[1:i]){
    # dt_loop$ma2[i] <- m
  # } else {
    if(m %in% dt_loop$ma[1:i]){
      dt_loop$ma2[i] <- m
    } else {
      dt_loop$ma2[i] <- c
    }
  # }
}
dt_loop

      m   c  ma ID ma2
 1:  X3  Y3 X10  1  Y3
 2:  X4  Y2  X3  2  Y2
 3:  X6  Y7  X7  3  Y7
 4: X10  Y4  X2  4 X10
 5:  X3  Y8  X3  5  X3
 6:  X9  Y5  X4  6  Y5
 7: X10  Y8  X1  7 X10
 8:  X7 Y10  X4  8  X7
 9:  X7  Y4  X9  9  X7
10:  X1  Y8  X4 10  X1

The self-join seems to be faster than the loop when I up the rows to 10,000, but it still slows down. Of note is that you can see when there is duplication with ma because the cartesian product expands the results so you get N == 2.

I believe that there are ways to get the self-join to work so that you only get the Nth row which should relieve some pressure.

dt_loop[dt_loop
        , on = .(ID <= ID
                 , ma = m)
        , .(.N
            ,i.ma2 #for comparison - remove
            ,ma3 = ifelse(is.na(x.ID), i.c, i.m)
            ,i.ID, i.m, i.c, i.ma
            ,x.ID, x.m, x.c, x.ma 
        )
        , by = .EACHI
        , allow.cartesian = T]

    ID  ma N i.ma2 ma3 i.ID i.m i.c i.ma x.ID  x.m  x.c x.ma
 1:  1  X3 0    Y3  Y3    1  X3  Y3  X10   NA <NA> <NA> <NA>
 2:  2  X4 0    Y2  Y2    2  X4  Y2   X3   NA <NA> <NA> <NA>
 3:  3  X6 0    Y7  Y7    3  X6  Y7   X7   NA <NA> <NA> <NA>
 4:  4 X10 1   X10 X10    4 X10  Y4   X2    1   X3   Y3  X10
 5:  5  X3 2    X3  X3    5  X3  Y8   X3    2   X4   Y2   X3
 6:  5  X3 2    X3  X3    5  X3  Y8   X3    5   X3   Y8   X3
 7:  6  X9 0    Y5  Y5    6  X9  Y5   X4   NA <NA> <NA> <NA>
 8:  7 X10 1   X10 X10    7 X10  Y8   X1    1   X3   Y3  X10
 9:  8  X7 1    X7  X7    8  X7 Y10   X4    3   X6   Y7   X7
10:  9  X7 1    X7  X7    9  X7  Y4   X9    3   X6   Y7   X7
11: 10  X1 1    X1  X1   10  X1  Y8   X4    7  X10   Y8   X1
Cole
  • 11,130
  • 1
  • 9
  • 24