Suppose we have the following data:

X Y
  6
  1
  2
  2
1 1
  8
  3
  4
  1
1 2

I want to convert it to:

X Y Y-1 Y-2 Y-3
  6
  1
  2
  2
1 1  2   2   1
  8
  3
  4
  1
1 2  1   4   3

That is, for each row where X = 1, take the three previous Y values and append them to that row as new columns.

I "brute-forced" it with a loop:

namevector <- c("Y-1", "Y-2", "Y-3")
mydata[ , namevector] <- ""

for (i in seq_len(nrow(mydata))) {
  # For marked rows, copy the three preceding Y values into columns 3 to 5
  if (mydata$X[i] != "") {
    mydata[i, 3:5] <- mydata$Y[(i - 1):(i - 3)]
  }
}

But it was too slow for my dataset of ~300k rows: it took about 10 minutes.

Then I found a post with a similar question where the `which` function was suggested. That reduced the time to a tolerable 1-2 minutes:

namevector <- c("Y-1", "Y-2", "Y-3")
mydata[ , namevector] <- ""
trials_rows <- which(mydata$X != "")

for (i in trials_rows) {mydata[i,3:5] <- mydata$Y[(i-1):(i-3)]}

But `which` itself takes less than a second, so I believe I can somehow combine it with some kind of transpose function and avoid the loop entirely. I just can't get my mind around how.

I have a big data frame (~300k rows), and ~6k rows have this "X" value.

Is there a fast and simple way to do this without iterating over the results of `which`?

Rakudajin
  • [See here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making an R question that folks can help with. That includes a sample of data, all necessary code, and a clear explanation of what you're trying to do and what hasn't worked. – camille Nov 25 '19 at 03:28

1 Answer

You can do this with a single assignment using some vectorised trickery:

mydata[trials_rows, namevector] <- mydata$Y[trials_rows - rep(1:3,each=length(trials_rows))]

mydata
#    X Y Y-1 Y-2 Y-3
#1  NA 6            
#2  NA 1            
#3  NA 2            
#4  NA 2            
#5   1 1   2   2   1
#6  NA 8            
#7  NA 3            
#8  NA 4            
#9  NA 1            
#10  1 2   1   4   3

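Basically, take each row in trials_rows, look back one, two, and three rows using a vectorised subtraction (trials_rows is recycled against the repeated offsets), and then assign the resulting Y values to the block formed by trials_rows in rows and namevector in columns. The values fill that block column by column, so the first length(trials_rows) values go to Y-1, the next length(trials_rows) values go to Y-2, and so on.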
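To make the index arithmetic concrete, here is a minimal sketch that spells out the intermediate vectors for the ten-row example. The names offsets, idx, vals and m are only illustrative and are not part of the one-liner above. The matrix() call mimics the column-by-column fill that the data frame assignment performs.

```r
# Y values and marked rows from the ten-row example
Y <- c(6L, 1L, 2L, 2L, 1L, 8L, 3L, 4L, 1L, 2L)
trials_rows <- c(5L, 10L)

# One block of offsets per lag: 1 1 2 2 3 3
offsets <- rep(1:3, each = length(trials_rows))

# trials_rows is recycled to 5 10 5 10 5 10 before subtracting
idx <- trials_rows - offsets   # 4 9 3 8 2 7

# Y values at those positions, grouped by lag
vals <- Y[idx]                 # 2 1 2 4 1 3

# Filling a 2 x 3 block column by column puts each lag in its own column
m <- matrix(vals, nrow = length(trials_rows),
            dimnames = list(as.character(trials_rows), c("Y-1", "Y-2", "Y-3")))
print(m)
```

Each row of m holds the three preceding Y values for one marked row, which is exactly what ends up in the Y-1, Y-2 and Y-3 columns.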

Reproducible example used here:

mydata <- structure(list(X = c(NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L), 
    Y = c(6L, 1L, 2L, 2L, 1L, 8L, 3L, 4L, 1L, 2L)), class = "data.frame", row.names = c(NA, 
-10L))
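As a hedged variant of the same single-assignment idea, the sketch below builds the index matrix explicitly with outer(), keeps the new columns as integers instead of character strings, and masks any index that would fall before the first row. It assumes marked rows are identified by a non-missing X, as in the example data. The names idx and lagged are illustrative, and the guard for early rows is an addition that the question did not ask for.

```r
mydata <- data.frame(
  X = c(NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L),
  Y = c(6L, 1L, 2L, 2L, 1L, 8L, 3L, 4L, 1L, 2L)
)

namevector <- c("Y-1", "Y-2", "Y-3")
mydata[, namevector] <- NA_integer_      # integer columns rather than ""

# Assumption: a marked row is one where X is not missing
trials_rows <- which(!is.na(mydata$X))

# Row r, column k of idx holds trials_rows[r] - k
idx <- outer(trials_rows, 1:3, "-")

# Guard: a marked row within the first three rows has no full history,
# so replace out-of-range positions with NA instead of indexing row 0 or below
idx[idx < 1] <- NA

# Indexing Y with the matrix returns values in column-major order
lagged <- matrix(mydata$Y[idx], nrow = length(trials_rows))

# Still a single assignment into the data frame
mydata[trials_rows, namevector] <- lagged
print(mydata)
```

Unmarked rows show NA in the new columns here rather than empty strings, which keeps the lagged values numeric for later calculations.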
thelatemail
  • Thank you, that's exactly what I was looking for. Although I don't quite understand how "rep" works way faster than looping through which, it reduced the time from ~70 to ~2 seconds. Is it basically a vectorized version of what I was manually doing? – Rakudajin Nov 25 '19 at 04:38
  • @Rakudajin - `rep` is written in compiled (C?) code and is basically just generating one big sequence of numbers for each of the rows to be replaced. That kind of operation is usually fractions of a second at worst. I imagine the big improvement here is that there is only one `<-` assignment to your `mydata` rather than 6000 separate assignments. – thelatemail Nov 25 '19 at 04:42