
I am adding a column to my data frame and using that column to group my data based on two conditions (difference in time and distance). My code runs an ifelse statement over 50K observations. It does just fine looping from row 1 to 7, but it refuses to loop past row 8. It doesn't show an error, so I'm wondering: am I missing something in my code? Any help is greatly appreciated.

df
j <- 1
df$GroupID <- NA
df$GroupID[1] <- 1
for (i in 2:length(df)) {
  flashes <- df[which(df$GroupID==j)]
  h <- cbind(flashes$Long,flashes$Lat)
  point <- cbind(df$Long[i],df$Lat[i])
  lastrow <- tail(flashes, n = 1)
  moment <- lastrow$DateTime
  
  ifelse (min(spDistsN1(h,point,longlat = TRUE))<16 & 
          difftime(df$DateTime[i],moment)<minutes(15),
          df$GroupID[i] <-j,df$GroupID[i] <-NA)

  #arrange(df[, "GroupID"])
}

j=j+1

The first rows of my data look like this:

DateTime Lat Long GroupID
2019-07-01 00:00:04 28.478 81.066 1
2019-07-01 00:00:04 28.479 81.068 1
2019-07-01 00:00:04 28.482 81.066 1
2019-07-01 00:00:04 28.475 81.085 1
2019-07-01 00:00:04 28.484 81.084 1
2019-07-01 00:00:04 28.492 81.080 1
2019-07-01 00:00:04 28.493 81.080 1
2019-07-01 00:00:04 28.493 81.081 1
2019-07-01 00:00:04 28.494 81.078 NA
2019-07-01 00:00:04 28.495 81.075 NA
2019-07-01 00:00:04 28.497 81.075 NA
2019-07-01 00:00:04 28.507 81.074 NA

Kate
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Nov 24 '20 at 05:41
  • What are you trying to do? Can you show expected output for the data shared? – Ronak Shah Nov 24 '20 at 06:12
  • I appreciate the assist, but someone was able to find the issue. It turns out I used `length()` (counts the columns) instead of `nrow()` (counts the rows). That solved my problem. – Kate Nov 24 '20 at 06:14
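
For reference, the sample rows shown in the question could be shared as a reproducible snippet along these lines. This is only a sketch: the name `df_sample` is hypothetical, and the `"UTC"` time zone is an assumption, since the question does not state one.

```r
# First four rows copied from the sample output in the question
# (df_sample is a hypothetical name; tz = "UTC" is an assumption)
df_sample <- data.frame(
  DateTime = as.POSIXct(rep("2019-07-01 00:00:04", 4), tz = "UTC"),
  Lat      = c(28.478, 28.479, 28.482, 28.475),
  Long     = c(81.066, 81.068, 81.066, 81.085)
)

# dput() prints a copy-pasteable definition of an existing object,
# which is a convenient way to post sample data
dput(df_sample)
```

Pasting the `dput()` output into a question lets others rebuild exactly the same object, including column types.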

1 Answer


length() of a data.frame is the number of columns it has, not the number of rows, so use nrow() instead:

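library(sp)  # provides spDistsN1()

j <- 1
df$GroupID <- NA
df$GroupID[1] <- 1
# Row 1 is already assigned, so start at row 2 and iterate over rows, not columns
for (i in 2:nrow(df)) {
  # Rows already assigned to group j; the trailing comma makes this a row subset
  flashes <- df[which(df$GroupID == j), ]
  h <- cbind(flashes$Long, flashes$Lat)
  point <- cbind(df$Long[i], df$Lat[i])
  lastrow <- tail(flashes, n = 1)
  moment <- lastrow$DateTime

  # if/else instead of ifelse(): one scalar condition with an assignment in each branch.
  # Distance threshold is in km (longlat = TRUE); time threshold is in minutes.
  if (min(spDistsN1(h, point, longlat = TRUE)) < 16 &&
      difftime(df$DateTime[i], moment, units = "mins") < 15) {
    df$GroupID[i] <- j
  } else {
    df$GroupID[i] <- NA
  }
}

# As in the question, j is only incremented after the loop has finished
j <- j + 1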
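
To see the difference between `length()` and `nrow()` directly, here is a small sketch on a made-up data frame. The object `toy` and its values are hypothetical and only mimic the column names from the question.

```r
# Hypothetical 12-row, 4-column data frame standing in for the real data
toy <- data.frame(
  DateTime = as.POSIXct("2019-07-01 00:00:04", tz = "UTC") + 0:11,
  Lat      = seq(28.478, 28.507, length.out = 12),
  Long     = seq(81.066, 81.074, length.out = 12),
  GroupID  = NA
)

length(toy)  # number of columns
ncol(toy)    # same value as length(toy)
nrow(toy)    # number of rows

# seq_len(nrow(toy))[-1] visits rows 2..n and is empty for a one-row frame,
# whereas 2:nrow(toy) would count down (2, 1) in that case
rows_to_visit <- seq_len(nrow(toy))[-1]
```

A loop written as `for (i in 2:length(toy))` would therefore stop at the column count, no matter how many rows the data frame has.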
stevec
  • Thank you, this is what I needed. It looks like it's going past 8 now. Now that it is running, it looks like it is taking a minute. Any advice on how to speed up my code? – Kate Nov 24 '20 at 06:05
  • @Kate there are a few tricks. You could try it on a small subset of the data (e.g. `df[1:300, ]` for the first 300 rows) and time it. Then try refactoring the code inside the loop to make it more efficient. Currently it has 5 assignments inside the loop; if you can somehow reduce that, it could improve performance. Sometimes it's necessary to let it run overnight (or longer), but I would always estimate how long it should take first, and check the results on a smaller data frame to make sure they are what you expected. – stevec Nov 24 '20 at 06:33
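
The "time it on a subset, then estimate" suggestion above could be sketched roughly as follows. Everything here is hypothetical: `assign_groups()` is a placeholder for the real loop, and `df_demo` is random data that only reuses the `Lat`/`Long` column names.

```r
# Hypothetical stand-in for the grouping loop; swap in the real loop body
assign_groups <- function(d) {
  d$GroupID <- NA
  d$GroupID[1] <- 1
  for (i in seq_len(nrow(d))[-1]) {
    d$GroupID[i] <- 1  # placeholder work, not the real distance/time test
  }
  d
}

# Made-up data with the same coordinate column names as the question
set.seed(1)
df_demo <- data.frame(
  Lat  = runif(3000, 28.4, 28.6),
  Long = runif(3000, 81.0, 81.2)
)

# Time the loop on the first 300 rows only
small  <- df_demo[1:300, ]
timing <- system.time(result <- assign_groups(small))

# Rough linear extrapolation to the full row count (an optimistic estimate)
est_seconds <- unname(timing["elapsed"]) * nrow(df_demo) / nrow(small)
```

The linear extrapolation is only a lower bound for the loop in the question: each iteration computes distances to every row already in the group, so the runtime may grow faster than the row count. Timing two or three subset sizes gives a better sense of the trend.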