I have read that appending to vectors in R is bad practice. In that case, what should I do when I want to create a vector but I don't know its length ahead of time?

I am looking at a data frame that contains entries about when people are near a specific location. Each entry contains information about the person and the time they were close by, but there can be many entries for a single person.

#    loc  id        time
# 1:   z   A       00:00
# 2:   z   A       00:01
# 3:   z   B       00:02
# 4:   z   A       00:02
# 5:   z   C       00:05
# 6:   z   C       00:07
# 7:   z   A       00:08
# 8:   z   A       00:09
# 9:   z   C       00:09
#10:   z   C       00:10

I want to create a new data frame in which each entry is a "visit" by a person, collating any entries from a single person that are close in time.

#    loc  id   starttime  endtime
# 1:   z   A       00:00   00:02
# 2:   z   C       00:05   00:07
# 3:   z   A       00:08   00:09
# 4:   z   C       00:09   00:10

There may be 50 entries for a single person in the first data frame, which may be collated into 3 "visits" in the new data frame. I don't know ahead of time how many "visits" there are. So how should I go about creating this data frame?

I know of rbind, but in this case I would be binding each row one by one. Is that a good idea?

The other option is to go through the first data frame twice, once to figure out how big to make the second data frame and again to fill it, but that seems even more inefficient.

oregano
  • You just do, and then append to it. That's how dynamic languages do it. This matters more in the large, with big lists being expanded, which requires copies. For small vectors and values, R already overallocates behind your back. – Dirk Eddelbuettel Jul 26 '16 at 15:09
  • It sounds a little bit like you just want to `melt` and `filter`... without a specific data example it's hard to know though. – Akhil Nair Jul 26 '16 at 15:10
  • I suggest you read chapter 2, "Growing Objects"; it covers exactly what you ask :) http://www.burns-stat.com/pages/Tutor/R_inferno.pdf The solution he suggests is the one proposed by @Roland, but he also analyzes other methods, with the system time used for certain tasks (growing in chunks vs. rbind vs. subscripting). You cn – Eugen Jul 26 '16 at 15:57

2 Answers

I'm not convinced you need this (there is probably a better solution to your poorly described actual problem), but I'll answer the question in the first paragraph. If you don't know how big the result vector needs to be, you initialize it to a reasonable size and grow it in chunks as needed. This limits the number of times the vector needs to be grown.

set.seed(42)
vec <- numeric(100) #initialize a chunk
i <- 0

repeat {
  test <- rnorm(1)
  if (test > 3) break
  i <- i + 1
  #grow in chunks:
  if (length(vec) < i) vec <- c(vec, numeric(100)) 
  vec[i] <- test
}

#shorten to final length
vec <- vec[seq_len(i)]

You actually do something like that in real life. If you buy a new shelf, you buy it big enough that you have room to spare for future book purchases. When it is full, you buy the next one (or a bigger one).
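The "or a bigger one" option can be sketched as a variation on the loop above: instead of adding a fixed chunk each time, double the capacity whenever the vector is full, so the number of reallocations grows only logarithmically with the final length. This is a minimal sketch, not a benchmarked recommendation; the names `buf` and `n` and the initial capacity of 16 are arbitrary choices for illustration.

```r
set.seed(42)
buf <- numeric(16)  # small initial capacity (arbitrary, for illustration)
n <- 0              # number of slots actually filled

repeat {
  x <- rnorm(1)
  if (x > 3) break
  # buffer full: double its capacity (the new slots are filled with NA)
  if (n == length(buf)) length(buf) <- 2 * length(buf)
  n <- n + 1
  buf[n] <- x
}

# drop the unused tail
buf <- buf[seq_len(n)]
```

Whether doubling beats fixed chunks depends on how large the result gets and how much spare memory you can tolerate; for short results the difference is unlikely to matter.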

Roland

This doesn't explicitly answer your question, but demonstrates how you might just create the data you want, using `cut` to create the "visits" and then counting the number of unique visits.

library(data.table)
set.seed(1234)
dat <- data.table(visit_time = sample(20, 100, replace = TRUE), 
                  id = sample(LETTERS[1:5], 100, replace = TRUE))
dat[ , visit := cut(visit_time, breaks = seq(0, 20, 5))]
dat[ , list(nvisits = length(unique(visit))), by = id]
#    id nvisits
# 1:  A       4
# 2:  C       4
# 3:  B       4
# 4:  D       4
# 5:  E       4

Running the following shows how many times they were at the location within the same timespan/visit:

dat[ , .N, by = list(id, visit)]
#     id   visit N
# 1:   A   (0,5] 6
# 2:   C (10,15] 5
# 3:   B (10,15] 6
# 4:   A (15,20] 3
# 5:   A (10,15] 5
# 6:   D (10,15] 6
# 7:   E  (5,10] 7
# 8:   B  (5,10] 6
# 9:   E (15,20] 4
# 10:  D   (0,5] 6
# 11:  D  (5,10] 4
# 12:  E   (0,5] 9
# 13:  C   (0,5] 4
# 14:  B (15,20] 1
# 15:  C (15,20] 9
# 16:  B   (0,5] 6
# 17:  A  (5,10] 2
# 18:  C  (5,10] 5
# 19:  D (15,20] 2
# 20:  E (10,15] 4

Edit to show how the cut function will work with time:

I took the randTime function from this excellent answer.

randTime <- function(N, st, et) {
  st <- as.POSIXct(st)
  et <- as.POSIXct(et)
  dt <- as.numeric(difftime(et, st, units = "secs"))
  ev <- sort(runif(N, 0, dt))
  rt <- st + ev
  rt
}

set.seed(1234)
st <- as.POSIXct("2012/01/01 12:00")
et <- as.POSIXct("2012/01/01 18:00")
dat2 <- data.table(visit_time = randTime(100, st, et), 
                  id = sample(LETTERS[1:5], 100, replace = TRUE))
dat2[ , visit := as.character(cut(visit_time, breaks = seq(st, et, "15 min")))]
dat2[ , length(unique(visit)), by = id]
#    id V1
# 1:  A 11
# 2:  C 13
# 3:  B 14
# 4:  D 14
# 5:  E 14
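
If fixed bins are too coarse, a gap-based rule is another way to define a visit without sizing the result by hand. The sketch below assumes that a new visit starts whenever the gap since the same person's previous entry exceeds a threshold. That rule is a guess and is not claimed to reproduce the expected table in the question. The data mimic the question's table with times as minutes past midnight, and `max_gap` is a hypothetical parameter.

```r
library(data.table)

# Toy data shaped like the question's table; time is minutes past midnight.
entries <- data.table(
  loc  = "z",
  id   = c("A", "A", "B", "A", "C", "C", "A", "A", "C", "C"),
  time = c(0, 1, 2, 2, 5, 7, 8, 9, 9, 10)
)

max_gap <- 2  # hypothetical threshold: a larger gap starts a new visit

setorder(entries, id, time)

# Within each id, flag the first row and every row whose gap to the previous
# row exceeds max_gap; the running count of flags is the visit number.
entries[, visit := cumsum(c(TRUE, diff(time) > max_gap)), by = id]

# One row per visit; the grouped aggregation sizes the result itself.
visits <- entries[, .(starttime = min(time), endtime = max(time), n = .N),
                  by = .(loc, id, visit)]
visits
```

Because the grouping does the collating, nothing has to be appended row by row; with real timestamps, the same idea should work by taking the gaps with `difftime` in explicit units.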
dayne
  • Yes. I can edit the answer to show you, but @Roland really answered the question you asked. – dayne Jul 26 '16 at 15:46
  • i.e. you should really accept his answer whether you use the solution I provided or not. – dayne Jul 26 '16 at 15:52