How to transform a dataset in R?

Question

For my job I am trying to write some code to calculate the required number of parking spaces. I have data about the number of cars arriving each hour and about the parking duration (generated via rnorm) of each car in the car park. Now I would like to calculate, per minute, how many parking spaces are required.

dataset

(Hour & Attraction_intensity variables)

timeonparking <- round(rnorm(14, mean = 35, sd = 10))

First I would like to generate X numbers (uniform distribution; representing minute of arrival within the given hour) for each row/hour between 0-59 where X is equal to the attraction_intensity that hour.

The new dataframe should look like this:

new dataframe

Could someone help me please? My first idea was to use a for loop, but this does not produce the table shown above, and the code contains errors which I cannot find (I am a beginner in R). I don't know how to transform the dataset.

First attempt:

for (i in nrow(df) {
    df1 <- paste(df$ï..hour[i], list(runif(df$attraction_vehicles[i], min = 0, max = 59)))
}

Welcome to Stack Overflow! Could you make your problem reproducible by sharing a sample of your data so others can help (please do not use `str()`, `head()` or screenshot)? You can use the [`reprex`](https://reprex.tidyverse.org/articles/articles/magic-reprex.html) and [`datapasta`](https://cran.r-project.org/web/packages/datapasta/vignettes/how-to-datapasta.html) packages to assist you with that. See also [Help me Help you](https://speakerdeck.com/jennybc/reprex-help-me-help-you?slide=5) & [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269) — Tung, Jan 12 '20 at 13:19
Three things: (1) the use of `list` in the `paste` is unnecessary, any reason you have that? (2) it seems odd that your code includes a unicode column name though the image of the data does not. (3) When including sample data with some form of randomness, it is really helpful to start from a known point, please use `set.seed` so that your intermediate steps can be verified. — r2evans, Jan 12 '20 at 13:27
@Tung That's the first time I've heard of the `datapasta` package -- looks awesome! OP: You might be interested in the following related problem: https://stackoverflow.com/questions/19518728/replicate-each-row-of-data-frame-and-specify-the-number-of-replications-for-each/19519828#19519828 — duckmayr, Jan 12 '20 at 13:36

r2evans · Accepted Answer · 2020-01-12T20:48:02.880

There are several ways to approach this, but let's start from a known point:

dat <- data.frame(
  hour = c("5:00:00", "6:00:00", "7:00:00"),
  attraction = c(1, 3, 6)
)
dat$hour <- as.POSIXct(dat$hour, format = "%H:%M:%S")
dat
#                  hour attraction
# 1 2020-01-12 05:00:00          1
# 2 2020-01-12 06:00:00          3
# 3 2020-01-12 07:00:00          6

Since you're looking to do time-based calcs, I set hour as a POSIXt type. (If you have a "date" component in your data as well, you'll want to include that in the conversion, but if this is always in the same day, then it does not appear to really matter.)
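If the data does carry a date, a minimal sketch of including it in the conversion could look like this. The column names date and when, the sample values, and tz = "UTC" are assumptions made for illustration:

```r
# Hypothetical input: a separate date column next to the hour column.
dat_dated <- data.frame(
  date = c("2020-01-12", "2020-01-12", "2020-01-13"),
  hour = c("5:00:00", "23:00:00", "0:00:00"),
  attraction = c(1, 3, 6)
)

# Paste date and time together, then parse both with one format string.
# tz = "UTC" is an assumption; use the time zone the data was recorded in.
dat_dated$when <- as.POSIXct(
  paste(dat_dated$date, dat_dated$hour),
  format = "%Y-%m-%d %H:%M:%S",
  tz = "UTC"
)
```

With the date included, differences across midnight (23:00 to 0:00 in the rows above) come out as one hour instead of minus 23 hours.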

From here, we can introduce random minutes for each arrival:

set.seed(42)
dat2 <- do.call(
  "rbind.data.frame",
  Map(function(hr, n) data.frame(hour = hr, min = round(runif(n, min = 0, max = 59))),
      dat$hour, dat$attraction)
)
dat2
#                   hour min
# 1  2020-01-12 05:00:00  54
# 2  2020-01-12 06:00:00  55
# 3  2020-01-12 06:00:00  17
# 4  2020-01-12 06:00:00  49
# 5  2020-01-12 07:00:00  38
# 6  2020-01-12 07:00:00  31
# 7  2020-01-12 07:00:00  43
# 8  2020-01-12 07:00:00   8
# 9  2020-01-12 07:00:00  39
# 10 2020-01-12 07:00:00  42
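
One detail worth knowing: round(runif(n, min = 0, max = 59)) does not give every minute the same probability, because only half as wide an interval rounds to 0 or to 59 as to any minute in between. If that matters, a possible alternative is to draw whole minutes with sample() and repeat the hours with rep(). This is a sketch, not a replacement for the code above; dat2_alt is a name introduced here, and the drawn minutes will differ from the output shown above:

```r
dat <- data.frame(
  hour = as.POSIXct(c("5:00:00", "6:00:00", "7:00:00"), format = "%H:%M:%S"),
  attraction = c(1, 3, 6)
)

set.seed(42)
dat2_alt <- data.frame(
  # repeat each hour once per arriving car
  hour = rep(dat$hour, times = dat$attraction),
  # draw whole minutes 0..59, each with probability 1/60
  min = sample(0:59, size = sum(dat$attraction), replace = TRUE)
)
```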

I don't know if you need the minute separately or as a real time, so perhaps

dat2$arrival_time <- dat2$hour + (60 * dat2$min)
dat2
#                   hour min        arrival_time
# 1  2020-01-12 05:00:00  54 2020-01-12 05:54:00
# 2  2020-01-12 06:00:00  55 2020-01-12 06:55:00
# 3  2020-01-12 06:00:00  17 2020-01-12 06:17:00
# 4  2020-01-12 06:00:00  49 2020-01-12 06:49:00
# 5  2020-01-12 07:00:00  38 2020-01-12 07:38:00
# 6  2020-01-12 07:00:00  31 2020-01-12 07:31:00
# 7  2020-01-12 07:00:00  43 2020-01-12 07:43:00
# 8  2020-01-12 07:00:00   8 2020-01-12 07:08:00
# 9  2020-01-12 07:00:00  39 2020-01-12 07:39:00
# 10 2020-01-12 07:00:00  42 2020-01-12 07:42:00

I should note that your use of rnorm can result in negative minutes, since the normal distribution is unbounded in both directions; using sd = 10 with mean = 35 makes that unlikely, certainly, but not impossible. If you need the random values to always stay within fixed bounds (for example, an arrival time that must fall within the specified hour), then either your use of runif is better or you might consider a truncated normal distribution such as the one provided by the truncnorm package.
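
If installing truncnorm is not an option, a truncated normal can be sketched in base R with the inverse-CDF method. The helper name rtnorm_base and the bounds of 1 to 120 minutes are assumptions made for illustration; they are not part of the question:

```r
# Draw n values from a normal(mean, sd) restricted to [lower, upper]:
# sample uniformly between the CDF values of the two bounds, then map
# the result back through the quantile function.
rtnorm_base <- function(n, mean, sd, lower, upper) {
  p_lo <- pnorm(lower, mean = mean, sd = sd)
  p_hi <- pnorm(upper, mean = mean, sd = sd)
  qnorm(runif(n, min = p_lo, max = p_hi), mean = mean, sd = sd)
}

set.seed(42)
# Parking durations in whole minutes, never below 1 or above 120.
timeonparking <- round(rtnorm_base(14, mean = 35, sd = 10, lower = 1, upper = 120))
```

This approach is simple but can lose precision when the bounds lie far out in one tail; for such cases a dedicated package is likely the safer choice.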

Note: I use Map, which is a multi-parameter version of lapply. There are often advantages (sometimes in performance, sometimes readability) to using functions from R's apply family, and while the performance benefits have mostly been mitigated (historically for was often slower than sapply), some still find *apply better. In the case of Map, I've written a few answers explaining (by "unrolling" it) how it works: https://stackoverflow.com/a/57367292 and https://stackoverflow.com/a/54485425.
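
To illustrate what "unrolling" means, here is a small sketch comparing Map with the same calls written out by hand. The helper add_label and the vectors hours and counts are made up for this example:

```r
add_label <- function(hr, n) paste(hr, "->", n, "cars")  # hypothetical helper

hours  <- c("5:00", "6:00", "7:00")
counts <- c(1, 3, 6)

# Map calls add_label once per position, pairing hours[i] with counts[i].
via_map <- Map(add_label, hours, counts)

# The same thing written out by hand: one call per position.
by_hand <- list(
  add_label(hours[1], counts[1]),
  add_label(hours[2], counts[2]),
  add_label(hours[3], counts[3])
)

# Map names the result after its first argument when that is a character
# vector, so drop the names before comparing.
identical(unname(via_map), by_hand)
```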

To get occupancy-rates (how many cars in a given period), I suggest you use cut to bin the arrival times. We can create bin boundaries with something like:

myseq <- round(range(dat2$arrival_time) + c(-1800,1800), "hour")
myseq
# [1] "2020-01-12 05:00:00 PST" "2020-01-12 08:00:00 PST"

myseq <- seq.POSIXt(myseq[1], myseq[2], by = "min")
length(myseq)
# [1] 181

myseq <- myseq[seq_along(myseq) %% 10 == 1]
myseq
#  [1] "2020-01-12 05:00:00 PST" "2020-01-12 05:10:00 PST" "2020-01-12 05:20:00 PST"
#  [4] "2020-01-12 05:30:00 PST" "2020-01-12 05:40:00 PST" "2020-01-12 05:50:00 PST"
#  [7] "2020-01-12 06:00:00 PST" "2020-01-12 06:10:00 PST" "2020-01-12 06:20:00 PST"
# [10] "2020-01-12 06:30:00 PST" "2020-01-12 06:40:00 PST" "2020-01-12 06:50:00 PST"
# [13] "2020-01-12 07:00:00 PST" "2020-01-12 07:10:00 PST" "2020-01-12 07:20:00 PST"
# [16] "2020-01-12 07:30:00 PST" "2020-01-12 07:40:00 PST" "2020-01-12 07:50:00 PST"
# [19] "2020-01-12 08:00:00 PST"

The first command finds the range of times and rounds it out to the next hour. (The use of +c(-1800,1800) ensures that the round will give us a floor and ceiling, respectively. This might find corner cases that are imperfect, but it should work most of the time.) The second command creates a per-minute sequence, 181 long here (three hours). The third command cuts this to just one every 10 minutes.

You should be able to easily adjust these three commands to your needs.
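
As one possible adjustment, the boundaries can also be built with trunc() and a 10-minute step passed directly to seq(), which avoids the +/- 1800 second shift. This is a sketch under assumptions: arrivals, lo, hi, and breaks are names introduced here, the sample times are made up, and tz = "UTC" is chosen only to keep the example reproducible:

```r
arrivals <- as.POSIXct(
  c("2020-01-12 05:54:00", "2020-01-12 06:17:00", "2020-01-12 07:43:00"),
  tz = "UTC"
)

# Floor the earliest arrival to the start of its hour.
lo <- as.POSIXct(trunc(min(arrivals), units = "hours"))
# Upper bound: floor the latest arrival, then add one hour, so an arrival
# exactly on the hour still has a full hour of bins after it.
hi <- as.POSIXct(trunc(max(arrivals), units = "hours")) + 3600

# One boundary every 10 minutes, from 05:00 through 08:00.
breaks <- seq(lo, hi, by = "10 min")
length(breaks)
```

The resulting breaks vector plays the same role as myseq above.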

From here, you can use

cut(dat2$arrival_time, myseq)
#  [1] 2020-01-12 05:50:00 2020-01-12 06:50:00 2020-01-12 06:10:00 2020-01-12 06:40:00
#  [5] 2020-01-12 07:30:00 2020-01-12 07:30:00 2020-01-12 07:40:00 2020-01-12 07:00:00
#  [9] 2020-01-12 07:30:00 2020-01-12 07:40:00
# 18 Levels: 2020-01-12 05:00:00 2020-01-12 05:10:00 2020-01-12 05:20:00 ... 2020-01-12 07:50:00

which gives you which 10-minute bin each arrival belongs to. A quick summary can be done with

table(cut(dat2$arrival_time, myseq))
# 2020-01-12 05:00:00 2020-01-12 05:10:00 2020-01-12 05:20:00 2020-01-12 05:30:00 
#                   0                   0                   0                   0 
# 2020-01-12 05:40:00 2020-01-12 05:50:00 2020-01-12 06:00:00 2020-01-12 06:10:00 
#                   0                   1                   0                   1 
# 2020-01-12 06:20:00 2020-01-12 06:30:00 2020-01-12 06:40:00 2020-01-12 06:50:00 
#                   0                   0                   1                   1 
# 2020-01-12 07:00:00 2020-01-12 07:10:00 2020-01-12 07:20:00 2020-01-12 07:30:00 
#                   1                   0                   0                   3 
# 2020-01-12 07:40:00 2020-01-12 07:50:00 
#                   2                   0
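
cut() counts arrivals per bin; to estimate how many spaces are occupied at each minute, the parking duration has to be included as well. A minimal sketch, assuming each car occupies a space from its arrival minute up to (but not including) its departure minute, and using made-up arrival minutes and durations (arrive, stay, depart, and occupied are names introduced here):

```r
# Hypothetical data: arrival as minute of the day (0..1439) and
# parking duration in minutes for each car.
arrive <- c(354, 415, 377, 409)
stay   <- c(30, 45, 20, 10)
depart <- arrive + stay

minutes <- 0:1439
# For each minute, count the cars with arrive <= minute < depart.
occupied <- vapply(
  minutes,
  function(m) sum(arrive <= m & m < depart),
  integer(1)
)

# Peak number of spaces needed over the day.
max(occupied)
```

Cars that would leave after midnight are simply cut off at the end of the day in this sketch; handling multi-day data would need the full arrival_time rather than a minute of the day.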

Thank you very much for your help. This code works very well to define the boundaries of my intervals. Can I ask you a follow-up question? I would like to calculate the number of required parking spaces per minute (1 - 1440 minutes), with `x <- 1:1440`. How can I count the number of intervals in which `x[i]` lies in R? — Philippe Goethals, Jan 12 '20 at 18:11
See my edit. If it gets more complex, you might need a new question. — r2evans, Jan 12 '20 at 20:48
Sorry for the late answer. The code works. Thank you for your help. — Philippe Goethals, Jan 28 '20 at 19:22
