0

I would like to convert the following data.frame into a matrix where the the number of each bike station id that occurs per hour is counted.


> dim(test)
[1] 80623     5

head(test, n = 10)
   bikeid end.station.id start.station.id diff.time hour
1   16052            244              322      6544   14
2   16052            284              432      3406   21
3   16052            461              519     33416    3
4   16052            228              519     26876   13
5   16052             72              435       388   17
6   16052            319              127     27702   11
7   16052            282             2002     33882    4
8   16052            524             2021      2525   10
9   16052            387              351      2397   12
10  16052            388              526     32507   13


The output should look like this.

> sample2
   start.station.id  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
1                72 44  1 42 22  9 33 39 47 12 30 39 52 43 45 40 62  9 35 24 43 65 59 58 34
2                79 21 11  2 42  5 18 57 64 32 47 61 43 65 38 46 61 48 29 58 22 35  4 50 31
3                82 19 44  7 52 14 19  3 30 25 60 33 60 48 54 25 24 42 62 13 51 23 43 54  7
4                83 45 60 64  5  0  3 54 16 48 67 49 20 59 21 24 38 42 62 38 24  1 35 16  4
5               116 27 62 64 44 55 65 23 13 36  0 62 54 61  6 16  7 58 41 29  1 34 58 35 67
6               119 45 30 41 26  7 39 16 55 28 53 42  9  5 31 18 16 14 37 17 14 16 17 23 50
7               120  3  2  7 53 21 33 31 48 19 50 35 47  8 17 30  9 49  4 48 28 52  9 57 55
8               127 33 44 47 42  6 46 39 30 39 28 19 57 53 41 45 55  9 27 42 19 43 24 37 55
9               137 53 11 60  1 66 37 16  5  2 58  0 46 33  0 60 54 25 66 65 40 36 47 58 40
10              143 61  1 50 62 57 33 12 15 27 19 65 48 12 55 64 14 22 13 12 57 45 13 66 56 66 56

I was advised to use a formula similar to :

matrix <- test %>% 
  group_by(start.station.id, hour)%>%
  summarise(sum = nrow) %>%
  spread(hour, nrow) 

but do not know how to code it properly

iskandarblue
  • 7,208
  • 15
  • 60
  • 130
  • Perhaps you can try the `reshape2` package and the `dcast` function like this `dcast(df,start.station.id~hour,length)` – Roman Feb 02 '16 at 14:16

1 Answers1

1

Using data.table:

library(data.table) #1.9.6+
setDT(test)
dcast(test[ , .N, by = .(start.station.id, hour)],
      start.station.id ~ hour, value.var = "N")

Alternatively (slower, though cleaner):

dcast(test, start.station.id ~ hour, fun.aggregate = length, value.var = "hour")

Testing on some fake data:

set.seed(10932)
NN <- 1e6
test <- data.table(start.station.id = sample(1000, NN, T),
                   hour = sample(24, NN, T))

library(microbenchmark)

microbenchmark(times = 100L,
               preagg = dcast(test[ , .N, by = .(start.station.id, hour)],
                              start.station.id ~ hour, value.var = "N"),
               postagg = dcast(test, start.station.id ~ hour, 
                               fun.aggregate = length, value.var = "hour"))

Unit: milliseconds
    expr      min       lq      mean   median        uq      max neval
  preagg 55.83240 59.88939  66.56289 61.37408  64.37049 166.8902   100
 postagg 91.16012 93.68588 101.17297 96.04823 101.20717 203.4270   100

The reason the first is faster is that the operation test[ , .N, by = vars] has been optimized in data.table.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198