Count factor levels over time

Question

I have the following data.frame that looks like this:

head(entries,10)

     Provider.Region      year.start    month.start day.start  Provider.Status
23511      North West       0010          05        17 Deregistered (V)
23512      North West       0010          05        17 Deregistered (V)
23709   West Midlands       0010          06        01       Registered
23562          London       0010          06        10       Registered
23563          London       0010          06        10       Registered
23566          London       0010          06        10       Registered
23764   West Midlands       0010          06        10 Deregistered (V)
23508          London       0010          06        11 Deregistered (V)
23555   West Midlands       0010          06        11       Registered
23497      South East       0010          06        14 Deregistered (V)

I want to count the factor level corresponding to Provider.Status on a monthly basis. My desired output should be something like this:

head(entries.1, 3)

time    region        Deregistered (V) Registered 
5-0010  North West        2              0
6-0010  West Midlands     2              1
6-0010  London            1              3

At the moment I have been using dplyr as follows

library(dplyr)
entries %>%
  group_by(Provider.Region, year.start, month.start) %>%
  mutate(counts_status = n())

But still does not yield my expected output as it gives something like:

Source: local data frame [23,775 x 6]
Groups: Provider.Region, year.start, month.start [606]

Provider.Region year.start month.start  Provider.Status counts_status
(fctr)     (fctr)      (fctr)              (fctr)         (int)
1       North West       0010          05 Deregistered (V)      2
2       North West       0010          05 Deregistered (V)      2
3    West Midlands       0010          06 Registered            4
4           London       0010          06 Registered            7
5           London       0010          06 Registered            7
6           London       0010          06 Registered            7
7    West Midlands       0010          06 Deregistered (V)      4
8           London       0010          06 Deregistered (V)      7
9    West Midlands       0010          06 Registered            4
10      South East       0010          06 Deregistered (V)      10
..             ...        ...         ...       ...              ...

Is there any compact way where I can create variables from the counts? Many thanks in advance

may you add a `dput(head(entries,10))`to your question ? (will be easiest to reproduce your dataset) — Tensibai, Feb 23 '16 at 10:50
You might find `table(DF$Provider.Region, DF$Provider.Status)` helpful after aggregating as needed. — alexis_laz, Feb 23 '16 at 11:16
@akrun I think this probably is a dupe, but not of that question. Reshaping usually means rearranging the data, not calculating summary stats (like counts). Anyway, I'll undupe for now. Hopefully, someone takes the effort to find a proper dupe later. — Frank, Feb 23 '16 at 20:48

Jaap · Accepted Answer · 2016-02-23T11:27:16.157

This can be achieved using the dcast function from the reshape2 or data.table packages:

library(reshape2)
dcast(mydf, paste(year.start,month.start,sep="-") + Provider.Region ~ Provider.Status)

library(data.table)
dcast(setDT(mydf), paste(year.start,month.start,sep="-") + Provider.Region ~ Provider.Status)

the output of the last one:

   year.start Provider.Region Deregistered(V) Registered
1:    0010-05       NorthWest               2          0
2:    0010-06          London               1          3
3:    0010-06       SouthEast               1          0
4:    0010-06    WestMidlands               1          2

When using the above code, you will get a warning message:

Using 'Provider.Status' as value column. Use 'value.var' to override
Aggregate function missing, defaulting to 'length'

This has no implications, but to prevent that you can specify the value.var and the aggregation function:

dcast(setDT(mydf), 
      paste(year.start,month.start,sep="-") + Provider.Region ~ Provider.Status,
      value.var = "Provider.Status", fun.aggregate = length)

Thanks @Jaap. That´s exactly what I needed. For me `data.table` works better than `reshape` since it keeps the names of the variables. Cheers — Edu, Feb 23 '16 at 13:10

score 1 · Answer 2 · answered Feb 23 '16 at 11:04

1

You can use the reshape2 package to produce such a table:

library(reshape2)
d <- data.frame(region=rep(c("A", "B", "C"), each=2), timepoint = c(1, 1, 1, 1, 2, 2), provider=rep(c("D", "R"), 3), count_status = 1:6)
dcast(d, region + timepoint ~ provider, value.var = "count_status")

to get this output:

  region timepoint D R
1      A         1 1 2
2      B         1 3 4
3      C         2 5 6

answered Feb 23 '16 at 11:04

Michael Kuhn

8,302
6
26
27

Good use of reshape. Plus one. – akrun Feb 23 '16 at 16:34

Count factor levels over time

2 Answers2