Count rows in data table with certain values by group

Question

I have a a data table that somewhat looks like this:

Property    Type
1           apartment
1           office
2           office
2           office
3           apartment
3           apartment
3           office

I now want to count offices and apartments by property:

Property    Type       number_of_offices    number_of_apartments
       1    apartment                  1                       1
       1    office                     1                       1
       2    office                     2                       0
       2    office                     2                       0
       3    apartment                  1                       2
       3    apartment                  1                       2
       3    office                     1                       2

I tried

my.DT <- myDT[,.(Type=Type, number_of_offices=nrow(my.DT[my.DT$Type=="office",]), number_of_apartments=nrow(my.DT$Type=="apparment",], by="Property")

However, this only gives me the total counts for the whole data table. Does anyone have a solution?

Hi laser.p , welcome to SO! To make your questions easier to answer you can follow these guidelines on how to make a good reproducible example in R: https://stackoverflow.com/q/49860402/1870254 — jan-glx, Mar 12 '20 at 14:35

score 4 · Accepted Answer · answered Mar 12 '20 at 14:38

You can solve it as follows:

cols <- c("number_of_offices", "number_of_apartments")
df[, (cols) := .(sum(Type == "office"), sum(Type == "apartment")), Property]

# Property      Type number_of_offices number_of_apartments
# 1:        1 apartment                 1                    1
# 2:        1    office                 1                    1
# 3:        2    office                 2                    0
# 4:        2    office                 2                    0
# 5:        3 apartment                 1                    2
# 6:        3 apartment                 1                    2
# 7:        3    office                 1                    2

Arturo Sbr · Answer 2 · 2020-03-12T14:24:45.293

Is there a particular reason why you want to merge the grouped counts with my.df?

You can try this, which will give you the counts grouped by Property and Type. Then merge with the original myDT:

grouped = myDT[, .N, by=c('Property','Type')]
myDT = merge(myDT, grouped[Type == 'apartment', list(Property,N)], by='Property', all.x=TRUE)
myDT = merge(myDT, grouped[Type == 'office', list(Property,N)], by='Property', all.x=TRUE)
setnames(myDT, c('N.x','N.y'), c('Number of appartments','Number of offices'))
myDT[is.na(myDT)] <- 0

> myDT
   Property      Type Number of appartments Number of offices
1:        1 apartment                     1                 1
2:        1    office                     1                 1
3:        2    office                     0                 2
4:        2    office                     0                 2
5:        3 apartment                     2                 1
6:        3 apartment                     2                 1
7:        3    office                     2                 1

Thanks for your answer! This yields the correct numbers, which is already a step forward. The reason for merging it with my.df is that I later want to merge it with a much larger list. That list has one row per property with a lot of columns containing information (e.g. number of offices and appartments, etc.). — laser.p, Mar 12 '20 at 14:22
I edited the answer slightly. I added a left join to get the desired result. — Arturo Sbr, Mar 12 '20 at 14:30

score 0 · Answer 3 · answered Mar 12 '20 at 15:41

Your approach is a good start!

my.DT <- myDT[,.(Type=Type, number_of_offices=nrow(my.DT[my.DT$Type=="office",]), number_of_apartments=nrow(my.DT$Type=="apparment",], by="Property")

The brackets don't match, I assume there was some copy-paste error. These can efficiently avoided by following the guidelines on how to create a great reproducible example & reprex package. The code in your question could then look something like this:

library(data.table)
#> Warning: package 'data.table' was built under R version 3.6.2

rooms_dt <- fread(text =
"Property    Type
1           apartment
1           office
2           office
2           office
3           apartment
3           apartment
3           office
")
rooms_dt[, .(
  Type=Type, 
  number_of_offices=nrow(rooms_dt[rooms_dt$Type=="office",]), 
  number_of_apartments=nrow(rooms_dt[rooms_dt$Type=="apartment",])
), by="Property"]
#>    Property      Type number_of_offices number_of_apartments
#> 1:        1 apartment                 4                    3
#> 2:        1    office                 4                    3
#> 3:        2    office                 4                    3
#> 4:        2    office                 4                    3
#> 5:        3 apartment                 4                    3
#> 6:        3 apartment                 4                    3
#> 7:        3    office                 4                    3

^{Created on 2020-03-12 by the reprex package (v0.3.0)}

As you noticed, the problem with that approach is that the j expression that gets evaluated for every group (= every Property) the numbers of all apartment/office rows are counted not just the one of the current Property. This could be avoided by subsetting not only for the correct type but also the correct property (available in j as the special variable .BY). This already yields the desired result:

rooms_dt[, .(
  Type=Type, 
  number_of_offices=nrow(rooms_dt[rooms_dt$Property == .BY & rooms_dt$Type=="office",]), 
  number_of_apartments=nrow(rooms_dt[rooms_dt$Property == .BY & rooms_dt$Type=="apartment",])
), by="Property"]
#>    Property      Type number_of_offices number_of_apartments
#> 1:        1 apartment                 1                    1
#> 2:        1    office                 1                    1
#> 3:        2    office                 2                    0
#> 4:        2    office                 2                    0
#> 5:        3 apartment                 1                    2
#> 6:        3 apartment                 1                    2
#> 7:        3    office                 1                    2

However, this solution can be simplified quite a bit: First and foremost, the columns of the data.table can be accessed directly by their name within []:

rooms_dt[, .(
  Type=Type,
  number_of_offices=nrow(rooms_dt[Property == .BY & Type=="office",]),
  number_of_apartments=nrow(rooms_dt[Property == .BY & Type=="apartment",])
), by="Property"]

Furthermore, data.table provides the current data.table subsetted [sic] to the current group as .SD, which allows to simplify this further:

rooms_dt[, .(
  Type=Type,
  number_of_offices=nrow(.SD[Type=="office",]),
  number_of_apartments=nrow(.SD[Type=="apartment",])
), by="Property"]

Since not only .SD is subsetted, but all columns are, it is even simpler to just work with them directly:

rooms_dt[, .(
  Type = Type,
  number_of_offices = sum(Type=="office"),
  number_of_apartments = sum(Type=="apartment")
), by="Property"]

One of the strengths of data.table is that in contrast to most R stuff is, that they are mutable. This allows to just add the new columns directly to the old data.table without creating a new one:


rooms_dt[, number_of_offices := sum(Type=="office"), by="Property"]
rooms_dt[, number_of_apartments := sum(Type=="apartment"), by="Property"]

This culd also be written as:

rooms_dt[, c("number_of_offices", "number_of_apartments") := list(sum(Type=="office"), sum(Type=="apartment")), by="Property"]

or as :

rooms_dt[, `:=`(number_of_offices = sum(Type=="office"), number_of_apartments = sum(Type=="apartment")), by="Property"]

This was not as easy as it could be, probably because it is rearly needed, often it is desirable to have one of these instead:

rooms_dt[, count_in_property := .N, by=.(Property, Type)]
rooms_dt
#>    Property      Type number_in_property
#> 1:        1 apartment                  1
#> 2:        1    office                  1
#> 3:        2    office                  2
#> 4:        2    office                  2
#> 5:        3 apartment                  2
#> 6:        3 apartment                  2
#> 7:        3    office                  1

or

dcast(rooms_dt[, .(number_of = .N), by=.(Property, Type)], Property ~ Type, fill = 0)
#> Using 'number_of' as value column. Use 'value.var' to override
#>    Property apartment office
#> 1:        1         1      1
#> 2:        2         0      2
#> 3:        3         2      1

score 0 · Answer 4 · answered Mar 12 '20 at 15:58

This is the data.table way to do this for more than two room types, I think:

dcast(
  rooms_dt[, .(number_of=.N), by=.(Property, Type)], 
  Property ~ Type, 
  fill=0
)[rooms_dt, on=.(Type)]

Full example:

library(data.table)
#> Warning: package 'data.table' was built under R version 3.6.2
rooms_dt <- fread(text =
                    "Property    Type
1           apartment
1           office
2           office
2           office
3           apartment
3           apartment
3           office
")

rooms_summary_dt <- rooms_dt[, .(number_rooms_of_type_at_property = .N), by=.(Property, Type)]
rooms_summary_dt
#>    Property      Type number_rooms_of_type_at_property
#> 1:        1 apartment                                1
#> 2:        1    office                                1
#> 3:        2    office                                2
#> 4:        3 apartment                                2
#> 5:        3    office                                1

properties_dt <- dcast(rooms_summary_dt, Property ~ Type, fill = 0)
#> Using 'number_rooms_of_type_at_property' as value column. Use 'value.var' to override
properties_dt
#>    Property apartment office
#> 1:        1         1      1
#> 2:        2         0      2
#> 3:        3         2      1

properties_dt[rooms_dt, on=.(Property)]
#>    Property apartment office      Type
#> 1:        1         1      1 apartment
#> 2:        1         1      1    office
#> 3:        2         0      2    office
#> 4:        2         0      2    office
#> 5:        3         2      1 apartment
#> 6:        3         2      1 apartment
#> 7:        3         2      1    office

^{Created on 2020-03-12 by the reprex package (v0.3.0)}

Sotos · Answer 5 · 2020-03-12T14:28:12.117

Here is the dplyr/tidyr solution,

library(dplyr)
library(tidyr)

df %>% 
 group_by_all() %>% 
 count() %>% 
 pivot_wider(names_from = Type, values_from = n) %>% 
 left_join(., df, by = 'Property')

which gives,

# A tibble: 7 x 4
# Groups:   Property [3]
  Property apartment office Type     
     <int>     <int>  <int> <fct>    
1        1         1      1 apartment
2        1         1      1 office   
3        2        NA      2 office   
4        2        NA      2 office   
5        3         2      1 apartment
6        3         2      1 apartment
7        3         2      1 office

DATA

dput(df)
structure(list(Property = c(1L, 1L, 2L, 2L, 3L, 3L, 3L), Type = structure(c(1L, 
2L, 2L, 2L, 1L, 1L, 2L), .Label = c("apartment", "office"), class = "factor")), class = "data.frame", row.names = c(NA, 
-7L))

Count rows in data table with certain values by group

5 Answers5