I'm trying to group my observations by one set of variables nested within a second set, which is in turn nested within a third. For example, here's what I have:

     country      name     ethnicity   party

     Afghanistan  john     Pashtun     X Party
     Afghanistan  oliver   Pashtun     Y Party
     Afghanistan  brad     Tajik       X Party
     Afghanistan  chad     Hazara      X Party
     Bosnia       virgin   Serb        P Party
     Bosnia       mary     Serb        P Party
     Bosnia       jesus    Croat       C Party

What I want is, within each country, the set of all existing ethnicities listed under each party, with a count of how many people of each ethnicity belong to that party. It should look something like:

     country      party     ethnicity   count

     Afghanistan  X Party   Pashtun     1
     Afghanistan  X Party   Tajik       1
     Afghanistan  X Party   Hazara      1
     Afghanistan  Y Party   Pashtun     1
     Afghanistan  Y Party   Tajik       0
     Afghanistan  Y Party   Hazara      0
     Bosnia       P Party   Serb        2
     Bosnia       P Party   Croat       0
     Bosnia       C Party   Serb        0
     Bosnia       C Party   Croat       1

So far I've tried the functions `group_by` and `aggregate`, to no avail.

Emir Dakin

2 Answers

This is a really simple operation. Please read this book: https://r4ds.had.co.nz/

library(data.table)
library(tidyverse)

df_example <- fread("country      name     ethnicity   party coolness
Afghanistan  john     Pashtun     X_Party     cool
Afghanistan  oliver   Pashtun     Y_Party     not_cool
Afghanistan  brad     Tajik       X_Party     cool
Afghanistan  chad     Hazara      X_Party     not_cool
Bosnia       virgin   Serb        P_Party     cool
Bosnia       mary     Serb        P_Party     cool
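Bosnia       jesus    Croat       C_Party     not_cool",
                    header = TRUE)

df_example %>%
  group_by(country, ethnicity, party) %>%
  add_tally() %>%     # append the group size as n, keeping every row and column
  select(-name) %>%   # drop the per-person columns that you don't want
  distinct()          # collapse the now-identical rows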
Bruno
  • Great. How do I keep the other variables that I haven't listed in this example? The grouping left me with only a subset of the data I had. Do I have to use something like `.~`, perhaps? – Emir Dakin Jan 03 '20 at 14:20
  • Use `add_tally()` instead – Bruno Jan 03 '20 at 14:25
  • Thanks a lot. One last thing: this creates duplicates of the same ethnicity under a party, because it modifies every person's row individually. For instance, if I have 30 Pashtun observations under the X Party, I get 30 rows with a tally of 30 and the same country, ethnicity and party values, differing only in name and in the other variables that I haven't listed. How would I drop these duplicates and keep only one observation per ethnicity under a party? – Emir Dakin Jan 03 '20 at 14:48
  • Drop the variables you don't want, such as `name`, with `select()`, and then use dplyr's `distinct()` – Bruno Jan 03 '20 at 14:56
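
The comment exchange above turns on the difference between a count that collapses rows and one that keeps them. Here is a minimal sketch of that difference, assuming dplyr is installed. The `people` data frame and its extra person `sam` are made up for illustration and are not the asker's data; `add_count()` is used as the shorthand for `group_by()` followed by `add_tally()`.

```r
library(dplyr)

# Hypothetical toy data with the same columns as the question
people <- tibble(
  country   = c("Afghanistan", "Afghanistan", "Afghanistan"),
  name      = c("john", "brad", "sam"),
  ethnicity = c("Pashtun", "Tajik", "Pashtun"),
  party     = c("X Party", "X Party", "X Party")
)

# count() collapses to one row per group and drops all other columns
collapsed <- people %>% count(country, party, ethnicity)

# add_count() keeps every original row and appends the group size as n
kept <- people %>% add_count(country, party, ethnicity)

# Dropping the per-person column and de-duplicating leaves one row per group
deduped <- kept %>% select(-name) %>% distinct()
```

With more per-person columns than `name`, each of them would need to be dropped before `distinct()`, otherwise rows that differ only in those columns survive as apparent duplicates.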

You can use dplyr and tidyr:

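library(dplyr)
library(tidyr)

df %>%
 count(!!!select(., -name)) %>%   # count rows by every column except name
 group_by(country) %>%            # fill in combinations within each country only
 complete(ethnicity, nesting(party), fill = list(n = 0))  # absent pairs get n = 0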

   country     ethnicity party       n
   <chr>       <chr>     <fct>   <dbl>
 1 Afghanistan Hazara    X Party     1
 2 Afghanistan Hazara    Y Party     0
 3 Afghanistan Pashtun   X Party     1
 4 Afghanistan Pashtun   Y Party     1
 5 Afghanistan Tajik     X Party     1
 6 Afghanistan Tajik     Y Party     0
 7 Bosnia      Croat     C Party     1
 8 Bosnia      Croat     P Party     0
 9 Bosnia      Serb      C Party     0
10 Bosnia      Serb      P Party     2
tmfmnk
  • Thanks, but this code took too long to execute on my Mac; I don't know why. However, it was very helpful in reaching my goal. – Emir Dakin Jan 03 '20 at 17:03
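
For readers who hit the same slowness, here is a hedged alternative that needs no extra packages. It is a sketch in base R, not a claim that it is faster: it splits the data by country and lets `table()` supply the zero counts for party/ethnicity pairs that never occur. The data frame is retyped from the question, and the names `per_country` and `result` are illustrative.

```r
# Example data retyped from the question
df <- data.frame(
  country   = c(rep("Afghanistan", 4), rep("Bosnia", 3)),
  name      = c("john", "oliver", "brad", "chad", "virgin", "mary", "jesus"),
  ethnicity = c("Pashtun", "Pashtun", "Tajik", "Hazara", "Serb", "Serb", "Croat"),
  party     = c("X Party", "Y Party", "X Party", "X Party",
                "P Party", "P Party", "C Party"),
  stringsAsFactors = FALSE
)

# For each country, cross-tabulate party by ethnicity; table() reports 0
# for combinations that do not occur within that country
per_country <- lapply(split(df, df$country), function(d) {
  counts <- as.data.frame(
    table(party = d$party, ethnicity = d$ethnicity),
    responseName = "count",
    stringsAsFactors = FALSE
  )
  data.frame(country = d$country[1], counts, stringsAsFactors = FALSE)
})

# Stack the per-country tables into one data frame
result <- do.call(rbind, per_country)
rownames(result) <- NULL
```

Because the split happens before the cross-tabulation, parties and ethnicities from one country are never combined with those of another, which matches the layout requested in the question.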