3

I know the answer to this question will be simple but I have searched the forums extensively and I have been unable to find a solution.

I have a column called Data_source which is a factor that I want to group my variables by.

I have a series of symptom* variables where I want the counts according to Data_source.

For some reason, I am unable to figure out how to do this. The normal group_by functions do not seem to work appropriately.

Here is the dataframe in question

 df <- wrapr::build_frame(
   "Data_source"  , "Sex"   , "symptoms_decLOC", "symptoms_nausea_vomitting" |
     "1"          , "Female", NA_character_    , NA_character_               |
     "1"          , "Female", NA_character_    , NA_character_               |
     "1"          , "Female", "No"             , NA_character_               |
     "1"          , "Female", "Yes"            , "No"                        |
     "1"          , "Female", "Yes"            , "No"                        |
     "1"          , "Female", "Yes"            , "No"                        |
     "1"          , "Male"  , "Yes"            , "No"                        |
     "1"          , "Female", "Yes"            , "No"                        |
     "2"          , "Female", NA_character_    , NA_character_               |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Female", "Yes"            , "No"                        |
     "2"          , "Female", "Yes"            , "No"                        |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Female", NA_character_    , NA_character_               |
     "2"          , "Female", NA_character_    , NA_character_               |
     "2"          , "Male"  , NA_character_    , NA_character_               |
     "2"          , "Female", NA_character_    , NA_character_               )

Notice that Sex and the symptoms variables are all factors which include NA's. I have attempted the following

df %>% na.omit() %>% group_by(Data_source) %>% count("symptoms_decLOC")

Which does not work and is less than optimal because I would have to repeat it for every column. The ideal would be to use something similar to lapply(df, count) but this does not give me description for each group.

EDIT

In response to question below, I have added the expected output. I have edited this in excel, color coding the group_by for clarity.

enter image description here

Notice how I am getting a break down for each possible answer. When I run this using dplyr here is the output.

> df %>% na.omit() %>% group_by(Data_source) %>% count("symptoms_decLOC")
# A tibble: 2 x 3
# Groups:   Data_source [2]
  Data_source `"symptoms_decLOC"`     n
  <chr>       <chr>               <int>
1 1           symptoms_decLOC         5
2 2           symptoms_decLOC         2
Patrick
  • 915
  • 2
  • 9
  • 26

4 Answers4

2

This gets most of the way: haven't figured out how to include zero-count groups yet ... supposedly adding .drop=FALSE takes care of this, but it's not working for me (using dplyr v. 0.8.0.9001).

library(dplyr)
library(tidyr)
(df
    %>% tidyr::gather(var,val,-Data_source)
    %>% count(Data_source,var,val, .drop=FALSE)
    %>% na.omit()
)

Results:

  Data_source var                       val        n
  <chr>       <chr>                     <chr>  <int>
1 1           Sex                       Female     7
2 1           Sex                       Male       1
3 1           symptoms_decLOC           No         1
4 1           symptoms_decLOC           Yes        5
5 1           symptoms_nausea_vomitting No         5
6 2           Sex                       Female     6
7 2           Sex                       Male       6
8 2           symptoms_decLOC           Yes        2
9 2           symptoms_nausea_vomitting No         2
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • Using this syntax I don't get the same output as you. I get `Error in count(., Data_source, var, val, .drop = FALSE) : unused arguments (val, .drop = FALSE)` – Patrick May 06 '19 at 11:59
  • what are the results of `find("count")` (and packageVersion("dplyr")`) ? – Ben Bolker May 06 '19 at 12:25
  • [Here is a link to the output](https://i.imgur.com/1BR09Gk.png) `> find("count") [1] "package:plyr" "package:dplyr" > packageVersion("dplyr") [1] ‘0.8.0.1’` – Patrick May 06 '19 at 12:26
  • 1
    Alright so since both packages are being loaded if I change your code to `dplyr::count` I get the expected output! Thanks – Patrick May 06 '19 at 12:30
  • You probably got a warning that some functions were being masked when you loaded `plyr` after `dplyr` ... – Ben Bolker May 06 '19 at 16:01
1

Using @Ben Bolker's answer to get counts for each group, using spread and gather to include zero count groups.

dplyr

library(dplyr)
library(tidyr)

# Count number of occurences by Data_source 
df2 <- 
  df %>% 
  gather(variable, value, -Data_source) %>% 
  count(Data_source, variable, value, name = "counter") %>%
  na.omit() 

# For variable = "Sex", leave as is
# For everything else, in this case symptom* convert into factor to include zero count group
# Then spread with dataframe will NAs filled with 0, re-convert back to long to bind rows
bind_rows(df2 %>%
            filter(variable == "Sex"), 

          df2 %>%
            filter(variable != "Sex") %>%
            mutate(value = factor(value, levels = c("Yes", "No"))) %>%
            spread(key = value, value = counter, fill = 0) %>%
            gather(value, counter, -Data_source, -variable))  %>%

  arrange(Data_source, variable)

data.table

library(data.table)
dt <- data.table(df)

# Melt data by Data source
dt_melt <- melt(dt, id.vars = "Data_source", value.factor = FALSE, variable.factor = FALSE)

# Add counter, if NA then 0 else 1
dt_melt[, counter := 0]
dt_melt[!is.na(value), counter := 1]

# Sum number of occurrences
dt_count <- dt_melt[,list(counter = sum(counter)), by = c("Data_source", "variable", "value")]

# Split into two dt
dt2a <- dt_count[variable == "Sex", ]
dt2b <- dt_count[variable != "Sex" ,]

# only on symptoms variables
# Convert into factor variable
dt2b$value <- factor(dt2b$value, levels = c("Yes", "No"))
dt2b_dcast <- dcast(data = dt2b, formula = Data_source + variable ~ value, value.var = "counter", fill = 0, drop = FALSE)
dt2b_melt <- melt(dt2b_dcast, id.vars = c("Data_source", "variable"), variable.name = "value", value.name = "counter") 

# combine
combined_d <- rbind(dt2a, dt2b_melt)
combined_d[order(Data_source, variable), ]
MKa
  • 2,248
  • 16
  • 22
  • Ya I am still unable to get the dplyr method working. Data.table does work but it adds a tremendous amount of code. – Patrick May 06 '19 at 12:06
0

I don't quite understand what you're asking, but I'll asume you want to count the number of non-NA values in each of your symptom_* columns.

This is a data.table solution:

# load library

library(data.table)

# Suppose the table is called "dt". Convert it to a data.table:

setDT(dt)

# convert the wide table to a long one, filter the values that
# aren't NA and count both, by Data_source and by variable
# (variable is the created column with the symptom_* names)

melt(dt, id.vars = 1:2)[!is.na(value), 
                        .N, 
                         by = .(Data_source, variable)]

What each part of the code is doing:

melt(dt, id.vars = 1:2) converts dt from wide to long, and keeps columns 1 and 2 (Data_source and sex) as fixed.

!is.na(value) filters the values (that were previously under each symptom_* header) that are not NA.

.N counts the rows.

by = .(Data_source, variable) is the grouping we are using to count. variable is the name of the column where the symptom_* landed during the reshaping.

PavoDive
  • 6,322
  • 2
  • 29
  • 55
  • When I run this I get the following Data_source variable N 1: 1 symptoms_decLOC 6 2: 2 symptoms_decLOC 2 3: 1 symptoms_nausea_vomitting 5 4: 2 symptoms_nausea_vomitting 2 This is not giving me stats for each individual response. – Patrick Apr 23 '19 at 22:25
0

Definitely, the hard thing is to keep combinations that don't exist in the data... Here is a solution in two steps :

1. Prepare a database without count

You can do whatever you want, but I've chosen to compute two chunks since the modalities are different for the variable Sex. No need to bind those chunks here.

chunk1 <- expand.grid(
  Data_source = c("1", "2"),
  name = c("symptoms_decLOC", "symptoms_nausea_vomitting"),
  value = c("Yes", "No"),
  stringsAsFactors = FALSE
)

chunk2 <- expand.grid(
  Data_source = c("1", "2"),
  name = "Sex",
  value = c("Female", "Male"),
  stringsAsFactors = FALSE
)

2. Finish the job asked

library(dplyr)
library(tidyr)

df %>% 
  pivot_longer(cols = c("Sex", "symptoms_decLOC", "symptoms_nausea_vomitting"))%>%
  group_by(Data_source, name, value) %>%
  summarise(count = n()) %>%
  right_join(bind_rows(chunk1, chunk2), by = c("Data_source", "name", "value")) %>%
  arrange(Data_source, name) %>%
  mutate(count = zoo::na.fill(count, 0))

Et voilà

# A tibble: 12 x 4
# Groups:   Data_source, name [6]
   Data_source name                      value  count
   <chr>       <chr>                     <chr>  <int>
 1 1           Sex                       Female     7
 2 1           Sex                       Male       1
 3 1           symptoms_decLOC           Yes        5
 4 1           symptoms_decLOC           No         1
 5 1           symptoms_nausea_vomitting Yes        0
 6 1           symptoms_nausea_vomitting No         5
 7 2           Sex                       Female     6
 8 2           Sex                       Male       6
 9 2           symptoms_decLOC           Yes        2
10 2           symptoms_decLOC           No         0
11 2           symptoms_nausea_vomitting Yes        0
12 2           symptoms_nausea_vomitting No         2

It is not so short, but it uses simple functions. The process is similar as that one can do in Excel i.e., prepare the structure and then complete the counts.

I hope it could help ;-)