0

I have the following data in messy format:

structure(list(com_level = c("B", "B", "B", "B", "A", "A"), 
                        hf_com = c(1, 1, 1, 1, 1, 1), 
                        sal_level = c("2", "3", "1", "2", "1", "4"), 
                        exp_sal = c(NA, 1, 1, NA, 1, NA)), 
                        class = c("tbl_df", "tbl", "data.frame"), 
                        row.names = c(NA, -6L))

Column com_level is the factor with 2 levels and column hf_com gives the frequency count for that level.

Column sal_level is the factor with 4 levels and column exp_sal gives the frequency count for that level.

I want to create a contingency table similar to this:

structure(list(`1` = c(1L, 2L), 
               `2` = c(0L, 1L), 
               `3` = c(0L, 2L), 
               `4` = c(1L, 0L)), 
               row.names = c("A", "B"), class = "data.frame")

I have code that works when I want to compare two columns with the same factor:

# 1 step to create table with frequency counts for exp_sal and curr_sal per category of level
cs_es_table <- df_not_na_num %>%
  dplyr::count(sal_level, exp_sal, curr_sal) %>%
  tidyr::spread(key = sal_level,value = n) %>% # this code spreads on just one key
  select(curr_sal, exp_sal, 1, 2, 3, 4, 5, 6, 7, -8) %>% # reorder columns and omit Column 8 (no answer) 
  as.data.frame()

# step 2- convert cs_es_table to long format and summarise exp_sal and curr_sal frequencies
cs_es_table <- cs_es_table %>% 
  gather(key, value, -curr_sal,-exp_sal) %>% # crucial step to make data long
  mutate(curr_val = ifelse(curr_sal == 1,value,NA),
         exp_val = ifelse(exp_sal == 1,value,NA)) %>% #mutate actually cleans up the data and assigns a value to each new column for 'exp' and 'curr'
  group_by(key) %>% #for your summary, because you want to sum up your previous rows which are now assigned a key in a new column
  summarise_at( .vars = vars(curr_val, exp_val), .funs = sum, na.rm = TRUE)

This code produces this table but just spreads on one key in step 1:

structure(list(curr_val = c(533L, 448L, 237L, 101L, 56L), exp_val = c(179L, 
577L, 725L, 401L, 216L)), row.names = c("< 1000 EUR", "1001-1500 EUR", 
"2001-3000 EUR", "3001-4000 EUR", "4001-5000 EUR"), class = "data.frame")

Will I need to use pivot_wider as in this example? Is it possible to use spread on multiple columns in tidyr similar to dcast?

or

tidyr::spread() with multiple keys and values

Any help would be appreciated to compare the two columns with different factors.

EJG_27
  • 111
  • 10
  • 1
    Can you supply a reproducible example of what you are trying to achieve i.e. for the given data at the start what would your expected output be? Your expected output currently has values that are not in your original dataset e.g. `C`, `D`, `E`, `F` – user63230 Nov 26 '19 at 18:00
  • 1
    the data provided, doesn't match up with your code. ex, there is no curr_sal or exp_sal in your dataset... – Shirin Yavari Nov 26 '19 at 18:54
  • I have changed the input and output to give a reproducible example. The code is for 2 columns with 1 factor – EJG_27 Nov 26 '19 at 21:14
  • are you sure your desired output makes sense? it says you have 3 `A`'s in total but you only have 2 in the input data – user63230 Nov 27 '19 at 22:27
  • Sorry made a mistake in the desired result. I am working on my solution for this problem; I hope to post something tomorrow. – EJG_27 Nov 28 '19 at 08:37

0 Answers0