1

Here is my Data.frame

New = (data.frame(ID=c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,4,4), 
                DC=c("Qualx","lmx", "lmx","lmx","lmx", "Qualx","Qualx","Qualx",
                     "lmx","lmx", "lmx", "Qualx", "Qualx","Qualx","Qualx","Qualx","lmx", "Qualx", "Qualx", "Qualx")))

Now I would like to group by (ID,DC), and then extract counts or frequencies(percent * 100 format)

My approach using dplyr:

New1 <- New %>%
  group_by(ID,DC) %>%
  mutate(count=n())%>%
  mutate(freq = count / sum(count))

However, my freq column seems to be displaying wrong information.

Once I do get my frequency values, I would like to mutate again, and get another column based on ifelse - something like:

%>% mutate(n = ifelse(freq == .5, DC, 'Unknown')

however, when I perform the above operation, I keep running into various errors.

I also tried:

D_F_P <- New %>%
  group_by(ID,DC) %>% 
  table() %>% 
  data.frame() %>% 
  mutate(freq = Freq / sum(Freq)) %>% 
  mutate(assign = ifelse(freq == .1, DC, 'Unknown'))

The above operation provides a numeric value for 'assign' column instead of returning the string value present in DC column, like this:

      ID    DC Freq freq  assign
1  1   lmx    5 0.25 Unknown
2  2   lmx    2 0.10       1
3  3   lmx    1 0.05 Unknown
4  4   lmx    0 0.00 Unknown
5  1 Qualx    4 0.20 Unknown
6  2 Qualx    5 0.25 Unknown
7  3 Qualx    1 0.05 Unknown
8  4 Qualx    2 0.10       2

Instead I want it to display

      ID    DC Freq freq  assign
1  1   lmx    5 0.25 Unknown
2  2   lmx    2 0.10     lmx
3  3   lmx    1 0.05 Unknown
4  4   lmx    0 0.00 Unknown
5  1 Qualx    4 0.20 Unknown
6  2 Qualx    5 0.25 Unknown
7  3 Qualx    1 0.05 Unknown
8  4 Qualx    2 0.10   Qualx

My main goal is to group by (ID,CD), then get frequencies (percentage*100), then use an ifelse statement that returns values in DC column. Any help would be appreciated. You don't even have to use my approach, any approach related to 'dplyr' in your personal way would also be appreciated. Thank you

fugu
  • 6,417
  • 5
  • 40
  • 75

2 Answers2

2

DC is a factor, use as.character, or better, fix it upstream, and you'll get the desired output:

New %>%
  group_by(ID,DC) %>%
  table() %>%
  data.frame() %>%
  mutate(freq = Freq / sum(Freq))%>%
  mutate(assign = ifelse(freq == .1, as.character(DC), 'Unknown'))

# ID    DC Freq freq  assign
# 1  1   lmx    5 0.25 Unknown
# 2  2   lmx    2 0.10     lmx
# 3  3   lmx    1 0.05 Unknown
# 4  4   lmx    0 0.00 Unknown
# 5  1 Qualx    4 0.20 Unknown
# 6  2 Qualx    5 0.25 Unknown
# 7  3 Qualx    1 0.05 Unknown
# 8  4 Qualx    2 0.10   Qualx

You can use dplyr::if_else instead of ifelse in the future to have these situations trigger an explicit error.

moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
0

Slightly different than @Moody's, using summarize instead, then you need to ungroup as to remove the remaining group_by group, which would be ID

New = data.frame(ID=c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,4,4), 
                  DC=c("Qualx","lmx", "lmx","lmx","lmx", "Qualx","Qualx","Qualx",
                       "lmx","lmx", "lmx", "Qualx", "Qualx","Qualx","Qualx","Qualx","lmx", "Qualx", "Qualx", "Qualx"),
       stringsAsFactors = FALSE)

New %>%
  group_by(ID, DC) %>%
  summarize(count = n()) %>%
  ungroup() %>%
  mutate(freq = count / sum(count),
         n = if_else(freq == 0.1, DC, "Unkown"))

# A tibble: 7 x 5
     ID DC    count  freq n     
  <dbl> <chr> <int> <dbl> <chr> 
1     1 lmx       5  0.25 Unkown
2     1 Qualx     4  0.2  Unkown
3     2 lmx       2  0.1  lmx   
4     2 Qualx     5  0.25 Unkown
5     3 lmx       1  0.05 Unkown
6     3 Qualx     1  0.05 Unkown
7     4 Qualx     2  0.1  Qualx 

edit per OP's Comment

New %>%
  group_by(ID, DC) %>%
  summarize(count = n()) %>%
  mutate(freq = count / sum(count),
         n = if_else(freq == 0.1, DC, "Unkown"))

# A tibble: 7 x 5
# Groups:   ID [4]
     ID DC    count  freq n     
  <dbl> <chr> <int> <dbl> <chr> 
1     1 lmx       5 0.556 Unkown
2     1 Qualx     4 0.444 Unkown
3     2 lmx       2 0.286 Unkown
4     2 Qualx     5 0.714 Unkown
5     3 lmx       1 0.5   Unkown
6     3 Qualx     1 0.5   Unkown
7     4 Qualx     2 1     Unkown

Note that the summarize statement removes a layer of grouping, so the resulting data.frame will be grouped by ID.

zack
  • 5,205
  • 1
  • 19
  • 25
  • I was only solving the issue with `factors` and didn't try to optimize the code, if you must I suggest this, using `count` in a more compact way and using `replace` as it's a bit cleaner imo : `New %>% count(ID, DC) %>% mutate(freq = n / sum(n), assign = replace(DC, freq != 0.1, "Unkown"))` – moodymudskipper Jun 22 '18 at 15:33
  • cool solution, but I get an error when assigning `"unknown"` as it's not a valid factor level. – zack Jun 22 '18 at 15:47
  • either you use the `as.character` of my answer or you use your definition of the `df`. And it should work fine – moodymudskipper Jun 22 '18 at 15:56
  • ah right - I misinterpreted what your prior comment was getting at, for what it's worth - it looks like the `summarize` version is a little faster, but likely won't make a difference at the intended scale. – zack Jun 22 '18 at 15:58
  • Guys these are great responses, but i don't want to un-group them because I will be adding more ifelse statements that will be carried out within each groups. Also why is the frequency value not right. Within group ID frequency should be (5 / (5+4)) = 0.55556 not 0.25 (for the first line), second line should be (4 / (5+4)) = 0.4444444 etc. maybe I should I mentioned it again. I definitely need to have these going within groups. Again thanks for the help so far. – Jason Smith Jun 25 '18 at 18:21
  • We were replicating the desired output that was specified in the original post - which doesn't look like it does within-group frequencies. You can just remove the `ungroup()` statement to get the within `ID` frequency. I've edited the answer to what I think you're looking for. – zack Jun 25 '18 at 18:37
  • Thanks Zack, probably the answer I want. Any idea why I get the error Error: This function should not be called directly. ? – Jason Smith Jun 25 '18 at 18:52
  • Not sure, but a quick google makes me think masked function. [Check out this thread](https://stackoverflow.com/questions/22801153/dplyr-error-in-n-function-should-not-be-called-directly) . If it's `summarize`, try `dplyr::summarize` – zack Jun 25 '18 at 18:57
  • Thanks Zack. The following worked: detach("package:plyr", unload=TRUE) library(dplyr) New %>% group_by(ID, DC) %>% dplyr::summarise(count = n()) %>% mutate(freq = count / sum(count), n = ifelse(freq == 0.5, as.character(DC), "Unkown")) – Jason Smith Jun 25 '18 at 19:08
  • glad to hear it, happy coding – zack Jun 25 '18 at 19:09