2

I'd like to see if there is a more compact way to complete the following task using DPLYR. I want to go from this:

id  name    value   average max min
1   chris   8           
1   chris   5           
1   chris   3           
1                   
1           2           
2   tom     12          
2   tom     6           
2   tom     4           
2   tom             
2                   

to this:

id  name    value   average max min
1   chris   8       4.5     8   2
1   chris   5       4.5     8   2
1   chris   3       4.5     8   2
1                   4.5     8   2
1           2       4.5     8   2
2   tom     12      7.3     12  4
2   tom     6       7.3     12  4
2   tom     4       7.3     12  4
2   tom             7.3     12  4
2                   7.3     12  4

The values are grouped by id. Right now I do the following series of commands:

Step 1: take the average by id and create a new df:

library(dplyr)
new_df <- df %>%
  group_by(id) %>%
  summarise_each(funs(mean(value, na.rm=TRUE)))

Step 2: Remove all variables except average in new_df

Step 3: Merge new_df back to master

Is there a way to direct the results from DPLYR directly into a variable based on an ID? Thank you.

New Code

df <- df %>%
  group_by(id) %>%
  mutate_each(funs(average = mean(value, na.rm = TRUE),
                   min = min(value, na.rm = TRUE),
                   max = max(value, na.rm = TRUE)))

Data Frame Code

id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2) 
name = c("chris", "chris", "chris", "", "", "tom","tom","tom","tom","") 
value = c(8,5,3,"",2,12,6,4,"","") 
average = c(1:10)
max = c("","","","","","","","","","")
min = c("","","","","","","","","","")
df = data.frame(id, name, value, average, max, min)
DCRubyHound
  • 333
  • 2
  • 11
  • Please share your data in a [reproducible format](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Not sure what those blank values are supposed to represent. Are they NA? or a space? or a zero-length string? – MrFlick Dec 08 '16 at 17:00
  • 1
    Hi @MrFlick, I added the data frame code above. Thank you. – DCRubyHound Dec 08 '16 at 19:09

3 Answers3

4

Probably like this:

library(dplyr)
df <- df %>%
  group_by(id) %>%
  mutate(average = mean(value, na.rm = TRUE),
         min = min(value, na.rm = TRUE),
         max = max(value, na.rm = TRUE))
  • thanks, @Wojciech. This gets the values into the right columns, but it's doing calculations based on all the values and not the group id. – DCRubyHound Dec 08 '16 at 17:28
  • Well, copying your input data and executing my code gives exactly your desired output – Wojciech Książek Dec 08 '16 at 17:37
  • thanks, @wojciech. I tried your code a couple more times and am still getting it to only report averages, mins and maxs for all values not for two different ids (1 and 2). Through some trial and error I think I figured out what was wrong (at least on my end) and when I add mutate_each, it will break up the calculations by group, but it also creates many new variables in the process. Any idea how to get the new code(above) to publish results in the three assigned columns? Thank you. – DCRubyHound Dec 08 '16 at 18:33
3

To avoid the factor trap, use the tibble wrapper data_frame to build the data frame.

df = data_frame(id, name, value, average, max, min)

since the value column is character typed due to the presence of "", it needs to be turned to a numeric. Happily, this also transforms your ""s into NA's. then (in just one step) screen out the NA's from the calculations with the mutate/summarise friendly na.omit().

df1 <- 
    select(df, id, value) %>%
    mutate(value = as.numeric(value)) %>%
    na.omit %>% 
    group_by(id) %>% 
    summarise(average = mean(value),
              max = max(value),
              min = min(value)
             ) %>% 
    right_join(select(df, id, name, value), .)

> df1
# A tibble: 10 × 6
      id  name value  average   max   min
   <dbl> <chr> <chr>    <dbl> <dbl> <dbl>
1      1 chris     8 4.500000     8     2
2      1 chris     5 4.500000     8     2
3      1 chris     3 4.500000     8     2
4      1             4.500000     8     2
5      1           2 4.500000     8     2
6      2   tom    12 7.333333    12     4
7      2   tom     6 7.333333    12     4
8      2   tom     4 7.333333    12     4
9      2   tom       7.333333    12     4
10     2             7.333333    12     4
leerssej
  • 14,260
  • 6
  • 48
  • 57
  • My pleasure! Fun to wrangle, and happy to help! :-D – leerssej Dec 09 '16 at 20:07
  • @leerseej. Quick question for you. Most of the time, this code works perfectly, but every once in a while I get the following error message: No common variables. Please specify `by` param. When I restart R and run the code again, I don't receive the error. Any idea how to avoid this error? – DCRubyHound Dec 12 '16 at 16:18
  • Hmm. It sounds like you have two dataframes with the same name but different variables? Maybe a temp file in a sequence is lingering around and attempting to be folded in with the new cycle? Have you tried clearing the environment, with the broom icon in R studio, or by using `rm(list = ls())`. It will wipe your slates clean without having to restart R entirely. – leerssej Dec 12 '16 at 17:51
  • yes, it still persist even after clearing the environment. Also, it only pops up on my PC at work, but not on my personal computer (mac). So odd. Thank you. – DCRubyHound Dec 12 '16 at 18:34
  • There wouldn't be a `library(plyr)` getting loaded by any chance? Wild guess, but this is often a source of creation for fantastical oddities. – leerssej Dec 18 '16 at 18:12
0

How about:

library(dplyr)
df %>%
  group_by(id) %>%
  mutate(average = mean(value, na.rm = TRUE),
         min = min(value, na.rm = TRUE),
         max = max(value, na.rm = TRUE)) %>% 
  left_join(df)
Jacob
  • 3,437
  • 3
  • 18
  • 31