0

I am very new to this, and most guides either don't return the desired result or go right over my head. group_by followed by summarise allows me to compute mean / median from these rows, but the returning table does not have a reduced number of rows.

Sample of my data attached.

structure(list(S = c("Balaclava", "Balaclava", "Carnegie",  "Carnegie"), Rn = c(3, 2, 2, 2), T = c("h", "u", "t", "u" ), P = c(1690000, 540000, 795000, 6e+05), M = c("S",  "VB", "S", "SP"), D = c(6.6, 6.6, 11.4, 11.4), BR = c(3,  2, 2, 2), BT = c(2, 1, 2, 1), C = c(2, 1, 1, 1), L = c(339,  483, 133, 73), BA = c(159, 51, 104, 61), YB = c(1890,  1970, 2009, 1970)), row.names = c(NA, -4L), class = c("tbl_df",  "tbl", "data.frame"))
data2 <- data1 %>%
   group_by(S) %>%
   summarize(MRn = median(Rn), 
             APA = mean(P), 
             AAA = mean(BA), 
             AAL = mean(L), YB) %>%
   arrange(desc(MRn))

data2

I want to group rows which have column S in common, and generate mean / median values for columns to the right of S, into one row per S entry. The resulting "groups" need to be used towards plotting.

DariusMT7
  • 23
  • 4
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Feb 21 '23 at 15:52
  • Edited original post to add dput of my data. – DariusMT7 Feb 21 '23 at 18:24

1 Answers1

0

The reason summarize should reduce the number of rows is because it is usually used with functions (like mean or median) that return 1 value for the whole group, resulting in 1 row for that group. You do that here, but then at the end you tell it to also return YB (I'm assuming that's what you mean, since there isn't a Y in the data) without any transformation.

If you look at the output, you'll see that you did generate 1 row per group, but then it was duplicated to allow both values of YB to be preserved:

  S           MRn     APA   AAA   AAL    YB
  <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl>
1 Balaclava   2.5 1115000 105     411  1890
2 Balaclava   2.5 1115000 105     411  1970
3 Carnegie    2    697500  82.5   103  2009
4 Carnegie    2    697500  82.5   103  1970

To get the result you want, either remove YB, or apply a function to it that returns a single value (like min or paste0):

data1 %>%
    group_by(S) %>%
    summarize(MRn = median(Rn), 
              APA = mean(P), 
              AAA = mean(BA), 
              AAL = mean(L)) %>%
    arrange(desc(MRn))

# A tibble: 2 × 5
  S           MRn     APA   AAA   AAL
  <chr>     <dbl>   <dbl> <dbl> <dbl>
1 Balaclava   2.5 1115000 105     411
2 Carnegie    2    697500  82.5   103

data1 %>%
    group_by(S) %>%
    summarize(MRn = median(Rn), 
              APA = mean(P), 
              AAA = mean(BA), 
              AAL = mean(L),
              YB  = paste0(YB, collapse=',')) %>%
    arrange(desc(MRn))
# A tibble: 2 × 6
  S           MRn     APA   AAA   AAL YB       
  <chr>     <dbl>   <dbl> <dbl> <dbl> <chr>    
1 Balaclava   2.5 1115000 105     411 1890,1970
2 Carnegie    2    697500  82.5   103 2009,1970

divibisan
  • 11,659
  • 11
  • 40
  • 58
  • This is beginning to make sense, thank you. In the case that I mean to preserve the data in column YB but have too many entries, is there a way to define a range from the lowest to the highest? – DariusMT7 Feb 21 '23 at 18:34
  • Are you looking for a way to show the minimum and maximum value of YB? Sure, but it depends on how you want to do it. You could just paste the `min` and `max` values together instead of all values. Or you could make separate variables: `YB_min=min(YB)`. Or if you want to do calculations with it, you can use the `lubridate` package to make [an interval object](https://stackoverflow.com/questions/41497351/check-if-a-date-is-within-an-interval-in-r) with `interval(parse_date_time(min(YB), '%Y'), parse_date_time(max(YB), '%Y'))` – divibisan Feb 21 '23 at 18:43