0

I'm cleaning a data set and after cleaning duplicates, I would like to merge the rows that share the same observation in a specific column (e.g. ID column).

I am looking to merge/aggregate so that I only have one row per chosen observation (i.e. here: one row per ID) left. If possible, the aggregate row would sum-up all observations but the chosen one to merge (ID).

This would be hypothetical settings:

    set.seed(18)
    dat <- data.frame(ID=c(1,2,1,2,2,3),value=c(5,5,7,8,3,2),location=c("NY","LA","NY","LA","LA","LA"))
    dat

And I would like to know how to obtain

    set.seed(9)
    dat1 <- data.frame(id=c(1,2,3),value=c(5+7,5+8+3,2),location=c("NY","LA","LA"))
    dat1

Which aggregate with respect to ID, sum the observations "value" and pick the corresponding location.

Also, I would like to know if it's possible to group the dataframe with respect to location, such as to obtain:

    set.seed(6)
    dat2 <- data.frame(location=c("NY","LA"),value=c(5+7,5+8+3+2),meanvalue=c(mean(5+7),mean(5+8+3+2)))
    dat2

I did not put ID in this table because in this case, it does not matter: it can be summed or deleted, it's not going to be taken into account for any further computation. I know that my output for meanvalue is wrong: I am looking to get the mean of all rows sharing the same location (i.e. mean for LA and NY). I would appreciate if you also can correct me on this one.

Thank you for your help!

Arthur Lo
  • 13
  • 3
  • Could you please share some reproducible data using `dput`? – Quinten Jun 23 '22 at 17:22
  • t looks like you're new to SO; welcome to the community! If you want great answers quickly, it's best to make your question reproducible. This includes sample data like the output from `dput(head(dataObject))`, code you've tried, errors, warnings, or anything else along those lines from your attempts, and any libraries you are using. Check it out: [making R reproducible questions](https://stackoverflow.com/q/5963269). – Kat Jun 23 '22 at 17:29

1 Answers1

0

I see that you included set.seed but did not see any sampling or randomized procedures (unless I missed something).

One approach with tidyverse is the following. Let me know if this is what you had in mind.

For the first part, use group_by to sum the value based on ID and location:

library(tidyverse)

dat %>%
  group_by(ID, location) %>%
  summarise(sum_value = sum(value))

Output

     ID location sum_value
  <dbl> <chr>        <dbl>
1     1 NY              12
2     2 LA              16
3     3 LA               2

For the second part, if you group_by the location, you can then use sum and mean with summarise:

dat %>%
  group_by(location) %>%
  summarise(sum_value = sum(value), mean_value = mean(value))

Output

  location sum_value mean_value
  <chr>        <dbl>      <dbl>
1 LA              18        4.5
2 NY              12        6  
Ben
  • 28,684
  • 5
  • 23
  • 45