Subset by all unique values in multiple columns

Question

Say I have a data frame with one column of prices and multiple columns of dates. I'd like to calculate the mean prices of all observations by each date.

I thought about subsetting with split(), but I don't know how to apply it across multiple columns.

My example would be like this:

df
  Price  Date.1    Date.2      Date.3  
1  10   10/22/22   11/23/22    12/31/22     
2  18   8/1/22     10/22/22          
3  67   8/1/22     11/23/22       
4  23   10/11/22   10/22/22   
5  38   10/22/22   11/23/22    12/31/22    
6  41   6/7/22

I would like to have mean prices by each date, so would be something like:

6/7.   mean of price at row 6
8/1    of row 2, 3
10/11. of row 4
10/22. of row 1, 2, 4, 5
11/23. of row 1, 3, 5
12/31. of row 1, 5

Thanks for any help!

Welcome to stack overflow. It's easier to help you if you make your question reproducible by including data in a useable format eg paste the output of `dput(your_data)` into the question to enable testing and verification of possible solutions. [link for guidance on asking questions](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — Peter, Jul 01 '23 at 18:56

score 0 · Answer 1 · answered Jul 01 '23 at 19:04

You can use dplyr and tidyr to change your data to a long format, eliminate entries with empty dates and then summarise the price values by Date.

df |>
  # Transform data into long format
  pivot_longer(cols = starts_with("Date"),
               names_to = "n.Date",
               values_to = "Date") |>
  # set empty dates to NA
  mutate(across(Date, ~ na_if(.x, ""))) |>
  # drop NA
  drop_na(Date) |>
  # Summarise mean Price values by date
  summarise(mean = mean(Price),
            .by = Date)

# A tibble: 6 × 2
#  Date      mean
#  <chr>    <dbl>
#1 10/22/22  22.2
#2 11/23/22  38.3
#3 12/31/22  24  
#4 8/1/22    42.5
#5 10/11/22  23  
#6 6/7/22    41

A suggestion: add the argument `values_drop_na = TRUE` in your call to `pivot_longer`, so you can obviate the need for `# set empty dates to NA mutate(across(Date, ~ na_if(.x, ""))) |> # drop NA drop_na(Date) |>` — GuedesBF, Jul 01 '23 at 22:06

score 0 · Answer 2 · answered Jul 01 '23 at 19:47

First you may want to reshape your data to long format, then aggregate the mean.

reshape(dat, varying=2:4, direction='long') |> aggregate(Price ~ Date, mean)
#       Date    Price
# 1 10/11/22 23.00000
# 2 10/22/22 22.25000
# 3 11/23/22 38.33333
# 4 12/31/22 24.00000
# 5   6/7/22 41.00000
# 6   8/1/22 42.50000

Looks even better, if you use appropriate "Date" format.

dates <- grep('^Date', names(dat))
dat[dates] <- lapply(dat[dates], as.Date, format='%m/%d/%y')

reshape(dat, varying=2:4, direction='long') |> aggregate(Price ~ Date, mean)
#        value    Price
# 1 2022-06-07 41.00000
# 2 2022-08-01 42.50000
# 3 2022-10-11 23.00000
# 4 2022-10-22 22.25000
# 5 2022-11-23 38.33333
# 6 2022-12-31 24.00000

Data:

dat <- structure(list(Price = c(10L, 18L, 67L, 23L, 38L, 41L), Date.1 = c("10/22/22", 
"8/1/22", "8/1/22", "10/11/22", "10/22/22", "6/7/22"), Date.2 = c("11/23/22", 
"10/22/22", "11/23/22", "10/22/22", "11/23/22", NA), Date.3 = c("12/31/22", 
NA, NA, NA, "12/31/22", NA)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))

Subset by all unique values in multiple columns

2 Answers2