
I have data from multiple files containing hourly observations for different variables. I'm trying to combine them into a single dataframe where I have every observation for each variable over the whole time interval. Some variables have data in several dataframes covering different time intervals. Each dataframe has a time column that can be used for joining.

The problem is that full_join creates more rows than my data has hours (df1). Instead I would like to get a dataframe like df2, without the duplicated hours and the extra NA values they introduce. One solution is to join the dataframes in a specific order, but I'm hoping for a more general solution that scales to combining many datasets of this type.

library(tidyverse)
a <- data.frame(hour = 1:10, x = 5)
b <- data.frame(hour = 1:15, y = 15)
c <- data.frame(hour = 11:20, x = 10)

list <- list(a,b,c)

df1 <- list %>%
  reduce(full_join)

df1
   hour  x  y
1     1  5 15
2     2  5 15
3     3  5 15
4     4  5 15
5     5  5 15
6     6  5 15
7     7  5 15
8     8  5 15
9     9  5 15
10   10  5 15
11   11 NA 15
12   12 NA 15
13   13 NA 15
14   14 NA 15
15   15 NA 15
16   11 10 NA
17   12 10 NA
18   13 10 NA
19   14 10 NA
20   15 10 NA
21   16 10 NA
22   17 10 NA
23   18 10 NA
24   19 10 NA
25   20 10 NA

df2 <- full_join(a,c) %>%
  full_join(b)

df2
   hour  x  y
1     1  5 15
2     2  5 15
3     3  5 15
4     4  5 15
5     5  5 15
6     6  5 15
7     7  5 15
8     8  5 15
9     9  5 15
10   10  5 15
11   11 10 15
12   12 10 15
13   13 10 15
14   14 10 15
15   15 10 15
16   16 10 NA
17   17 10 NA
18   18 10 NA
19   19 10 NA
20   20 10 NA


EDIT: I got the correct result with tidyverse using the following code:

df <- list %>%
  reduce(full_join) %>%
  group_by(hour) %>%
  summarise_all(mean, na.rm = T)

With data.table, melt and dcast also seem to work:

library(data.table)

dt1 <- as.data.table(df1)
m <- as.data.table(melt(dt1, id.vars = "hour"))
dc <- dcast.data.table(m, hour ~ variable, fun.aggregate = mean, na.rm = T)
df <- as.data.frame(dc)

I guess mean could be replaced with some other aggregation function, but I couldn't figure out which.
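A sketch of one possibility (assuming dplyr >= 1.0, where across() supersedes summarise_all()): take the first non-missing value per hour instead of the mean, which also yields NA rather than NaN where a variable is entirely missing.

# Sketch: keep the first non-missing value per hour instead of averaging
df <- list %>%
  reduce(full_join) %>%
  group_by(hour) %>%
  summarise(across(everything(), ~ first(na.omit(.x))))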

SPI_4324
  • How is the join supposed to handle the case where `b` contains also information of `x`? For example `b <- data.frame(hour = 1:15, x = 15, y = 15)`. In this case you would have multiple values for `x` on `hour %in% 1:10`. This is relevant for a general approach for this I guess. – Therkel Apr 13 '23 at 07:44
  • Either way, perhaps you have to left join all information onto a new created dataset that just contains all hours? `library(magrittr); lapply(list,extract2,"hour") %>% unlist %>% unique` – Therkel Apr 13 '23 at 07:48
  • Please don't insert "EDIT"s/"UPDATE"s, just make your post the best presentation as of edit time. But--Please don't edit a question in a way that invalidates reasonable posted answers. – philipxy Apr 14 '23 at 22:53
  • Please before considering posting: Pin down code issues via [mre]. Read manuals/references & google error messages & many clear, concise & precise phrasings of your question/problem/goal, with & without your particular names/strings/numbers, 'site:stackoverflow.com' & tags; read many answers. SO/SE search is poor & literal & unusual, read the help. Google re googling/searching, including Q&A at [meta] & [meta.se]. [How much research effort is expected of Stack Overflow users?](https://meta.stackoverflow.com/q/261592/3404097) [ask] [Help] Reflect research in posts. – philipxy Apr 14 '23 at 22:53

2 Answers


First, you need to specify the by argument in the join. Here is a solution using coalesce_join from E. Visel's blog post. I slightly modified it to apply coalesce only when there are common columns in the two dataframes being joined; otherwise it fails because the coalesced dataframe is empty.

library(tidyverse)

a <- data.frame(hour = 1:10, x = 5)
b <- data.frame(hour = 1:15, y = 15)
c <- data.frame(hour = 11:20, x = 10)

l <- list(a,b,c)

coalesce_join <- function(x, y,
                          by = NULL, suffix = c(".x", ".y"),
                          join = dplyr::full_join, ...) {
  joined <- join(x, y, by = by, suffix = suffix, ...)
  # names of desired output
  cols <- union(names(x), names(y))
  common_cols <- setdiff(intersect(names(x), names(y)), by)
  if (length(common_cols) > 0) {
    # if there are common columns, we need to coalesce them
    to_coalesce <- names(joined)[!names(joined) %in% cols]
    suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
    # remove suffixes and deduplicate
    to_coalesce <- unique(substr(
      to_coalesce,
      1,
      nchar(to_coalesce) - nchar(suffix_used)
    ))

    coalesced <- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
      joined[[paste0(.x, suffix[1])]],
      joined[[paste0(.x, suffix[2])]]
    ))
    names(coalesced) <- to_coalesce

    joined <- dplyr::bind_cols(joined, coalesced)[cols]
  }
  return(joined)
}

reduce(l, coalesce_join, by="hour")

This snippet gives me the desired output without having to spell out the column names.

NB: According to the documentation, coalesce finds the first non-missing element at each position, so be aware of this behaviour if there are overlapping values of x in several dataframes.
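For illustration (a minimal sketch, not taken from the question's data), this is how coalesce resolves a conflict: wherever both inputs have a value, the first one wins.

# coalesce() takes the first non-missing value position-wise,
# so if two sources disagree, the left-hand (earlier) source wins
dplyr::coalesce(c(1, NA, NA), c(9, 2, NA))
# [1]  1  2 NA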


NB: I renamed your list object to l. I believe it is not good for code clarity to mask base functions like list.

paulduf
  • Finally, there are already [plenty of StackO threads about joining dataframes](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right). – paulduf Apr 13 '23 at 07:42
  • Thanks for the comment. I should add that I would like to get a single x column (identical to df2) instead of x.x and x.y columns as the x is the same variable in both dataframes (just different time interval). – SPI_4324 Apr 13 '23 at 07:44
  • A note to your reassign comment: the function `list` is not reassigned here, it is just overloaded. `append(list,list(d,e))` still works like expected. `list <- function() {...}` would be a reassign. – Therkel Apr 13 '23 at 07:54
  • I edited my post, `coalesce` looks like the right function to use here. The solution is not yet fully automated, as you have to specify column names in `mutate`. Googling around, E. Visel proposes [in a blog post](https://alistaire.rbind.io/blog/coalescing-joins/) the `coalesce_join` function. This might be what you are looking for. – paulduf Apr 13 '23 at 08:19
  • I tried group_by and summarise_all with na.omit which removes some rows. `df1 <- list %>% reduce(full_join) %>% group_by(hour) %>% summarise_all(na.omit)` I will try coalesce next but I would like a solution where I don't have to name the columns that have to be combined. – SPI_4324 Apr 13 '23 at 08:26
  • OK, from the blog post mentioned above I managed to obtain something that works in your case without naming the columns. I edited my answer. – paulduf Apr 13 '23 at 14:45
  • Please don't insert "EDIT"s/"UPDATE"s, just make your post the best presentation as of edit time. Frequently-duplicate questions should be closed, not answered. – philipxy Apr 14 '23 at 22:50

Here is a version where the primary key can be a composite (for example hour and group). Unlike the solution you added to the question, it does not aggregate values.

## Identify (composite) primary key 
mykeys <- c("hour")

## Using the data.table package below
library(data.table)

## Convert each data.frame in the list to a data.table (setDT works by reference)
list %>%
    lapply(setDT)

## Create a dataset with all unique combinations of the keys to join onto
full <- lapply(list, function(dt) dt[,..mykeys]) %>%
    rbindlist %>%
    unique

## Can make the keys as actual keys on the full dataset but not necessary
# setkeyv(full,mykeys)

## Join all variables onto the full dataset.
## The preserved values of duplicates are from the last dataset in the order of 
## the list object. This step is done by reference.
list %>%
    lapply(function(dt) {
        n <- names(dt)
        full[dt,
             on = mykeys,
             (n) := mget(paste0("i.", n))]
    })
full

Here is an example dataset with a composite primary key and more columns to help identify what is going on. Again, note that the order of the datasets in list matters.

mykeys <- c("hour"
            ,"group"
)

a <- data.frame(hour = 1:10, 
                group = 1,
                x = 5, z = 1, a = 1)
b <- data.frame(hour = 1:15, 
                group = 1,
                y = 15, z = 2, b = 1)
c <- data.frame(hour = 11:20, 
                group = 1,
                # group = 2,
                x = 10, z = 3, c = 1)
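
A quick sketch of re-running the same steps on this composite-key data (l2 is just a hypothetical name for the new list; the code is otherwise the one shown above):

l2 <- list(a, b, c)
lapply(l2, setDT)

full <- lapply(l2, function(dt) dt[, ..mykeys]) %>%
    rbindlist %>%
    unique

## Later datasets in l2 overwrite earlier ones for duplicated keys,
## e.g. for hours 11:15 z comes from c (z = 3) rather than b (z = 2)
l2 %>%
    lapply(function(dt) {
        n <- names(dt)
        full[dt,
             on = mykeys,
             (n) := mget(paste0("i.", n))]
    })
full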
Therkel