3

I would like to combine two dataframes using crossing, but some have the same columnnames. For that, I would like to add "_nameofdataframe" to these columns. Here are some reproducible dataframes (dput below):

> df1
  person V1 V2 V3
1      A  1  3  3
2      B  4  4  5
3      C  2  1  1
> df2
  V2 V3
1  2  5
2  1  6
3  1  2

When I run the following code it will return duplicated column names:

library(tidyr)
crossing(df1, df2, .name_repair = "minimal")
#> # A tibble: 9 × 6
#>   person    V1    V2    V3    V2    V3
#>   <chr>  <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A          1     3     3     1     2
#> 2 A          1     3     3     1     6
#> 3 A          1     3     3     2     5
#> 4 B          4     4     5     1     2
#> 5 B          4     4     5     1     6
#> 6 B          4     4     5     2     5
#> 7 C          2     1     1     1     2
#> 8 C          2     1     1     1     6
#> 9 C          2     1     1     2     5

As you can see it returns the column names while being duplicated. My desired output should look like this:

  person V1 V2_df1 V3_df1 V2_df2 V3_df2
1      A  1      3      3      1      2
2      A  1      3      3      1      6
3      A  1      3      3      2      5
4      B  4      4      5      1      2
5      B  4      4      5      1      6
6      B  4      4      5      2      5
7      C  2      1      1      1      2
8      C  2      1      1      1      6
9      C  2      1      1      2      5

So I was wondering if anyone knows a more automatic way to give the duplicated column names a name like in the desired output above with crossing?


dput of df1 and df2:

df1 <- structure(list(person = c("A", "B", "C"), V1 = c(1, 4, 2), V2 = c(3, 
4, 1), V3 = c(3, 5, 1)), class = "data.frame", row.names = c(NA, 
-3L))

df2 <- structure(list(V2 = c(2, 1, 1), V3 = c(5, 6, 2)), class = "data.frame", row.names = c(NA, 
-3L))
Quinten
  • 35,235
  • 5
  • 20
  • 53
  • is `janitor::clean_names()` an option? `janitor::clean_names(cbind(df1, df2))` results in the follofing colnames: `"person" "v1" "v2" "v3" "v2_2" "v3_2" ` – Wimpel Dec 25 '22 at 11:31
  • Hi @Wimpel, it could be an option, but I would prefer to know from which dataframe the column is. With 1 or 2 it is not very clear, while _df1 or _df2 is more clear. Do you understand what I mean? Thanks! – Quinten Dec 25 '22 at 11:33

2 Answers2

2

As you probably know, the .name_repair parameter can take a function. The problem is crossing() only passes that function one argument, a vector of the concatenated column names() of both data frames. So we can't easily pass the names of the data frame objects to it. It seems to me that there are two solutions:

  1. Manually add the desired suffix to an anonymous function.
  2. Create a wrapper function around crossing().

1. Manually add the desired suffix to an anonymous function

We can simply supply the suffix as a character vector to the anonymous .name_repair parameter, e.g. suffix = c("_df1", "_df2").

crossing(
    df1,
    df2,
    .name_repair = \(x, suffix = c("_df1", "_df2")) {
        names_to_repair <- names(which(table(x) == 2))

        x[x %in% names_to_repair] <- paste0(
            x[x %in% names_to_repair], 
            rep(
                suffix,
                each = length(unique(names_to_repair))
            )
        )
        x
    }
)
#   person    V1 V2_df1 V3_df1 V2_df2 V3_df2
#   <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
# 1 A          1      3      3      1      2
# 2 A          1      3      3      1      6
# 3 A          1      3      3      2      5
# 4 B          4      4      5      1      2
# 5 B          4      4      5      1      6
# 6 B          4      4      5      2      5
# 7 C          2      1      1      1      2
# 8 C          2      1      1      1      6
# 9 C          2      1      1      2      5

The disadvantage of this is that there is a room for error when typing the suffix, or that we might forget to change it if we change the names of the data frames.

Also note that we are checking for names which appear twice. If one of your original data frames already has broken (duplicated) names then this function will also rename those columns. But I think it would be unwise to try to do any type of join if either data frame did not have unique column names.

2. Create a wrapper function around crossing()

This might be more in the spirit of the tidyverse. Thecrossing() docs to which you linked state crossing() is a wrapper around expand_grid(). The source for expand_grid() show that it is basically a wrapper which uses map() to apply vctrs::vec_rep() to some inputs. So if we want to add another function to the call stack, there are two ways I can think of:

Using deparse(substitute())

crossing_fix_names <- function(df_1, df_2) {
    suffixes <- paste0(
        "_",
        c(deparse(substitute(df_1)), deparse(substitute(df_2)))
    )

    crossing(
        df_1,
        df_2,
        .name_repair = \(x, suffix = suffixes) {
            names_to_repair <- names(which(table(x) == 2))

            x[x %in% names_to_repair] <- paste0(
                x[x %in% names_to_repair], 
                rep(
                    suffix,
                    each = length(unique(names_to_repair))
                )
            )
            x
        }
    )
}
# Output the same as above
crossing_fix_names(df1, df2)  

The disadvantage of this is that deparse(substitute()) is ugly and can occasionally have surprising behaviour. The advantage is we do not need to remember to manually add the suffixes.

Using match.call()

crossing_fix_names2 <- function(df_1, df_2) {
    
    args  <- as.list(match.call())

    suffixes <- paste0(
        "_",
        c(
            args$df_1, 
            args$df_2
        )
    )
    
    crossing(
        df_1,
        df_2,
        .name_repair = \(x, suffix = suffixes) {
            names_to_repair <- names(which(table(x) == 2))

            x[x %in% names_to_repair] <- paste0(
                x[x %in% names_to_repair],
                rep(
                    suffix,
                    each = length(unique(names_to_repair))
                )
            )
            x
        }
    )
}

# Also the same output
crossing_fix_names2(df1, df2)

As we don't have the drawbacks of deparse(substitute()) and we don't have to manually specify the suffix, I think this is the probably the best approach.

SamR
  • 8,826
  • 3
  • 11
  • 33
  • 2
    And a thought on single byte vs multi-byte arguments, [see @ronak-shah](https://stackoverflow.com/questions/4350440/split-data-frame-string-column-into-multiple-columns) and what would we remember some Xmas hence... – Chris Dec 25 '22 at 15:55
2

test for the condition using dputs :

colnames(df1) %in% colnames(df2)
[1] FALSE FALSE  TRUE  TRUE

rename

colnames(df2) <- paste0(colnames(df2), '_df2')

then cbind

cbind(df1,df2)
  person V1 V2 V3 V2_df2 V3_df2
1      A  1  3  3      2      5
2      B  4  4  5      1      6
3      C  2  1  1      1      2

not so elegant, but usefully discernible later.

Chris
  • 1,647
  • 1
  • 18
  • 25
  • 1
    Yeah sure you can do this - and I probably would do some variation of this - but it's no fun. Although you need to do it after the join - `cbind()` is not the same as `crossing()`. – SamR Dec 25 '22 at 15:36