0

I have a dataframe that consists of ~6000 columns. Each column contains a country name that represents a node in a graph. Countries within one row are connected via an edge.

This looks like this:

df <- data.frame(ID = c(1,2,3), 
                  Country_1 = c("Germany", "Russia", "Germany"),
                  Country_2 = c(NA, "Germany", NA),
                  Country_n = c("China", "China", "China"))
ID Country_1 Country_2 ... Country_n 
1  Germany   NA        ... China     
2  Russia    Germany   ... China     
3  Germany   NA        ... China     
.
.
.

Aspired Outcome

I want to create new columns that contain the interactions between countries. NAs should be ignored.

ID  Ctr_Int_1       Ctr_Int_2    ...   Ctr_Int_n 
1  Germany-China        NA       ...      NA
2  Russia-Germany   Russia-China ...  Germany-China   
3  Germany-China        NA       ...      NA  
.
.
.

Code

I can do this step by step [1] but NA are not ignored and with increasing numbers of columns it's not really feasible anymore.

library(tidyr)
library(dplyr)

# step by step
df <- df %>% unite(CountryInt_1, Country_1, Country_2, sep = "-", remove = FALSE)
df <- df %>% unite(CountryInt_2, Country_1, Country_n, sep = "-", remove = FALSE)
df <- df %>% unite(CountryInt_3, Country_2, Country_n, sep = "-", remove = FALSE)

# remove additional columns
country_names <- paste0("Country_", 1:3)

`%ni%` <- Negate(`%in%`)
df <- subset(df,select = names(df) %ni% country_names)
ID  Ctr_Int_1       Ctr_Int_2     ...    Ctr_Int_n 
1  Germany-China    Germany-NA    ...    China-NA
2  Russia-Germany   Russia-China  ...    Germany-China   
3  Germany-NA       Germany-China ...    China-NA  
.
.
.

I assume there has to be a solution to this issue or a similar issue as it shouldn't really be unheard of but I wasn't able to find it. I guess there should be an approach using base::apply and/or something similar to what has been done here [2 and here 3] - but I'm not that familiar with data.table and wasn't able to implement it.

If someone could point me in the right direct that would certainly help.

Edit: Thanks to @NotThatKindOdr the issue with the NAs is resolved, however, the more pressing issue remains as it's not feasible to create the Country-to-Country combinations manually each time.

Seb
  • 179
  • 1
  • 8
  • 1
    Are these the edges of your nodes? – Edward May 21 '20 at 01:00
  • Sorry for the late reply! Yes, each country represents a node. The combination represent a single interaction between these countries. I didn't elaborate too much on the background because it would likely confuse more than it would help. However, the idea behind creating the country-country combinations is the option to create graphs that capture the strength of the connection regarding another item like the specific "research area" the publication is about as in my case. – Seb May 23 '20 at 11:22
  • 1
    Ok. But when you said 'nodes' it reminded me of network graphs. Are you familiar with the igraph package? – Edward May 23 '20 at 23:24
  • Yes, I know about it but I'm not as familiar as I'd like to be. From my understanding, it offers a lot on network analysis but less on the data wrangling part that necessarily has to happen before ... at least if you have raw data. – Seb May 24 '20 at 16:11

2 Answers2

3

An option using data.table:

library(data.table)
dcast(
    melt(setDT(df), id.vars="ID", na.rm=TRUE)[, 
        combn(value, 2L, function(x) paste(x, collapse="-")), ID][,
            ri := paste0("Ctr_Int_", rowid(ID))],
    ID ~ ri, value.var="V1")

output:

   ID      Ctr_Int_1    Ctr_Int_2     Ctr_Int_3
1:  1  Germany-China         <NA>          <NA>
2:  2 Russia-Germany Russia-China Germany-China
3:  3  Germany-China         <NA>          <NA>
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
0

It won't ignore the NA but it will replace any combined NA with NA itself

df %>% mutate_all(~ifelse(str_detect(., "NA"), NA, .))
NotThatKindODr
  • 729
  • 4
  • 14
  • Thanks! That's already very helpful! Any take on the issue with the combinations? – Seb May 20 '20 at 15:08
  • 1
    As in why you have Germany-China in the 3rd row, 2nd column instead of 1st column? I honestly would probably `pivot_longer` to bring all interactions into the same column and just remove the NA. It depends on what your use case is though. – NotThatKindODr May 20 '20 at 15:22
  • No, about how to simplify the creation of the combinations. I have ~6000 columns so I cannot do it manually as I have up until now. – Seb May 20 '20 at 15:49
  • 1
    I'll have to think about it, I'll try to get you something later today – NotThatKindODr May 20 '20 at 16:06