Joining multiple tables with mix data

Question

My data:

data1 <- data.frame(from = c(1, 2, 13, 4),
                    to = c(4, 3, 9, 1),
                    values = c(12, 56, 67, 78)) 

data2 <- data.frame(place = c("NY", "London", "Brest", "Nantes"),
                    id = c(1, 2, 3, 4))

My results:

from to values
   1  4     12
   2  3     56
  13  9     67
   4  1     78

  place id
     NY  1
 London  2
  Brest  3
 Nantes  4

What I expect using join function from dplyr package (in a new table)

from      to      values
NY     Nantes     12
London  Brest     56
London     NY     78

What I tried:

 data3<- inner_join (data1, data2, by =c("from" = "id", "to" = "id"))
 data3

some references:
https://stat545-ubc.github.io/bit001_dplyr-cheatsheet.html
https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html

A bigger example with mix data

Consider that i have 50 columns with geographical data ("places") and non geographical data (levels, values)
I doesnt wish to change the columns'order of my d.f.
I want to keep my columns names

   data1 <- data.frame(levels1 = c("name1", "name2", "name3", "name4"),
                       value1 = c(4, 3, 9, 1),
                       firstplace = c(1, 2, 13, 4),  
                       secondplace = c(1, 2, 2, 4),
                       value2  = c(78, 3000, 90, 101),
                       thirdplace =c(1, 1, 2, 4),
                       fourthplace=c(4, 4, 4, 4),
                       fifthplace=c(1, 2, 3, 4), 
                       value3 = c(12, 56, 67, 78))

   data2 <- data.frame(place = c("NY", "London", "Brest", "Nantes"),
                    id = c(1, 2, 3, 4))

A example with different names (more complex?)

I doesnt wish to change the columns'order of my d.f.
I want to keep my columns names

   data1 <- data.frame(levels1 = c("name1", "name2", "name3", "name4"),
                       value1 = c(4, 3, 9, 1),
                       shops= c(1, 2, 13, 4),  
                       after_sales_service = c(1, 2, 2, 4),
                       value2  = c(78, 3000, 90, 101),
                       provider =c(1, 1, 2, 4),
                       seller=c(4, 4, 4, 4),
                       maker=c(1, 2, 3, 4), 
                       value3 = c(12, 56, 67, 78))

   data2 <- data.frame(place = c("NY", "London", "Brest", "Nantes"),
                    id = c(1, 2, 3, 4))

Relevant post: [How to join (merge) data frames (inner, outer, left, right)?](http://stackoverflow.com/questions/1299871) — zx8754, Feb 06 '16 at 16:42
I changed my title and I edited my question: my problem is related to the size of my table (100 columns) and the presence of data to be modified with a second table and data to keep. — Wilcar, Feb 07 '16 at 08:37

Jaap · Accepted Answer · 2016-02-07T09:07:53.820

Instead of joining, you can use data2 as a look-up table:

library(dplyr)
data1 <- data1 %>% 
  mutate(from = data2$place[match(from, data2$id)],
         to = data2$place[match(to, data2$id)]) %>%
  filter(complete.cases(.))

gives:

> data1
    from     to values
1     NY Nantes     12
2 London  Brest     56
3 Nantes     NY     78

An alternative solution with the data.table package:

library(data.table)
na.omit(setDT(data1)[, `:=` (from = data2$place[match(from, data2$id)],
                             to = data2$place[match(to, data2$id)])])

You could also do a double left_join:

data1 %>% 
  left_join(., data2, by = c("from"="id")) %>%
  left_join(., data2, by = c("to"="id")) %>%
  select(-c(1:2)) %>%
  filter(complete.cases(.))

Update 1: If you have multiple columns for which the names have to be matched, it is probably better to transform your dataframe to long form first. An example with the bigger dataset:

library(dplyr)
library(tidyr)
data1 %>%
  gather(var, val, -values) %>%
  left_join(., data2, by = c("val"="id")) %>%
  select(-3) %>%
  filter(!is.na(place)) %>% 
  spread(var, place)

which gives:

  values fifthplace firstplace fourthplace   from secondplace thirdplace     to
1     12         NY         NY      Nantes     NY          NY         NY Nantes
2     56     London     London      Nantes London      London         NY  Brest
3     67      Brest       <NA>      Nantes   <NA>      London     London   <NA>
4     78     Nantes     Nantes      Nantes Nantes      Nantes     Nantes     NY

With data.table you could do:

library(data.table)
dcast(melt(setDT(data1),
           id.vars = "values")[data2, on = c(value="id")],
      values ~ variable, value.var = "place")

giving you the same result.

Update 2: In response to the second update of the question, you can approach it as follows with dplyr / tidyr:

data1 %>%
  gather(var, val, c(firstplace,secondplace,thirdplace,fourthplace,fifthplace)) %>%
  left_join(., data2, by = c("val"="id")) %>%
  select(-val) %>%
  spread(var, place)

which gives:

  levels1 value1 value2 value3 fifthplace firstplace fourthplace secondplace thirdplace
1   name1      4     78     12         NY         NY      Nantes          NY         NY
2   name2      3   3000     56     London     London      Nantes      London         NY
3   name3      9     90     67      Brest       <NA>      Nantes      London     London
4   name4      1    101     78     Nantes     Nantes      Nantes      Nantes     Nantes

Or with data.table:

mvars <- c("firstplace","secondplace","thirdplace","fourthplace","fifthplace")
dcast(melt(setDT(data1),
           measure.vars = mvars)[data2, on = c(value="id")],
      levels1 + value1 + value2 + value3 ~ variable, value.var = "place")

which gives the same result:

   levels1 value1 value2 value3 firstplace secondplace thirdplace fourthplace fifthplace
1:   name1      4     78     12         NY          NY         NY      Nantes         NY
2:   name2      3   3000     56     London      London         NY      Nantes     London
3:   name3      9     90     67         NA      London     London      Nantes      Brest
4:   name4      1    101     78     Nantes      Nantes     Nantes      Nantes     Nantes

Update 3: If you want to work with index numbers, you can do:

# dplyr / tidyr
data1 %>%
  gather(var, val, c(3,4,6:8)) %>%
  left_join(., data2, by = c("val"="id")) %>%
  select(-val) %>%
  spread(var, place)

# data.table
dcast(melt(setDT(data1),
           measure.vars = c(3,4,6:8))[data2, on = c(value="id")],
      levels1 + value1 + value2 + value3 ~ variable, value.var = "place")

which gives (data.table output):

   levels1 value1 value2 value3  shops after_sales_service provider seller  maker
1:   name1      4     78     12     NY                  NY       NY Nantes     NY
2:   name2      3   3000     56 London              London       NY Nantes London
3:   name3      9     90     67     NA              London   London Nantes  Brest
4:   name4      1    101     78 Nantes              Nantes   Nantes Nantes Nantes

I agree, but a need to keep non-geotagged data from my data.frame. My data.frame is a mix of 50% off geotagged data (names in data2 table) and 50% of no geotagged data... — Wilcar, Feb 06 '16 at 17:53
it works great. First remark: how to use the index numbers of columns (easier to work with many columns)? 2nd remark how to preserve the order of the columns? — Wilcar, Feb 07 '16 at 08:58

akrun · Answer 2 · 2016-02-07T09:01:00.737

2

We could do this with base R

merge(merge(data1, data2, by.x='from', by.y= 'id'),
           data2, by.x='to', by.y='id')[-(1:2)]

Update

For your new dataset, we can just use match after converting the dataset to matrix without the 'values' column

temp <- as.data.frame(`dim<-`(as.character(data2$place[
          match(as.matrix(data1[-ncol(data1)]),  data2$id)]),
                        dim(data1[-ncol(data1)])))
names(temp) <- head(names(data1),-1)
cbind(data1[ncol(data1)], temp)
#  values   from     to firstplace secondplace thirdplace fourthplace fifthplace
#1     12     NY Nantes         NY          NY         NY      Nantes         NY
#2     56 London  Brest     London      London         NY      Nantes     London
#3     67   <NA>   <NA>       <NA>      London     London      Nantes      Brest
#4     78 Nantes     NY     Nantes      Nantes     Nantes      Nantes     Nantes

Update2

Based on the new update in the OP's post

i1 <- grep('place', names(data1))
d1 <- as.data.frame(`dim<-`(as.character(data2$place[
        match(as.matrix(data1[i1]), data2$id)]), 
          dim(data1[i1])), stringsAsFactors=FALSE)
d2 <- cbind(data1[-i1], setNames(d1, paste0('place', 1:ncol(d1))))
d2
#   levels1 value1 value2 value3 place1 place2 place3 place4 place5
#1   name1      4     78     12     NY     NY     NY Nantes     NY
#2   name2      3   3000     56 London London     NY Nantes London
#3   name3      9     90     67   <NA> London London Nantes  Brest
#4   name4      1    101     78 Nantes Nantes Nantes Nantes Nantes

Update3

If the column names are different, just change the 2nd step

 d2 <- cbind(data1[-i1], setNames(d1, names(data1[i1])))

edited Feb 07 '16 at 09:01

answered Feb 06 '16 at 16:13

akrun

874,273
37
540
662

I have more than 50 columns to join in *data1* with **id** in *data2*. what is the best way to do this? – Wilcar Feb 06 '16 at 16:59
@Wilcar It is not clear from your description. Are you saying that there are 50 `from/to` columns in 'data1'? – akrun Feb 06 '16 at 17:04
I have many columns to geotag from *id*. The columns called *from /to* were a toy sample... – Wilcar Feb 06 '16 at 17:17
@Wilcar I think it is better to update your post with a new example that shows the problem and the expected output. – akrun Feb 06 '16 at 17:24
ok, look at my post, I have many columns... – Wilcar Feb 06 '16 at 17:32
@Wilcar I updated the post with a `base R` method – akrun Feb 06 '16 at 17:48
I changed my title and I edited my question: my problem is related to the size of my table (100 columns) and the presence of data to be modified with a second table and data to keep. – Wilcar Feb 07 '16 at 08:39
@Wilcar I updated the post – akrun Feb 07 '16 at 08:49
it works great. First remark: your method is based on identical column names. I just posted a variant with different names. 2nd remark how to preserve the order of the columns? – Wilcar Feb 07 '16 at 08:57
@Wilcar We can just do `d2[c(1,order(as.numeric(sub('\\D+', '',names(d2)[-1])))+1L)]` – akrun Feb 07 '16 at 08:58
Yes! Sorry for updating my post. – Wilcar Feb 07 '16 at 09:07

Joining multiple tables with mix data

2 Answers2

Update

Update2

Update3