4

I have large dataframe called df with some ID's.

I have another dataframe (id_list) with a set of matching ID's and its associated features for each ID. The ID are not sequentally ordered in both dataframes.

Effectively i would like to look up from the larger dataframe df to the id_list and add two columns namely Display and Type to the current dataframe df.

There are numerous confusing examples. What could be the most effective way of doing this. I tried using match() , %in% and failed miserably.

Here is a reproducible example.

df <- data.frame(Feats = matrix(rnorm(20), nrow = 20, ncol = 5), ID = sample.int(10, 10))

id_list <- data.frame(ID = sample.int(10,10),
           Display = sample(c('clear', 'blur'), 20, replace = TRUE),
           Type = sample(c('red', 'green', 'blue', 'indigo', 'yellow'), 20, replace = TRUE))

           Feats.1     Feats.2     Feats.3     Feats.4     Feats.5 ID
1   3.14944573 -0.52285062  3.14944573 -0.52285062  3.14944573  2
2  -0.41096007  0.38256691 -0.41096007  0.38256691 -0.41096007  1
3   0.03629351 -0.02514005  0.03629351 -0.02514005  0.03629351  7
4   0.91257290  1.35590761  0.91257290  1.35590761  0.91257290  5
5  -0.26927311 -2.10213773 -0.26927311 -2.10213773 -0.26927311  3
6   3.14944573 -0.52285062  3.14944573 -0.52285062  3.14944573  4
7  -0.41096007  0.38256691 -0.41096007  0.38256691 -0.41096007 10
8   0.03629351 -0.02514005  0.03629351 -0.02514005  0.03629351  6
9   0.91257290  1.35590761  0.91257290  1.35590761  0.91257290  8
10 -0.26927311 -2.10213773 -0.26927311 -2.10213773 -0.26927311  9

  ID Display   Type
1   6   clear indigo
2   1    blur   blue
3   7   clear    red
4   4   clear    red
5   3    blur    red
6  10   clear yellow
7   2   clear   blue
8   8    blur  green
9   5   clear   blue
10  9   clear  green

The resulting end df should be of size [20 x 8].

Vinícius Félix
  • 8,448
  • 6
  • 16
  • 32
der_radler
  • 549
  • 1
  • 6
  • 17
  • 3
    you're looking for `merge` in base R or `left_join` from `dplyr` package – moodymudskipper Jul 13 '18 at 15:14
  • Possible duplicate of https://stackoverflow.com/q/1299871/3358272 – r2evans Jul 13 '18 at 15:16
  • this explains you what joins are (the question and answers don't include any SQL so don't be intimidated by the title): https://stackoverflow.com/questions/11739324/what-is-a-left-join-in-postgresql – moodymudskipper Jul 13 '18 at 15:30
  • Does this answer your question? [How to join (merge) data frames (inner, outer, left, right)](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) – camille Oct 01 '21 at 17:31

2 Answers2

3

You can use merge from base R or left_join from dplyr to do this pretty easily. (There's also data.table::merge, which maybe someone else can give an answer with.) You probably want to take steps to ensure that you don't lose any data if there's an entry in your data frame that doesn't have a corresponding ID in the lookup. If that's not the case, you can change all.x to false or null in merge, or switch from left_join to inner_join. To illustrate, I added a dummy row to the data with an ID that doesn't exist in the lookup table.

df <- data.frame(Feats = matrix(rnorm(10), nrow = 5, ncol = 5), ID = sample.int(10, 10))
dummy <- df[1, ]
dummy$ID <- 12
df <- rbind(dummy, df)

id_list <- data.frame(ID = sample.int(10,10),
                      Display = sample(c('clear', 'blur'), 10, replace = TRUE),
                      Type = sample(c('red', 'green', 'blue', 'indigo', 'yellow'), 10, replace = TRUE))

With merge, you set either by as the column name from both data frames to join by, or by.x and by.y if they have different names. all.x = T will keep all observations in the first data frame even if they don't match an observation in the second data frame.

merged1 <- merge(df, id_list, by = "ID", sort = F, all.x = T)
merged1
#>    ID     Feats.1    Feats.2     Feats.3    Feats.4     Feats.5 Display
#> 1  10 -1.44053344  1.0086988 -1.44053344  1.0086988 -1.44053344   clear
#> 2   5  0.99220217 -0.3125813  0.99220217 -0.3125813  0.99220217   clear
#> 3   2  1.03881289  1.1277627  1.03881289  1.1277627  1.03881289   clear
#> 4   7 -0.01678186 -0.1519029 -0.01678186 -0.1519029 -0.01678186   clear
#> 5   4  0.07130125  1.1715833  0.07130125  1.1715833  0.07130125   clear
#> 6   6 -1.44053344  1.0086988 -1.44053344  1.0086988 -1.44053344   clear
#> 7   8  0.99220217 -0.3125813  0.99220217 -0.3125813  0.99220217    blur
#> 8   3  1.03881289  1.1277627  1.03881289  1.1277627  1.03881289   clear
#> 9   1 -0.01678186 -0.1519029 -0.01678186 -0.1519029 -0.01678186   clear
#> 10  9  0.07130125  1.1715833  0.07130125  1.1715833  0.07130125   clear
#> 11 12 -1.44053344  1.0086988 -1.44053344  1.0086988 -1.44053344    <NA>
#>      Type
#> 1  indigo
#> 2  yellow
#> 3    blue
#> 4  indigo
#> 5  yellow
#> 6  indigo
#> 7   green
#> 8     red
#> 9     red
#> 10   blue
#> 11   <NA>

dplyr::left_join keeps all observations from the first data frame and merges in any matching ones from the second.

joined <- dplyr::left_join(df, id_list, by = "ID")
head(joined)
#>       Feats.1    Feats.2     Feats.3    Feats.4     Feats.5 ID Display
#> 1 -1.44053344  1.0086988 -1.44053344  1.0086988 -1.44053344 12    <NA>
#> 2 -1.44053344  1.0086988 -1.44053344  1.0086988 -1.44053344 10   clear
#> 3  0.99220217 -0.3125813  0.99220217 -0.3125813  0.99220217  5   clear
#> 4  1.03881289  1.1277627  1.03881289  1.1277627  1.03881289  2   clear
#> 5 -0.01678186 -0.1519029 -0.01678186 -0.1519029 -0.01678186  7   clear
#> 6  0.07130125  1.1715833  0.07130125  1.1715833  0.07130125  4   clear
#>     Type
#> 1   <NA>
#> 2 indigo
#> 3 yellow
#> 4   blue
#> 5 indigo
#> 6 yellow

Created on 2018-07-13 by the reprex package (v0.2.0).

camille
  • 16,432
  • 18
  • 38
  • 60
  • 1
    It works on this, but my actual `df ` and `id_list` has multiple values for the same ID field. For eg. ID == 2 has 100 rows. So what this does is, it joins both the df's and results in a new df of double the original df. Sry if i didnt convey this correctly. – der_radler Jul 16 '18 at 09:44
  • Yeah, that wasn't in the original question. If you have the same ID multiple times in both the data and the ID lookup, how do you plan on differentiating between them? – camille Jul 16 '18 at 14:34
0

If you can make sure that both data frames df and id_list have same ID's (only in different order) then you could try out the following:

# define new data frame
orderd_id_list <- data.frame()

# loop over rows of df (get new ID each round)
for (i in seq(nrow(df))) {
  # find the row in id_list where the ID "id_list$ID" is identical to
  # current ID in df for this round "df$ID[i]"
  new_row <- id_list[id_list$ID == df$ID[i],]
  # add new row to orderd_id_list
  orderd_id_list <- rbind(orderd_id_list, new_row)
}

# merge (add columns) Display and Type columns of new orderd data frame with df
merged_df <- cbind(Display = orderd_id_list$Display, Type = orderd_id_list$Type, df)

Using for-loops in R isn't the optimal thing to do, but if your data frame isn't too large, then it would be ok.

Basically you create a new ordered (ordered based on df$ID) copy of id_list and then merge it with df.

I hope that could help you :)

Ferand Dalatieh
  • 313
  • 1
  • 4
  • 14
  • I don't think there's a compelling reason to use for loops, since R is designed to do vector operations and functions to do this exist in base R (`merge`). What fail-safe is there for cases where an ID is in one data frame but not the other? – camille Jul 13 '18 at 18:34
  • Yes you are right. There isn't a spesific reason to use a for-loop in R. I Just wanted to write a quick suggestion as an answer. – Ferand Dalatieh Jul 17 '18 at 22:28