0

I have a dataframe to which I want to create a new column based on the values from another column but struggling to be able to match properly.

df1
name            code
Player 3        NA
Player 14       NA
Player 16       NA
Player 22       NA
Player 43       NA
Player 45       NA

Now I wish to fill the code column in df1 from the code column in df2 my matching on name

df2
name            id      nationality
Player 1        1       UK
Player 2        2       UK
Player 3        3       UK
Player 4        4       UK
Player 5        5       UK
Player 14       14      UK
Player 16       16      UK
Player 22       22      UK
Player 29       29      UK
Player 30       30      UK
Player 32       32      UK
Player 39       39      UK
Player 43       43      UK
Player 45       45      UK

I dont want to use merge here as df2 will be much bigger than df2 and completely separate, it would be something like; (but I cant get it correct)

df1$code = df2[match(df1$name, df2$name), 'id')
Joe
  • 795
  • 1
  • 11
  • 1
    look at merge's `all.x = TRUE` argument. – Paul Stafford Allen Jul 06 '23 at 13:51
  • 1
    `match` works for this because you are only matching on one column. `merge` works for this too, but generalizes up to matching on multiple columns. It doesn't matter that `df2` is bigger than `df1`, `merge()` will still work just fine. – Gregor Thomas Jul 06 '23 at 13:52
  • 1
    I'm marking this one as a duplicate too - please don't take that as an indicator that you should delete it. Duplicates serve as helpful pointers to the originals where thorough, updates answers live. In addition to the FAQ on joining (which works fine here), I've added the FAQ on matching. – Gregor Thomas Jul 06 '23 at 14:06

3 Answers3

2

match works for this because you are only matching on one column. merge works for this too, but generalizes up to matching on multiple columns.

It doesn't matter that df2 is bigger than df1, merge() will still work just fine as long as you don't override the default and set all = TRUE - if you do that, then you will get all the rows from df2. The default is all = FALSE and you will only get rows that appear in both data frames. Here, I set all.x = TRUE to make sure you keep all rows in df1 even if they don't have matches in df2.

Because merge is more general (working for multiple columns, letting you specify whether you want to keep only rows that occur in df1 or only rows that occur in df2 or both or all), I think it is a better solution when working with data frames. match is a great function when one (or both) of your inputs are plain vectors, not data frames.

## with dplyr
library(dplyr)
df1 |> select(-code) |>
  left_join(select(df2, -id), by = "name")
#        name nationality
# 1  Player 3          UK
# 2 Player 14          UK
# 3 Player 16          UK
# 4 Player 22          UK
# 5 Player 43          UK
# 6 Player 45          UK

## with base R
df1[["code"]] = NULL
merge(df1, df2[c("name", "nationality")], all.x = TRUE)
#        name nationality
# 1 Player 14          UK
# 2 Player 16          UK
# 3 Player 22          UK
# 4  Player 3          UK
# 5 Player 43          UK
# 6 Player 45          UK

Unfortunately merge doesn't keep the row order, but you can easily re-order after.


Using this sample data:

df1 = read.table(text = 'name|code
Player 3|NA
Player 14|NA
Player 16|NA
Player 22|NA
Player 43|NA
Player 45|NA', header = T, sep = "|")

df2 = read.table(text = 'name|id|nationality
Player 1|1|UK
Player 2|2|UK
Player 3|3|UK
Player 4|4|UK
Player 5|5|UK
Player 14|14|UK
Player 16|16|UK
Player 22|22|UK
Player 29|29|UK
Player 30|30|UK
Player 32|32|UK
Player 39|39|UK
Player 43|43|UK
Player 45|45|UK
', header = TRUE, sep = "|")
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
1

I think this is what you need. match returns the index in the second argument of matches from the first.

df1$code = df2$id[match(df1$name, df2$name)]
user438383
  • 5,716
  • 8
  • 28
  • 43
-1

my answer: df1$code = df2[df2$name %in% df1$name,2]

I guess %in% here is almost the same with match. you can also try which(df2$name == df1$name) to return the rows you want

Wei Meng
  • 11
  • 1
  • `%in%` is defined as `match(x, table, nomatch = 0L) > 0L`. So yes it uses `match` – Onyambu Jul 06 '23 at 13:57
  • This is risky - `%in% `doesn't care about order so it will only give the right answer if rows in `df2` happen to be in the same order as the rows in `df1` I would not recommend this as a good solution for that reason. – Gregor Thomas Jul 06 '23 at 14:03