Matching dataframes by id fails for some values

Question

I have a very large dataset with a few thousand missing values, which looks something like the below:

df1:
                            doi journal year
1  10.1037/0002-9432.76.1.13    <NA>   NA
2  10.1037/0002-9432.76.1.13    <NA>   NA
3  10.1037/0002-9432.76.1.13    <NA>   NA
4 10.1037/0003-066X.60.8.750    <NA>   NA
5 10.1037/0003-066X.60.8.750    <NA>   NA
6 10.1037/0003-066X.60.8.750    <NA>   NA

I have another dataframe which contains all missing journal names and years:

df2:
                          doi year                             journal
17  10.1037/0002-9432.76.1.13 2006 American Journal of Orthopsychiatry
18  10.1037/0002-9432.76.1.13 2006 American Journal of Orthopsychiatry
19  10.1037/0002-9432.76.1.13 2006 American Journal of Orthopsychiatry
31 10.1037/0003-066x.60.8.750 2005               American Psychologist
32 10.1037/0003-066x.60.8.750 2005               American Psychologist
33 10.1037/0003-066x.60.8.750 2005               American Psychologist

Yet when I try to match the two by their doi value

df1$year[is.na(df1$year)] <- df2$year[match(df1$doi[is.na(df1$year)], df2$doi)]
df1$journal[is.na(df1$journal)] <- df2$journal[match(df1$doi[is.na(df1$journal)], df2$doi)]

this only works for some:

Result:
                             doi                             journal year
1  10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
2  10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
3  10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
4 10.1037/0003-066X.60.8.750                                <NA>   NA
5 10.1037/0003-066X.60.8.750                                <NA>   NA
6 10.1037/0003-066X.60.8.750                                <NA>   NA

I have tried different ways of matching the dataframes (like this or this), as well as trimming the white-space when loading the dataframes, but no success. "doi" and "journal" are character vectors, and "year" is an integer. Very thankful if anyone has some insight.

R is case sensitive so your problem is within the way the doi is displayed in the values the produce the NA's you have a capital X in your doi in one df but not in the other — Sarina, Apr 28 '17 at 11:15
Ading to @Sarina's comment: Using `tolower` like this `match(df1$doi[is.na(df1$year)], tolower(df2$doi))` could solve the problem. — Jaap, Apr 28 '17 at 11:39

Uwe · Answer 1 · 2017-05-05T08:34:27.997

The OP mentioned that he has a very large dataset with a few thousand missing values.

This is why I felt challenged to suggest a data.table solution using update on join instead of base R match() although the actual problem has been solved already by using tolower().

library(data.table)

#prepare doi
setDT(df1)[, doi := tolower(doi)]
setDT(df2)[, doi := tolower(doi)]

#join
df1[unique(df2), on = "doi", `:=`(year = i.year, journal = i.journal)]

df1
#                          doi                             journal year
#1:  10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
#2:  10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
#3:  10.1037/0002-9432.76.1.13 American Journal of Orthopsychiatry 2006
#4: 10.1037/0003-066X.60.8.750               American Psychologist 2005
#5: 10.1037/0003-066X.60.8.750               American Psychologist 2005
#6: 10.1037/0003-066X.60.8.750               American Psychologist 2005

Note that this will replace all values of year and journal in df1 by the values given in df2 for matching doi irrespective if they are NA or not.

There is a variant which does only replace values if they are NA:

df1[unique(df2), on = "doi", 
    `:=`(year = replace(year, is.na(year), i.year), 
         journal = replace(journal, is.na(journal), i.journal))]

Benchmark

For a speed comparison between the three methods, df1 has been appended to itself so that it has about 100'000 rows. As df1 is updated in itself, each benchmark run has to start with a fresh copy. The copy operation is included in the benchmark as well.

microbenchmark::microbenchmark(
  copy = df1 <- copy(df1_orig),
  OP_match = {
    df1 <- copy(df1_orig)
    df1$year[is.na(df1$year)] <- df2$year[match(df1$doi[is.na(df1$year)], df2$doi)]
    df1$journal[is.na(df1$journal)] <- df2$journal[match(df1$doi[is.na(df1$journal)], df2$doi)]
  },
  update_on_join = {
    df1 <- copy(df1_orig)
    df1[unique(df2), on = "doi", `:=`(year = i.year, journal = i.journal)]
  },
  replace_on_join = {
    df1 <- copy(df1_orig)
    df1[unique(df2), on = "doi", 
        `:=`(year = replace(year, is.na(year), i.year), 
             journal = replace(journal, is.na(journal), i.journal))]
  },
  times = 100L
)

The results show that _update_on_join_ is nearly three times faster than base R using match() for this case:

Unit: microseconds
            expr       min        lq      mean    median        uq        max neval
            copy   760.449   978.691  1129.290  1071.388  1202.974   2085.383   100
        OP_match 12376.362 14532.352 16215.333 15295.821 17497.497  35352.941   100
  update_on_join  5101.879  5585.939  6136.479  5914.435  6416.240   9272.643   100
 replace_on_join  7998.306  8729.303 11822.586  9367.416  9802.767 227385.521   100

Data

library(data.table)
df1 <- fread(
  "rn                    doi journal year
1  10.1037/0002-9432.76.1.13    <NA>   NA
2  10.1037/0002-9432.76.1.13    <NA>   NA
3  10.1037/0002-9432.76.1.13    <NA>   NA
4  10.1037/0003-066X.60.8.750   <NA>   NA
5  10.1037/0003-066X.60.8.750   <NA>   NA
6  10.1037/0003-066X.60.8.750   <NA>   NA",
  drop = 1, na.strings = c("NA", "<NA>"))
df1[, journal := as.character(journal)]
df1[, year := as.integer(year)]

df2 <- fread(
  "rn,                       doi, year,                             journal
  17,  10.1037/0002-9432.76.1.13, 2006, American Journal of Orthopsychiatry
  18,  10.1037/0002-9432.76.1.13, 2006, American Journal of Orthopsychiatry
  19,  10.1037/0002-9432.76.1.13, 2006, American Journal of Orthopsychiatry
  31, 10.1037/0003-066x.60.8.750, 2005,               American Psychologist
  32, 10.1037/0003-066x.60.8.750, 2005,               American Psychologist
  33, 10.1037/0003-066x.60.8.750, 2005,               American Psychologist",
  drop = 1)

#prepare doi
df1[, doi := tolower(doi)]
df2[, doi := tolower(doi)]

#create benchmark data

df1_orig <- copy(df1)
df2_orig <- copy(df2)

for (i in seq_len(14L)) df1_orig <- rbind(df1_orig, df1_orig)
nrow(df1_orig)

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

Much obliged @Sarina and @Jaap, your comments were correct and tolowersolved the issue for me.

R is case sensitive so your problem is within the way the doi is displayed in the values the produce the NA's you have a capital X in your doi in one df but not in the other – Sarina

Ading to @Sarina's comment: Using tolower like this match(df1$doi[is.na(df1$year)], tolower(df2$doi)) could solve the problem. – Jaap

Matching dataframes by id fails for some values

2 Answers2

Benchmark

Data