0

I have two dataframes (df1, df2).

df1 <- data.frame(term = c("A", "B", "C", "D", "E", "F"))

df2 <- data.frame(term = c("C", "F", "G"), freq = c(7, 3, 5))

In df1, I want to add a column ("freq") based on the values of "freq" in df2. So if the term in df1 and the term in df2 match, the count ("freq") of this term should be added to df1. Else it should be "0" (zero).

How can I do it, so that the processing time is as small as possible? Is there a way how to do it with dplyr? I cannot figure it out!!!

feder80
  • 1,195
  • 3
  • 13
  • 34

1 Answers1

1

If we need a faster option, a data.table join can be used along with assigning (:=) the NA values to 0 in place.

library(data.table)
setDT(df2)[df1, on = "term"][is.na(freq), freq := 0][]

Or to avoid copies, as @Arun mentioned, create a 'freq' column in 'df1' and then join on 'term' replace the 'freq' with the corresponding 'i.freq' values.

setDT(df1)[, freq := 0][df2, freq := i.freq, on = "term"]

Or use left_join

library(dplyr)
left_join(df1, df2, by = 'term') %>%
       mutate(freq = replace(freq, is.na(freq), 0)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    I'd do: `setDT(df1)[, freq := 0][df2, freq := i.freq, on = "term"]` which avoids unnecessary copies. – Arun May 18 '16 at 08:36