How to compare and filter 2 character data frames in RStudio?

Question

I have 2 different character data frames for some titles of books like this:

TITLE	YEAR
TITLE1	2006
TITLE11	2009
TTILE 24	2010

TITLE	YEAR
TITLE12	2008
TITLE 24	2010
TTILE 1	2006

I made this code:

require(dplyr)

df1 <- data.frame(Google_Scholar_Publicaciones)

df2 <- data.frame(Publicaciones_1_)

semi_join(df1$TITLE,df2$TITLE)

But I got this error:

Error in UseMethod("semi_join") : no applicable method for 'semi_join' applied to an object of class "character"

How can I compare the two data frames and get the titles that do not appear in both of them? For example, I want to obtain titles 11 and 1, which are not part of both data frames, in a new data set or variable.

To provide a good [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), it is helpful to provide some data. You can do this for both of your dataframes using `dput` (i.e., `dput(head(df1))` and `dput(head(df2))`), then paste the results here. — AndrewGB, Jun 19 '21 at 19:28
dput(head(df1)) structure(list(TITLE = c("La autonomía municipal y su garantía constitucional directa de protección", "El Pacto Federal como Cláusula Institucional del Estado Constitucional", "Premisas metodológicas para una investigación de derecho comparado de las....", "El principio de proporcionalidad y la política pública", "La utilización del método com...." ), YEAR = c(2005, 2008, 2008, 2002, 2016, 2011)), row.names = c(NA, 6L), class = "data.frame") — Marcela Marroquín, Jun 19 '21 at 19:33
> dput(head(df2)) structure(list(TITLE = c("Prospectiva de oferta y demanda laboral en Sonora, 2005-2020", "El mercado laboral en México: perspectiva 2000-2020", "El (Des) Empleo reciente en México, su perspectiva y sus requerimientos financieros", "\"Nota Crítica. \0342006-2012: ¿El sexenio del empleo\035?, .\"", "Perspectiva del (des)empleo en Sonora, 2000-2020", "La migración a Estados Unidos y la frontera Noreste de México." ), YEAR = c(2007, 2006, 2005, 2007, 2004, 2007)), row.names = c(NA, 6L), class = "data.frame") — Marcela Marroquín, Jun 19 '21 at 19:33
Your `semi_join` command should be `semi_join(df1, df2, by = 'TITLE')` — akrun, Jun 19 '21 at 19:34
Thank you :). I just want to obtain the values that differ between the two data frames; the year isn't important at all — Marcela Marroquín, Jun 19 '21 at 19:34

score 2 · Answer 1 · answered Jun 19 '21 at 20:52

As suggested by @akrun in the comments, you want `semi_join(df1, df2, by = 'TITLE')`, which keeps the rows of `df1` whose `TITLE` also appears in `df2`:

library(dplyr, warn.conflicts = FALSE)

df1 <-
  structure(list(
    TITLE = c(
      "La autonomía municipal y su garantía constitucional directa de protección",
      "El Pacto Federal como Cláusula Institucional del Estado Constitucional",
      "Premisas metodológicas para una investigación de derecho comparado de las....",
      "El principio de proporcionalidad y la política pública",
      "La utilización del método com....",
      "La migración a Estados Unidos y la frontera Noreste de México."
    ),
    YEAR = c(2005, 2008, 2008, 2002, 2016, 2011)
  ),
  row.names = c(NA, 6L),
  class = "data.frame")

df2 <-
  structure(list(
    TITLE = c(
      "Prospectiva de oferta y demanda laboral en Sonora, 2005-2020",
      "El mercado laboral en México: perspectiva 2000-2020",
      "El (Des) Empleo reciente en México, su perspectiva y sus requerimientos financieros",
      "\"Nota Crítica. \0342006-2012: ¿El sexenio del empleo\035?, .\"",
      "Perspectiva del (des)empleo en Sonora, 2000-2020",
      "La migración a Estados Unidos y la frontera Noreste de México."
    ),
    YEAR = c(2007, 2006, 2005, 2007, 2004, 2007)
  ),
  row.names = c(NA, 6L),
  class = "data.frame")

df1 %>% 
  semi_join(df2, by = 'TITLE')
#>                                                            TITLE YEAR
#> 1 La migración a Estados Unidos y la frontera Noreste de México. 2011

Created on 2021-06-19 by the reprex package (v2.0.0)
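
Since the question asks for the titles that are not shared, the complement of a semi join may be closer to what is wanted. The following is only a minimal sketch, assuming dplyr's `anti_join()` and `bind_rows()`. The data frames `a` and `b` are made-up stand-ins for `df1` and `df2`, not the asker's real data.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical stand-ins for df1 and df2 (illustrative values only)
a <- data.frame(TITLE = c("TITLE 1", "TITLE 11", "TITLE 24"),
                YEAR  = c(2006, 2009, 2010),
                stringsAsFactors = FALSE)
b <- data.frame(TITLE = c("TITLE 12", "TITLE 24", "TITLE 1"),
                YEAR  = c(2008, 2010, 2006),
                stringsAsFactors = FALSE)

# Rows of `a` whose TITLE has no match in `b`
only_a <- anti_join(a, b, by = "TITLE")

# Rows of `b` whose TITLE has no match in `a`
only_b <- anti_join(b, a, by = "TITLE")

# Stack both results: every title that appears in only one data frame
not_shared <- bind_rows(only_a, only_b)
not_shared$TITLE
#> [1] "TITLE 11" "TITLE 12"
```

Note that both arguments must be data frames. Passing a column such as `df1$TITLE` hands the function a character vector, which is what triggers the "no applicable method" error in the question.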

score 1 · Answer 2 · answered Jun 19 '21 at 20:27

Something like this?

set.seed(42)
x <- sample(c(LETTERS, letters), 25) # random selection of letters
y <- sample(c(LETTERS, letters), 25) 
x[ !x %in% y]  # letters in x that are not in y
#  [1] "w" "k" "A" "Y" "J" "G" "u" "z" "y" "C" "a" "n" "l" "m"
y[ !y %in% x]  # letters in y that are not in x
#  [1] "g" "d" "q" "O" "V" "D" "t" "v" "h" "i" "x" "W" "o" "F"
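
The same `%in%` idea can be applied to the `TITLE` columns of two data frames without any extra packages. This is a rough sketch in base R; `a` and `b` are hypothetical stand-ins for `df1` and `df2`, with made-up values.

```r
# Hypothetical stand-ins for df1 and df2 (illustrative values only)
a <- data.frame(TITLE = c("TITLE 1", "TITLE 11", "TITLE 24"),
                YEAR  = c(2006, 2009, 2010),
                stringsAsFactors = FALSE)
b <- data.frame(TITLE = c("TITLE 12", "TITLE 24", "TITLE 1"),
                YEAR  = c(2008, 2010, 2006),
                stringsAsFactors = FALSE)

# Keep the rows of each data frame whose TITLE is missing from the other
a_only <- a[!a$TITLE %in% b$TITLE, ]
b_only <- b[!b$TITLE %in% a$TITLE, ]

# If only the titles matter, take the symmetric difference of the two columns
unshared_titles <- union(setdiff(a$TITLE, b$TITLE),
                         setdiff(b$TITLE, a$TITLE))
unshared_titles
#> [1] "TITLE 11" "TITLE 12"
```

The comparison is an exact string match, so titles that differ only by a typo, letter case, or spacing would be reported as unshared.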
