Comparing 2 dataframes in R: Searching a string from df1$V2 in df2$V2 and returning string in df2$V1

Question

I am trying to compare 2 dataframes in R:

Keggs <- c("K001", "K002", "K003", "K004", "K005", "K006", "K007", "K008")
names <- c("Acaryochloris", "Proteobacteria", "Parvibaculum", "Alphaproteobacteria", "Rhodospirillum", "Magnetospirillum", "Coraliomargarita", "Bacteria")
family <- c("Proteos", "Cyanobacteria", "Rhizo", "Nostocales", "Bacteroidetes")
species <- c("Alphaproteobacteria", "Purrsia", "Parvibaculum", "Chico", "Rhodospirillum")
res <- data.frame(Keggs, names)
result <- data.frame(family, species)

Now, what I would like to do is to compare each string in the result$species with the res$names.

If there is a match, I would like for it to return the string that is in result$family of that same row, as well as the string that is in res$Keggs, as a separate dataframe.

The end result would look like this:

> df3
Keggs family
K003  Rhizo
K004  Proteos
K005  Bacteroidetes

I have searched for how to compare data.frames in R, and the closest I have found is this question: "compare df1 column 1 to all columns in df2 returning the index of df2".

But that returns TRUE/FALSE, and my res df has 2 columns.

In my searches I have run into the match() and merge() functions in base R. However, I am working with a "res" df that has 11,000,000 rows, and my "result" df has fewer than 1,000 rows. The match documentation gives the usage match(x, table, ...) and says under table: "long vectors are not supported". So I don't think the match() or merge() approach is the most elegant, given the sheer size of my actual dfs. I have tried a loop, but my loop skills are limited and I threw in the towel.

I would be incredibly grateful for any insights into this conundrum.

Thank you in advance, Purrsia

Have you actually tried the `match` call? 1e7 may seem big, but I think you may be mis-understanding what a "long vector" is to R. Type in `news()` on the console, scroll down to "LONG VECTORS", and read. — r2evans, Feb 02 '17 at 05:58
Have you tried `merge(res, result, by.x="names", by.y="species")`? — r2evans, Feb 02 '17 at 06:14
r2evans: First, thank you for the news(). I did not know about this. Great tool to have. I did read: 2^31. So, I am well within my limits. My apologies, I did try the following command: `matched <- data.frame(kegg = res$Keggs, family=result[match(result$species, res$V7), 2])`. It originally gave an error because the two columns had different numbers of rows. — Purrsia, Feb 02 '17 at 06:28
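
For reference, here is a minimal base-R sketch of the `match()` and `merge()` routes discussed in the comments above, run on the example data from the question. It is an illustration under assumptions, not the commenter's own code: the columns are built as character vectors (`stringsAsFactors = FALSE`), and the helper names `idx`, `hit`, and `df3_merge` are made up here. The likely cause of the row-count error is that `res$Keggs` (all rows of `res`) was paired with one value per row of `result`; indexing `res$Keggs` by the `match()` result keeps both columns the same length.

```r
# Example data from the question, stored as character columns
res <- data.frame(
  Keggs = c("K001", "K002", "K003", "K004", "K005", "K006", "K007", "K008"),
  names = c("Acaryochloris", "Proteobacteria", "Parvibaculum",
            "Alphaproteobacteria", "Rhodospirillum", "Magnetospirillum",
            "Coraliomargarita", "Bacteria"),
  stringsAsFactors = FALSE
)
result <- data.frame(
  family  = c("Proteos", "Cyanobacteria", "Rhizo", "Nostocales", "Bacteroidetes"),
  species = c("Alphaproteobacteria", "Purrsia", "Parvibaculum", "Chico", "Rhodospirillum"),
  stringsAsFactors = FALSE
)

# For each species, match() gives the row of res with that name, or NA
idx <- match(result$species, res$names)
hit <- !is.na(idx)

# Keep only the matched pairs, so both columns have the same length
df3 <- data.frame(
  Keggs  = res$Keggs[idx[hit]],
  family = result$family[hit],
  stringsAsFactors = FALSE
)

# merge() performs the same inner join in a single call
df3_merge <- merge(res, result, by.x = "names", by.y = "species")[, c("Keggs", "family")]

print(df3)
print(df3_merge)
```

One caveat: `match()` returns only the first matching row of `res` for each species, whereas `merge()` returns every matching row. If names repeat in the real 11,000,000-row table, the two approaches will give different row counts.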

score 0 · Answer 1 · answered Feb 02 '17 at 06:05

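You can try the tidyverse functions, for example:

library(dplyr)  # provides inner_join(), select() and the %>% pipe

# keep only the rows of res whose names appear in result$species
df3 <- res %>%
  inner_join(result, by = c("names" = "species")) %>%
  select(Keggs, family)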

which gives

  Keggs        family
1  K003         Rhizo
2  K004       Proteos
3  K005 Bacteroidetes

Aramis7d


At first I kept getting an error that the %>% function was not found, but after doing a search on this site, I learned I have to attach the `dplyr` package. It worked beautifully. Thank you, Aramis. – Purrsia Feb 02 '17 at 23:35
:) the piping operator `%>%` comes from the `magrittr` package, but `tidyverse` conveniently includes both `dplyr` and the basic piping operators. – Aramis7d Feb 03 '17 at 04:25
That's great info. to have learned. Many thanks! :) – Purrsia Feb 04 '17 at 05:42
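
As a small illustration of the pipe discussed in these comments (a sketch, assuming the `dplyr` package is installed; the variable names are made up here): `x %>% f(y)` is evaluated as `f(x, y)`, so a piped chain and the equivalent nested call give the same value.

```r
library(dplyr)  # attaching dplyr also makes %>% available

x <- c(4, 9, 16)

# Piped form: take x, apply sqrt(), then sum()
piped <- x %>% sqrt() %>% sum()

# Equivalent nested form
nested <- sum(sqrt(x))

print(identical(piped, nested))
```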

score 0 · Answer 2 · answered Feb 02 '17 at 06:13

We can use data.table:

library(data.table)
na.omit(setDT(res)[result, on = c("names" = "species")])[, names := NULL][]
#   Keggs        family
#1:  K004       Proteos
#2:  K003         Rhizo
#3:  K005 Bacteroidetes

akrun


`na.omit` is a very inefficient function, you could just specify `, nomatch = 0L` – David Arenburg Feb 02 '17 at 08:10
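
A sketch of the variant this comment suggests, assuming the `data.table` package is installed and using the example data from the question: `nomatch = 0L` drops the unmatched rows during the join itself, so no `na.omit()` pass is needed afterwards. Selecting the two columns with `.(Keggs, family)` is a choice made for this illustration and is not part of the original answer.

```r
library(data.table)

# Example data from the question
res <- data.frame(
  Keggs = c("K001", "K002", "K003", "K004", "K005", "K006", "K007", "K008"),
  names = c("Acaryochloris", "Proteobacteria", "Parvibaculum",
            "Alphaproteobacteria", "Rhodospirillum", "Magnetospirillum",
            "Coraliomargarita", "Bacteria"),
  stringsAsFactors = FALSE
)
result <- data.frame(
  family  = c("Proteos", "Cyanobacteria", "Rhizo", "Nostocales", "Bacteroidetes"),
  species = c("Alphaproteobacteria", "Purrsia", "Parvibaculum", "Chico", "Rhodospirillum"),
  stringsAsFactors = FALSE
)

# Join res to result on names == species; nomatch = 0L keeps matched rows only
df3 <- setDT(res)[result, on = c("names" = "species"), nomatch = 0L][, .(Keggs, family)]

print(df3)
```

Rows come back in the order of `result`, so K004 is listed before K003, as in the answer's output.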
