
I am trying to compare 2 dataframes in R:

Keggs <- c("K001", "K002", "K003", "K004", "K005", "K006", "K007", "K008")
names <- c("Acaryochloris", "Proteobacteria", "Parvibaculum", "Alphaproteobacteria", "Rhodospirillum", "Magnetospirillum", "Coraliomargarita", "Bacteria")
family <- c("Proteos", "Cyanobacteria", "Rhizo", "Nostocales", "Bacteroidetes")
species <- c("Alphaproteobacteria", "Purrsia", "Parvibaculum", "Chico", "Rhodospirillum")
res <- data.frame(Keggs, names)
result <- data.frame(family, species) 

Now, what I would like to do is to compare each string in the result$species with the res$names.

If there is a match, I would like for it to return the string that is in result$family of that same row, as well as the string that is in res$Keggs, as a separate dataframe.

The end result would look like this:

> df3
Keggs family
K003  Rhizo
K004  Proteos
K005  Bacteroidetes

I have searched on how to compare data.frames in R and the closest I have found is this: compare df1 column 1 to all columns in df2 returning the index of df2

But that returns TRUE/FALSE values, and my res df has 2 columns.

In my searches I have run into the match() and merge() functions in base R. However, my actual "res" df has 11,000,000 rows and my "result" df has fewer than 1,000 rows. The match documentation gives the usage match(x, table, ...) and says under table: "long vectors are not supported". So I don't think match() or merge() is the most elegant approach, given the sheer size of my actual dfs. I have tried a loop, but my loop skills are limited and I threw in the towel.

I would be incredibly grateful for any insights into this conundrum.

Thank you in advance, Purrsia

  • Have you actually tried the `match` call? 1e7 may seem big, but I think you may be mis-understanding what a "long vector" is to R. Type in `news()` on the console, scroll down to "LONG VECTORS", and read. – r2evans Feb 02 '17 at 05:58
  • Have you tried `merge(res, result, by.x="names", by.y="species")`? – r2evans Feb 02 '17 at 06:14
  • r2evans: First, thank you for the news(). I did not know about this. Great tool to have. I did read: 2^31. So, I am well within my limits. My apologies, I did try the following command: `matched <- data.frame(kegg = res$Keggs, family=result[match(result$species, res$V7), 2])`. And originally got an error due to the differing numbers of rows. – Purrsia Feb 02 '17 at 06:28
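
For reference, here is a minimal base R sketch of the `match()` approach discussed in the comments above, run on the example data from the question. It is only one possible way to do it, not the method from either answer. The helper names `idx` and `hit` are made up for illustration, and `stringsAsFactors = FALSE` is an assumption added so the columns are plain character vectors on any R version. The error about differing numbers of rows goes away once both columns are indexed by the same set of matches.

```r
# Example data from the question.
Keggs <- c("K001", "K002", "K003", "K004", "K005", "K006", "K007", "K008")
names <- c("Acaryochloris", "Proteobacteria", "Parvibaculum", "Alphaproteobacteria",
           "Rhodospirillum", "Magnetospirillum", "Coraliomargarita", "Bacteria")
family <- c("Proteos", "Cyanobacteria", "Rhizo", "Nostocales", "Bacteroidetes")
species <- c("Alphaproteobacteria", "Purrsia", "Parvibaculum", "Chico", "Rhodospirillum")
res <- data.frame(Keggs, names, stringsAsFactors = FALSE)
result <- data.frame(family, species, stringsAsFactors = FALSE)

# 11 million rows is far below the long-vector threshold of 2^31 - 1 elements.
stopifnot(11e6 < .Machine$integer.max)

# For each species, the row of res with the same name (NA when there is none).
idx <- match(result$species, res$names)
hit <- !is.na(idx)

# Index BOTH columns with the same matches so their lengths agree.
df3 <- data.frame(Keggs = res$Keggs[idx[hit]],
                  family = result$family[hit],
                  stringsAsFactors = FALSE)

# Sort by Keggs to mirror the layout asked for in the question.
df3 <- df3[order(df3$Keggs), ]
rownames(df3) <- NULL
print(df3)
```

Note that `match()` returns only the first matching position, so if a name occurs more than once in `res`, the `merge()` call suggested above would be the safer choice because it keeps every matching row.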

2 Answers


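You can try tidyverse functions such as:

library(dplyr)  # provides %>%, inner_join() and select()

# Keep only the rows of res whose names value appears in result$species,
# then keep just the two columns of interest.
df3 <- res %>% 
  inner_join(result, by = c("names" = "species")) %>%
  select(Keggs, family)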

which gives

  Keggs        family
1  K003         Rhizo
2  K004       Proteos
3  K005 Bacteroidetes
Aramis7d
  • At first I kept getting an error that the %>% function was not found, but after doing a search on this site, I learned I have to attach the `dplyr` package. It worked beautifully. Thank you, Aramis. – Purrsia Feb 02 '17 at 23:35
  • :) the `piping` operator `%>%` is mainly from the `magrittr` package, but `tidyverse` conveniently includes both `dplyr` and basic piping operators. – Aramis7d Feb 03 '17 at 04:25
  • That's great info. to have learned. Many thanks! :) – Purrsia Feb 04 '17 at 05:42

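We can use data.table:

library(data.table)
# Join res to result on names == species, drop the rows of result that found
# no match (NA Keggs), then remove the join column.
na.omit(setDT(res)[result, on = c("names" = "species")])[, names := NULL][]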
#   Keggs        family
#1:  K004       Proteos
#2:  K003         Rhizo
#3:  K005 Bacteroidetes
akrun