
Possible Duplicate:
In R, how do I subset a data.frame by values from another data.frame?

I have two data.frames. The first (df1) is a single column of 100 entries with the header "names". The second (df2) is a data.frame containing hundreds of columns of metadata for tens of thousands of entries. The first column of df2 also has the header "names".

I simply want to select all the metadata in df2 by the subset of names found in df1.

Please help this novice R user. Thank you!

Paul Kadota
  • Welcome to SO. I'd suggest reading any number of the R FAQs on the internet. I don't say this in a mean way, but they have tons of great information that will answer questions for you before they come up. However, you would do this like: `df2[df2$names %in% df1$names,]` using the function `[`. – Justin Dec 06 '12 at 21:04
  • You also appear to be a novice at StackOverflow, welcome! You asked a previous (nearly identical) question that received a highly upvoted answer that you then never responded to. (For example, by commenting or clicking the check mark to accept an answer as correct.) You can edit questions to clarify/improve them, which is preferred over asking a whole new question. – joran Dec 06 '12 at 21:06
  • Thanks for the great response! I scoured the internet and this site before posting my question. My limited understanding of programming languages prevented me from seeing the connection between this and my previous question. I want to learn R, but it is my first programming language. Do you have any recommendations for R help books? – Paul Kadota Dec 06 '12 at 22:04
  • The reason the two questions seem identical to us, but not to you, is that rather than provide a small, reproducible example that we can run ourselves, you simply describe vaguely your data and intentions. See [here](http://stackoverflow.com/q/5963269/324364) for how to write better questions. – joran Dec 06 '12 at 22:16

1 Answer

You can subset a data.frame with %in%, but it can be slow if you have many thousands of names to look up.
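For reference, the %in% approach can be sketched in base R as follows. The data here is made up for illustration (the column names `score` and `group` are hypothetical stand-ins for the metadata columns), and `merge()` is shown only as an equivalent base R alternative, not something the question requires.

```r
# Hypothetical data mirroring the question: df1 holds the names to keep,
# df2 holds the metadata, and both share a "names" column.
df1 <- data.frame(names = c("a", "c", "x"))
df2 <- data.frame(names = c("a", "b", "c", "d"),
                  score = c(10, 20, 30, 40),
                  group = c("g1", "g1", "g2", "g2"))

# Keep only the rows of df2 whose name also appears in df1.
subset_df2 <- df2[df2$names %in% df1$names, ]

# merge() performs an inner join on the shared column and keeps the same rows.
merged <- merge(df1, df2, by = "names")
```

Names that appear only in df1 ("x" in this sketch) are simply absent from both results.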

I would recommend using data.table: it sorts a table by its key columns and uses a binary search on that key, so a database-style join stays fast even with millions of records. Read the data.table documentation for more information.

Suppose you have a big data.frame and little data.frame:

library(data.table)
big <- data.frame(names=1:5, data=1:5)
small <- data.frame(names=c(1, 3, 6))

Make them into data.table objects and set the key column to be names.

big <- data.table(big, key='names')
small <- data.table(small, key='names')

Now perform the join. [] in data.table allows a data.table to be indexed by the key column of another data.table. In this case, the result has one row for each name in small: names that match carry the corresponding columns from big, and names that are in small but not in big (6 here) appear with NA in the data columns.

big[small]
#    names data
# 1:     1    1
# 2:     3    3
# 3:     6   NA
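The join above keeps name 6 with NA. If only the names present in both tables are wanted, as the question describes, the `nomatch` argument of `[.data.table` can drop the unmatched rows. The following is a minimal sketch that rebuilds the same toy tables; the names `matched` and `matched_on` are illustrative only.

```r
library(data.table)

# Same toy data as above, built directly as keyed data.tables.
big <- data.table(names = 1:5, data = 1:5, key = "names")
small <- data.table(names = c(1L, 3L, 6L), key = "names")

# nomatch = 0 drops rows of small that have no match in big (an inner join).
matched <- big[small, nomatch = 0]

# on = names the join column explicitly, so the join does not rely on keys.
matched_on <- big[small, on = "names", nomatch = 0]
```

Both calls should return only the rows for names 1 and 3.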
anoop