0

I am aware this is a common question. However the common issue when combining dataframes in R is that there are overlapping column, and that the columns need functions.

This is not the case for my issue. I currently have 2 dataframes in R. They have a completely distinct set of columns except for an ID column common to both.

There is a primary and secondary dataset (this is important because not all of the values in the primary are present in the secondary)

I would like to add 2 columns from the secondary to the primary dataframe.

So the current method I use is theo iterate over all of the rows of the primary. Check if it is present in the secondary and if it is, add the rows by

primary$newCol1 <- Secondary$newCol1[Matching stuff here]

The main issue is that the datasets are massive, around 9 million rows each and a good deal of columns. Thus the process is extremely slow. I believe I have a correct solution. But the runtime is impossible in terms of days.

A quick example would be

rowC = nrow(dataframe1)
for(i in 1:rowC){
matchedRow = Dataframe1$id[i] == dataframe2$id
if dataframe1$id[i] %in% dataframe2$id{
dataframe1$newcol[i] <- dataframe2$newcol[matchedRow]
}
else{dataframe1$newcol[i]<- 'Unknown'}

Any help would be greatly appreciated

Mathias711
  • 6,568
  • 4
  • 41
  • 58
user1874538
  • 262
  • 2
  • 15

1 Answers1

1

Try the merge function.

myResult = merge(dataframe1, dataframe2)

You'll sort everything out after looking at the help page, but know that you may specify which columns to join on (in this case that would be the common 'id' column) and you can specify outer joins. In your case, you may consider subsetting the number of columns in the dataframe as follows:

myResult = merge(dataframe1, dataframe2[, c(columnIwant, anotherColumn)])
PirateGrunt
  • 404
  • 4
  • 11
  • Hi, thanks for the quick reply. My concern with merge is the time complexity of the dataframe size. See the SO link benchmark below http://stackoverflow.com/questions/4322219/whats-the-fastest-way-to-merge-join-data-frames-in-r – user1874538 Jun 09 '14 at 15:15
  • I've been seeing huge speedups using `inner_join()` and `left_join()` from the `dplyr` package. It's anywhere from 10-1000x than `merge()`. – rrs Jun 09 '14 at 18:11