1

I would like to understand what does the following code actually do. My intention is to populate the Pop column in df2 with the data in "Pop" column of df1, matching the rows by column "ID".

df2$Pop <- df1[df1$ID == df2$ID,]$Pop

It seems straight forward if the rows are not ordered (it can just look for the one that matches IDs), but what if one data frame is bigger than the other (has more rows)? Does the order of comarison matter? I am not sure what to expect from the previous line of code. Does it work just like merge (if in df1 I had only ID and Pop columns)? If so, why the two versions (advantages/disadvantages)?

df2 <- merge(df2, df1, by = "ID", all = FALSE, sort=FALSE)

By testing the two versions in data frames with different number of rows (df1 with 100.000 rows and df2 with 98.530), where df1 only has ID and Pop columns but df2 has over 4000 columns, the first version gives me a results instantaneously, while the merge version takes about 8 seconds to run. I am new to R so I don't even know how to test the outputs and check if they are the same, but should they be?

MeloMCR
  • 402
  • 1
  • 6
  • 16

1 Answers1

0

Comparison operators, == in this case, return a logical vector (A vector with TRUE and FALSE as entries) -regardless of whether its in a data frame index or not. It takes the two objects, and accesses (position by position) whether the values in those positions are equal. If they are, then it returns TRUE in a new object at the same position if not, then FALSE. Subsequently, passing this logical vector to df1 via the square brackets [], leads df1 to be subsetted, returning only the values associated with the positions where the logical vector is TRUE. Note that df1$Pop is a vector. df1 is a data frame.

By way of example

V1 <- c(1,2,3)
V2 <- c(1,3,2)
V1==V2
## TRUE FALSE FALSE
V1[V2==V1]
## 1
V1[c(TRUE,FALSE,FALSE)]
## 1

In terms of your example, all you are doing is subsetting the data frame, returning only the rows where the IDs are the same, grabbing the $Pop row and putting it in df2$Pop.

This is different from merge. Merge takes two data.frames, and to do anything meaningful, the data frames should have at least two columns each. It then looks up the rows of one data frame (using the by column) and pastes them next to the rows of another. In SQL speak, it performs JOIN.

For example:

df1<- data.frame(ID=1:3,V1=letters[1:3])
df1
##  ID V1
##1  1  a
##2  2  b
##3  3  c
df2<- data.frame(ID=c(1,3),V2=letters[3:4])
df2
##  ID V2
##1  1  c
##2  3  d
merge(df1,df2,by="ID")
##  ID V1 V2
##1  1  a  c
##2  3  c  d
df1[df1$ID==df2$ID,]$V1
##[1] a

In general, to test the results of the two expressions, take the results of the expressions and use the equality comparison operator:

V1 == V2
## TRUE FALSE FALSE

Or use all() to get a single result.

all(V1==V2)
## FALSE

There are other ways to test if objects are the same. See this stackoverflow answer.

Community
  • 1
  • 1
mgriebe
  • 908
  • 5
  • 8