1

I have found questions that are similar to what I want to do but not the exact same.

I am working in R. I have two dataframes I want to combine. The only issue is that there are more observations in one dataframe than the other. (The data I have is proprietary so I'll make up some data to show you.) Let's say dataframe A has 450 observations and dataframe B has 500 observations.

Both dataframes have a variable that identifies a unique person. Lets say it’s a social security number. So there exist people who are in both dataframe A and dataframe B. But there are some people who exist in one and not the other. I want to keep the rows of people who are in both dataframes and eliminate the people who are in only one dataframe and not the other. To illustrate with fake data on a smaller scale...

    Dataframe A

        SSID             Age        Wage
    [1]12345             23         45645
    [2]15461             45         534688
    [3]12458             12         475412
    [4]68741             63         124
    [5]36987             91         458746

    Dataframe B

        SSID             Education        Race
    [1]12345             2                8
    [2]15461             3                4
    [3]89512             1                3
    [4]68741             2                7
    [5]99423             0                8
    [6]79225             1                4
    [7]66598             3                2

    Dataframe C (what I want)
        SSID             Age        Wage       Education        Race
    [1]12345             23         45645      2                8
    [2]15461             45         534688     3                4
    [3]68741             63         124        2                7

So only the common rows, pertaining to the SSID variable, are preserved, and everything else is trashed. How can I do this?

I tried doing stuff like C = which(B$SSID %in% A$SSID) but to no avail.

Show_21
  • 13
  • 3

1 Answers1

2

I believe what you're looking for is an inner_join available in the dplyr package:

library(dplyr)

dataframe_c <- inner_join(dataframe_a, dataframe_b, by = "SSID")

Or you can use merge from base:

dataframe_c <- merge(dataframe_a, dataframe_b, by = "SSID", all = FALSE)
Ben G
  • 4,148
  • 2
  • 22
  • 42
  • Hi! SO I did this, but what happened was that my dataframe_c ended up being the same size as the dataframe_b, which shouldn't be the case, right? I expect it to be at most the size of my dataframe_a. – Show_21 Apr 04 '19 at 17:40
  • Strange. Can you post the code you wrote to make your dummy data? General best practice for using this site so that people don't have to remake the data. – Ben G Apr 04 '19 at 17:44
  • Ah, so it's weirdly working for the dummy data but not for the overall data. Poo. I'll mark this as answered anyway since I can't post my own data for this. Thank you. – Show_21 Apr 04 '19 at 17:50
  • The only thing I can think of is that you do in fact have those SSID in dataframe_a. Also possible you have duplicates that are messing you up. – Ben G Apr 04 '19 at 17:52
  • It indeed was. It was a weird foible for my data where a person was counted multiple times for certain variables but not others. So it created several "half-duplicates." I had to bruteforce it for my data specifically but the answer you gave is excellent. Thank you. – Show_21 Apr 04 '19 at 19:34