0

I'm working with a dataframe possessing over 7000 observations where each respondent has been allocated a numeric value that identifies their geographical location.

#DF 1

USER_ID   Col2   ...   NumIdentifier
  45        4                101
  12        9                98
  97        19               7
  11        3                104
  54        1                109
  2         23               110
  ...       ...              ...

Now I have been provided with additional information (Var1, Var2) that needs to be allocated to only some of the respondents based on this numeric geographic identifier.

#DF 2

NumIdentifer   Var1   Var2
   101          13     20
   104          16     87
   109          34     21
   ...         ...     ...

'DF 2' contains one row per numeric geographic identifier and contains a smaller subset of geographic identifiers than are present in 'DF 1'. There are around 30 rows in 'DF 2'.

As a first step I have constructed a new dataframe from 'DF 1' that only includes the respondents with the numeric identifier present in 'DF 2'.

#DF 3
USER_ID   Col2   ...   NumIdentifier
  45        4                101
  11        3                104
  54        1                109
  ...       ...              ...

The ideal output I am aiming for would look like this. If the numerical identifier in 'DF 2' equals the numerical identifier in 'DF 3' then Var1 and Var2 for each corresponding row would be inserted.

#DF 3
USER_ID   Col2   ...   NumIdentifier   Var1   Var2
  45        4                101        13     20
  11        3                104        16     87
  54        1                109        34     21
  ...       ...              ...        ...   ...

Having tried unsuccessfully to transfer some of the techniques from these two posts:

  • "r - matching of two dataframes and insertion of rows from one in another on condition"
  • "R - replace values in dataframe based on two matching conditions"

I was wondering if there were some other resources or original insights that might be of help. The methods in these two posts seem to only be effective when making matches and conditional replacements between dataframes with an equal number of rows.

ap4207
  • 3
  • 2
  • See ```?merge()```. For me, it was better to start with ```install.packages('dplyr')``` and then use ```inner_join()```. – Cole Oct 12 '19 at 03:51

1 Answers1

0

There are 3 approaches to this.

  1. Use merge function (native)

  2. Use merge in dplyr

  3. Use sqldf library

My preference would be native/dplyr as sqldf actually converts your dataframes to a SQLite db and hence requires additional memory.

> df1 <- data.frame("NumIdentifier" = c(101,98,7,104,109,11), "USER_ID" = c(45,12,97,11,54,2), "Col2" = c(4,9,19,3,1,23))
> df1
  NumIdentifier USER_ID Col2
1           101      45    4
2            98      12    9
3             7      97   19
4           104      11    3
5           109      54    1
6            11       2   23

> df2 <- data.frame("NumIdentifier" = c(101,104,109), "Var1" = c(13,16,34), "Var2" = c(20,87,21))

> df2
  NumIdentifier Var1 Var2
1           101   13   20
2           104   16   87
3           109   34   21

Merge() function in dplyr

df3 <- merge(x = df1, y = df2, by = "NumIdentifier", all.y = TRUE)
> df3
  NumIdentifier USER_ID Col2 Var1 Var2
1           101      45    4   13   20
2           104      11    3   16   87
3           109      54    1   34   21

sqldf

> library(sqldf)
> df4 <- sqldf("SELECT * FROM df2 LEFT JOIN df1 USING(NumIdentifier)")
> df4
  NumIdentifier Var1 Var2 USER_ID Col2
1           101   13   20      45    4
2           104   16   87      11    3
3           109   34   21      54    1
dswdsyd
  • 576
  • 4
  • 12
  • ```merge()``` is a base function. The equivalent in ```dplyr``` would be ```library(dplyr); left_join(df1, df2, by = 'NumIdendifier')``` – Cole Oct 12 '19 at 13:51
  • sqldf can use an out of memory database if you use the dbase= argument. In that case it can handle problems larger than R can. – G. Grothendieck Oct 14 '19 at 01:45