Using Merge in R to replicate Vlookup

Question

This kind of helped: How to do vlookup in R

Problem: I have a list of machine numbers in the database and that need to have a machine rate associated with them (e.g. $20.00). In a CSV (machine_rates.csv) file, I have a list of those machine numbers with the associated machine rate (columns A & B, respectively).

I've tried using MERGE for this but for some reason it creates a lot of NA's throughout the dataframe even though I have the all.x = TRUE. It almost seems like if a machine # doesn't show up for that row, it turns the whole row into NA's. SO this leads me to believe I am not understanding the MERGE function correctly (read through many posts trying to find the equivalent of a vlookup in R).

So here below, I tried to create a new dataframe by the merge but when merging, how do you tell it to create a new column to place those merged machine rates?

dBase = dbReadTable(conn, "Mfng_Data")
mBase = read.csv("Machine_Rates.csv")

dBase2 = merge(dBase, mBase, by.x = "machine_number", by.y = "machine_number",
               all.x = TRUE)

Edit:
Is there a way to get around listing all of the items out? dBase contains about a million records (around 1m rows x 70c matrix). So if there are 150 different machine rates, would I have to list all of those out or is it possible to "index" those values in the CSV by matching the machine number in mBase to the machine number in dBase ?

akaDrHouse · Accepted Answer · 2017-06-27T17:37:06.800

3

A dplyr solution.

library(dplyr)
dbase <- data.frame(machine_number = c("10","20","30","10","10","50"),
                second_attribute=c("a","b","c","c","a","d"))
mbase <- data.frame(machine_number = c("10","20","30","40","50","60","70","80","90","100"),
                    rate=c(22,22,25,17,15,15,55,12,15,19))

left_join(dbase, mbase, by = "machine_number") 

  machine_number second_attribute rate
1             10                a   22
2             20                b   22
3             30                c   25
4             10                c   22
5             10                a   22
6             50                d   15

edited Jun 27 '17 at 17:37

answered Jun 27 '17 at 17:20

akaDrHouse

2,190
2
20
29

Sorry I don't mean to be a pain but if I am understanding the referenced SO post below, I need to create sub data frames out of the main data frame? So something like this: `dBase1 = data.frame(machine_number) mBase1 = data.frame(machine_number) dBaseNew = merge(x = dBase1, y = mBase1, by = "machine_number", all.x = TRUE) ` Then this should give a new "dBase" with 1 more column showing the machine rates per machine number, correct? – Aspiring Developer Jun 27 '17 at 18:55
1

The dbase and mbase data.frames only need to have a single column in common; in your case, it would be `machine_name`. They can have as many other columns of information as you want. Then you would have `mymerged_df <- left_join(dbase,mbase,by="machine_number")`. You can remove all columns from `mbase` first except for `machine_name` and whatever the `rate` column is called before you do the join. – akaDrHouse Jun 27 '17 at 18:58

alaybourn · Answer 2 · 2017-06-27T18:16:39.570

If you're trying to do the equivalent of an inner join, try removing the all.x argument. It seems like you're looking for a left join, which is what you've already tried. Check your Mfing_data. They might be the source of those mysterious NA's. Also, if the merge column has the same name in each data frame you can leave out the by argument.

dbase<-data.frame(machine_number=c("10","20","10","30","25"),stringsAsFactors = F)

mbase<-data.frame(machine_number=c("10","20","30","40"),machine_rate=c(32,65,12,22), stringsAsFactors = F)

merge(dbase,mbase,all.x = T)


  machine_number machine_rate
1             10           32
2             10           32
3             20           65
4             25           NA
5             30           12

score 0 · Answer 3 · answered Jun 27 '17 at 17:35

Another option for when you only have a limited number of items you want to match.

dbase <- data.frame(machine_number = c("10","20","30","10","10","50"),
                second_attribute=c("a","b","c","c","a","d"))

Notice that for this method, the machine number is no longer defined as a number.

You can define a small lookup vector as follows:

lookup <- c("10"=22, "20"=22, "30"=25, "50"=15)

Then you can directly add in the values to your first data frame with the following:

dbase$rate <- sapply(dbase[,1], function(x) unname(lookup[x]))
dbase
  machine_number second_attribute rate
1             10                a   22
2             20                b   22
3             30                c   25
4             10                c   22
5             10                a   22
6             50                d   15

The sapply takes the first column of dbase, and conducts the lookup to the lookup object we defined.

Is there a way to get around listing all of the items out? dBase is one that contains about a million records (around 1m rows x 70c matrix). So if there are 150 different machine rates, would I have to list all of those out or is it possible to "index" those values in the CSV by matching the machine number in mBase to the machine number in dBase ? — Aspiring Developer, Jun 27 '17 at 17:40
I would just read in the machine rates .csv and do the dplyr solution. This is more a one-off for simply quick lookup. Note, that if machine name is not named identically, you can replace `by = "machine_name"` with `by = c("machine_name" = "mach. name")` Or to expand on base R solutions that @alaybourne mentioned, see https://stackoverflow.com/a/1300618/4001897 — akaDrHouse, Jun 27 '17 at 17:45

Using Merge in R to replicate Vlookup

3 Answers3