Vlookup with data.table in R

Question

First, let's start with DataTable 1 (DF1) :

          date id sales cost city  name
 1: 06/19/2016  1   149  101  MTL Bank1
 2: 06/20/2016  1   150  102  MTL Bank1
 3: 06/21/2016  1   151  104  MTL Bank1
 4: 06/22/2016  1   152  107  MTL Bank1
 5: 06/23/2016  1   155   99  MTL Bank1
 6: 06/19/2016  2    84   55   NY Bank2
 7: 06/20/2016  2    83   55   NY Bank2
 8: 06/21/2016  2    80   56   NY Bank2
 9: 06/22/2016  2    81   57   NY Bank2
10: 06/23/2016  2    97   58   NY Bank2

library(data.table)
DF1 <- data.table(c("06/19/2016", "06/20/2016", "06/21/2016", "06/22/2016", 
                    "06/23/2016", "06/19/2016", "06/20/2016", "06/21/2016",
                    "06/22/2016", "06/23/2016"),
                  c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                  c(149, 150, 151, 152, 155, 84, 83, 80, 81, 97),
                  c(101, 102, 104, 107, 99, 55, 55, 56, 57, 58),
                  c("MTL", "MTL", "MTL", "MTL", "MTL", "NY", "NY", 
                    "NY", "NY", "NY"))
colnames(DF1) <- c("date", "id", "sales", "cost", "city")

I want to add the column name shown above using the lookup table:

   id  name start_date   end_date status
1:  1 Bank1 06/19/2016 06/25/2016      0
2:  2 Bank2 06/27/2016 06/27/2017      0
3:  3 Bank3 06/22/2016 06/24/2017      1
4:  4 Bank3 06/23/2016 12/23/2016      1

lookup <- data.table(c(1, 2, 3, 4),
                     c("Bank1", "Bank2", "Bank3", "Bank3"),
                     c("06/19/2016", "06/27/2016", "06/22/2016", "06/23/2016"),
                     c("06/25/2016", "06/27/2017", "06/24/2017", "12/23/2016"),
                     c("0", "0", "1", "1"))
colnames(lookup) <- c("id", "name", "start_date", "end_date", "status")

In that case, I would use the id to find the name. When I try merge, I always have new rows in DF1 that contains NA.

For your lookup, do you need start and end dates and status or are id and name the sufficient columns? — lmo, Jun 27 '16 at 20:47
id and name are the sufficient columns, but I don't know all the columns before running the script but I am certain that we will have a column id and a column date. — Alexis, Jun 27 '16 at 21:01
Anyway, I think you can/should just do `DF1[lookup, on="id", name := i.name]` — Frank, Jun 27 '16 at 22:10
@MaxMolina It's a special syntax available in j when doing `x[i, on=, j]` mentioned under the `j` argument in the doc that pops up when you type `?data.table`. There's also a `x.` prefix. In this case, it's useful to ensure that we're pulling the column from `i=lookup` rather than from a preexisting column `name` that's already in `x=DF1`. There's more on how this syntax might be useful here: https://stackoverflow.com/q/54312225 — Frank, Mar 30 '20 at 04:51

score 4 · Accepted Answer · answered Jun 27 '16 at 21:01

4

DF1<-merge(DF1, lookup[,.(id, name)], by='id', all.x=TRUE, all.y=FALSE)

I think the merge command is what you are looking for here, but you were missing the all.y = FALSE bit. What we are doing here is merging DF1 with the lookup data table and we are telling R to include all of the rows in x, but only the rows in y that match up with the rows in x where x is DF1 and y is lookup. The lookup[,.(id, name)] means we only want the column id (to match with DF1) and the column 'name'. If there were rows in DF1 that did not have a matching row in lookup, they would show up as NA's because of the all.x=TRUE.

answered Jun 27 '16 at 21:01

Austin

326
1
5

If I put `allow.cartesian = TRUE`, I will have 145368 rows when the initial number of rows was 83428. – Alexis Jun 27 '16 at 21:21
When I copy and pasted your first two code segments along with my answer, it works for me. Are you willing to share your actual code here? and did you run my merge command after library(data.table) ? – Austin Jun 27 '16 at 21:25
It worked also for me, but when I tried on my the real dataset, I got this error – Alexis Jun 27 '16 at 21:27
Are there situations where two bank names share an id? – Austin Jun 27 '16 at 21:28
If your lookup table has one row with id: 2 and name: BANK2 and another row with id:2 and name: bank2 then you will get your error. If this is the problem, you should clean your data. I would lower case all bank names with x<-tolower(x) and probably potentially remove spaces x<-gsub(" ", "", x) then run DF1<-merge(DF1, unique(lookup[,.(id, name)]), by='id', all.x=TRUE, all.y=FALSE). This will get complicated quickly if your data has things like bank1, bank1corp, bank1co, bank1corporation, et cetera but you can most likely attack it with lots of gsub() statements – Austin Jun 27 '16 at 21:40
I had to use a part of the answer of @lmo. If I use `unique(lookup[, .(id, name)])`, it works. – Alexis Jun 28 '16 at 13:40

score 2 · Answer 2 · answered Jun 27 '16 at 21:48

Here is a slightly different setup: I'm assuming that id is unique for both data sets and that the lookup data contains all of ids present in your main dataset.

I subset lookup to just the id and name. To assure that there are no repetitions (multiple dates) in the lookup, I use unique.

DF1[unique(lookup[, .(id, name)]), on="id"][!is.na(sales)]

    id       date sales cost city  name
 1:  1 06/19/2016   149  101  MTL Bank1
 2:  1 06/20/2016   150  102  MTL Bank1
 3:  1 06/21/2016   151  104  MTL Bank1
 4:  1 06/22/2016   152  107  MTL Bank1
 5:  1 06/23/2016   155   99  MTL Bank1
 6:  2 06/19/2016    84   55   NY Bank2
 7:  2 06/20/2016    83   55   NY Bank2
 8:  2 06/21/2016    80   56   NY Bank2
 9:  2 06/22/2016    81   57   NY Bank2
10:  2 06/23/2016    97   58   NY Bank2

This is called a left join. I use [!is.na()] at the end to drop banks 3 and 4 which don't have an observation in the main data set.

Thank you. By using a part of your solution and the solution of @Austin, I am able to get my answer. — Alexis, Jun 28 '16 at 13:42

Vlookup with data.table in R

2 Answers2

Linked