Extract a string from text based on joined tables using vapply in R

Question

I have a problem solving this real scenario. I have two data tables:

dfEntitiesCountries: a table of 3-letter entities vs their Routing Countries
temp: a big table of shipments each with the following details: a. AWB: identifier of the shipment b. OriginCountry: full name of a shipment origin country c. Route: the entities thru which a shipment has moved separated by a hyphen

dfEntitiesCountries

temp table

I need to figure out the Gateway of each shipment. Gateway is the entity in the Route with Routing Country that comes right after the OriginCountry of the shipment. This means that I need to exclude the first entities in route that correspond to the OriginCountry of the shipment, then take the entity in route whose RountingCountry is different from the OriginCountry.

I wrote the below code. The problem is that it works well if I use it as a standalone function (GetGW). But when I use it in the vapply to apply it on all shipments in temp table I face an error saying:

“Error in [.data.table(dfEntitiesCountries[RouteEntities, on = .(Entity = V1), : i evaluates to a logical vector length 3 but there are 2 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.”

GetGW <- function(RouteEntities,ShptOrgCnt){
  setDT(as.list(RouteEntities))
  GW <-dfEntitiesCountries[RouteEntities,on=.(Entity=V1),nomatch=0][RoutingCountry!=ShptOrgCnt,.(Entity)][[1]][1]%>% as.character()
}

temp[,`:=`(GateWay=vapply(strsplit(Route,"-"),GetGW,"character",ShptOrgCnt=OriginCountry))]

Appreciate any thought to resolve this case using the same method or any other efficient method as I have around 2M shipments in the actual table.

Please do not post (only) an image of code/data/errors: it breaks screen-readers and it cannot be copied or searched (ref: https://meta.stackoverflow.com/a/285557 and https://xkcd.com/2116/). Please include the code, console output, or data (e.g., `data.frame(...)` or the output from `dput(head(x))`) directly. — r2evans, Nov 01 '22 at 12:48
Thanks, @r2evans, for the comment. Acknowledged. I couldn't find a way to include samples of tables other than images. Still new to SO. However, the code is already included in code format. — Mohammed Barakat, Nov 01 '22 at 13:19
Sure. There are several ways to include sample (unambiguous) data in a question, including `dput(x)` and `read.table(text="...")`, both demonstrated among https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. WIth larger datasets, the balance is typically in providing enough data to support the variability needed without flooding the question. (Make sure to paste it within a [code block](https://stackoverflow.com/editing-help), at least it'll be contained.) — r2evans, Nov 01 '22 at 13:38
From @lmml: Not a full answer, but would highly recommend looking at tidyverse's [stringr package](https://stringr.tidyverse.org/) which has a lot of tools that could help with this. Combine with [dplyr::across()](https://dplyr.tidyverse.org/reference/across.html) and/or [dplyr::rowwise()](https://dplyr.tidyverse.org/articles/rowwise.html) to apply to your whole dataset. — r2evans, Nov 01 '22 at 15:40

score 1 · Accepted Answer · answered Nov 01 '22 at 13:58

Step 1: identify the first abbreviation of each country:

dfEntitiesCountries[dfEntitiesCountries, Entity2 := first(i.Entity), on = .(RoutingCountry), by = .EACHI]
dfEntitiesCountries
#    Entity RoutingCountry Entity2
#    <char>         <char>  <char>
# 1:    AMM              J     AMM
# 2:    DXB            UAE     DXB
# 3:    RUH             SA     JED
# 4:    DOH              Q     DOH
# 5:    LON             UK     LON
# 6:    JED             SA     JED

Step 2: form a temp2 that calculates the gateway for each AWB:

temp2 <- copy(temp)[dfEntitiesCountries, Orig := i.Entity2, on = .(OriginCountry = RoutingCountry)
  ][, strsplit(Route, "-"), by = .(AWB, Orig)
  ][dfEntitiesCountries, Gateway := i.Entity2, on = .(V1 = Entity)
  ][Orig != Gateway,
  ][, .SD[1,], by = .(AWB)]

Step 3: bring this back into the original temp:

temp[temp2, Gateway2 := i.Gateway, on = .(AWB)]
temp
#      AWB OriginCountry       Route GateWay Gateway2
#    <num>        <char>      <char>  <char>   <char>
# 1:  1111             Q     DOH-LON     LON      LON
# 2:  2222            SA RUH-JED-DXB     DXB      DXB
# 3:  3333           UAE DXB-AMM-RUH     AMM      AMM

Data

dfEntitiesCountries <- setDT(structure(list(Entity = c("AMM", "DXB", "RUH", "DOH", "LON", "JED"), RoutingCountry = c("J", "UAE", "SA", "Q", "UK", "SA")), row.names = c(NA, -6L), class = c("data.table", "data.frame")))
temp <- setDT(structure(list(AWB = c(1111, 2222, 3333), OriginCountry = c("Q", "SA", "UAE"), Route = c("DOH-LON", "RUH-JED-DXB", "DXB-AMM-RUH"), GateWay = c("LON", "DXB", "AMM")), row.names = c(NA, -3L), class = c("data.table", "data.frame")))

Amazing! That's heck of a slick solution. Thanks, @r2evans. Took me some time to decipher your piece of code and tweak it to fit my project. It worked perfectly. It took around 1 min to process 1.2M shipments! — Mohammed Barakat, Nov 02 '22 at 11:32

Extract a string from text based on joined tables using vapply in R

1 Answers1