Join data frames with ranges

Question

I have two data-frames

One data-frame has a column 'zip codes' which has the complete zip codes. It also has several other columns like shop names, etc.

The 2nd one has a column called zip codes range which just has the range for zip codes for each city.

How do I join these two data-frames on zip codes such that I can add the correct cities to the dataframe 1?

I can think of the nested for loop and comparing the max/min of each range to the zip codes in the 2nd data-frame. But that is taking a long time to run ~ 100mill comparisons

Edit: Data frame 1: This one only has zip codes. I want the cities to be populated here.

| Shop names            | Zip Codes |
|-----------------------|-----------|
| Bergin and botts      | 029888    |
| WW and Co             | 100397    |
| Higgin Bothams        | 100430    |
| Bertie's Beans        | 100459    |
| Leaky Cauldron        | 310283    |
| Pet Peeves            | 310330    |
| Lucy's coffee shop    | 910345    |
| Dream cathers         | 465250    |
| Dragon supplies       | 479187    |
| SLUG AND   JIGGER'S   | 934464    |
| FLOURISH AND BLOTTS.  | 937833    |
| MADAM MALKIN'S ROBES  | 931283    |

Dataframe2: This one has the zip code ranges and the corresponding cities.

| City   | Zip ranges    |
|----------------|---------------|
| braavos        | 029918-100290 |
| highgarden     | 100389-100440 |
| vale           | 200410-219000 |
| dorne          | 310229-367890 |
| storms end     | 389032-567000 |
| king's landing | 601000-898000 |
| winterfell     | 910230-940200 |

I created some sample data here. the original data has around a million rows for dataframe1 and 5k rows for dataframe2. Hence the for loop logic will be very cumbersome.

Appreciate any help!

dput(df1)

structure(list(ï...Shop.names = c(" Bergin and botts      ", 
" WW and Co             ", " Higgin Bothams        ", " Bertie's Beans        ", 
" Leaky Cauldron        ", " Pet Peeves            ", " Lucy's coffee shop    ", 
" Dream cathers         ", " Dragon supplies       ", " SLUG AND   JIGGER'S   ", 
" FLOURISH AND BLOTTS.  ", " MADAM MALKIN'S ROBES  "), Zip.Codes = c("29888", 
"100397", "100430", "100459", "310283", "310330", "910345", "465250", 
"479187", "934464", "937833", "931283")), class = "data.frame", row.names = c(NA, 
-12L))

dput(df2)

structure(list(ï...City = c(" braavos        ", " highgarden     ", 
" vale           ", " dorne          ", " storms end     ", " king's landing ", 
" winterfell     "), Zip.ranges = c(" 029918-100290 ", " 100389-100440 ", 
" 200410-219000 ", " 310229-367890 ", " 389032-567000 ", " 601000-898000 ", 
" 910230-940200 ")), class = "data.frame", row.names = c(NA, 
-7L))

could you [use `dput`](https://stackoverflow.com/questions/49994249/example-of-using-dput), to supply a sample of both datasets? — Waldi, Oct 01 '20 at 18:32
Thanks for your feedback : this is clear and solution is straight forward. However, I need the result of `dput` which is a text structure object to be directly copied in console in order to recreate both dataframes. Recreating them manually would be too tedious — Waldi, Oct 02 '20 at 05:35

Waldi · Answer 1 · 2020-10-02T07:47:06.010

With data.tables :

df <- structure(list(Shop.names = c(" Bergin and botts      ", 
                                  " WW and Co             ", " Higgin Bothams        ", " Bertie's Beans        ", 
                                  " Leaky Cauldron        ", " Pet Peeves            ", " Lucy's coffee shop    ", 
                                  " Dream cathers         ", " Dragon supplies       ", " SLUG AND   JIGGER'S   ", 
                                  " FLOURISH AND BLOTTS.  ", " MADAM MALKIN'S ROBES  "), Zip.Codes = c("29888", 
                                                                                                       "100397", "100430", "100459", "310283", "310330", "910345", "465250", 
                                                                                                       "479187", "934464", "937833", "931283")), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                     -12L))
dfranges <- structure(list(City = c(" braavos        ", " highgarden     ", 
                                        " vale           ", " dorne          ", " storms end     ", " king's landing ", 
                                        " winterfell     "), Zip.ranges = c(" 029918-923004 ", " 100389-100440 ", 
                                                                            " 200410-219000 ", " 310229-367890 ", " 389032-567000 ", " 601000-898000 ", 
                                                                            " 910230-940200 ")), class = "data.frame", row.names = c(NA, 
                                                                                                                                     -7L))
# Extract from-to, convert to numeric
dfranges <- cbind(dfranges,purrr::map_df(stringr::str_split(dfranges$Zip.ranges,"-"),~(data.frame(from=as.numeric(.x[1]),to=as.numeric(.x[2])))))



library(data.table)
setDT(df)
setDT(dfranges)

# convert Zip.Code to numeric
df[,Zip.Codes:=as.numeric(Zip.Codes)]

dfranges[df, .(City,x.from,x.to,Zip.Codes,Shop.names),on = .(from <= Zip.Codes, to >= Zip.Codes)]
#>                 City x.from   x.to Zip.Codes              Shop.names
#>  1:             <NA>     NA     NA     29888  Bergin and botts      
#>  2:  braavos          29918 923004    100397  WW and Co             
#>  3:  highgarden      100389 100440    100397  WW and Co             
#>  4:  braavos          29918 923004    100430  Higgin Bothams        
#>  5:  highgarden      100389 100440    100430  Higgin Bothams        
#>  6:  braavos          29918 923004    100459  Bertie's Beans        
#>  7:  braavos          29918 923004    310283  Leaky Cauldron        
#>  8:  dorne           310229 367890    310283  Leaky Cauldron        
#>  9:  braavos          29918 923004    310330  Pet Peeves            
#> 10:  dorne           310229 367890    310330  Pet Peeves            
#> 11:  braavos          29918 923004    910345  Lucy's coffee shop    
#> 12:  winterfell      910230 940200    910345  Lucy's coffee shop    
#> 13:  braavos          29918 923004    465250  Dream cathers         
#> 14:  storms end      389032 567000    465250  Dream cathers         
#> 15:  braavos          29918 923004    479187  Dragon supplies       
#> 16:  storms end      389032 567000    479187  Dragon supplies       
#> 17:  winterfell      910230 940200    934464  SLUG AND   JIGGER'S   
#> 18:  winterfell      910230 940200    937833  FLOURISH AND BLOTTS.  
#> 19:  winterfell      910230 940200    931283  MADAM MALKIN'S ROBES

^{Created on 2020-10-02 by the reprex package (v0.3.0)}

Note that some of the Zip ranges you supplied overlap, hence two results for the same shop.

Join data frames with ranges

1 Answers1