
I've tried a few things and haven't been able to make any noticeable improvement. I've isolated the problem to the for loop, and hopefully someone can give me some ideas.

Quick background: the process was in SAS, and SAS is going away. We take two datasets, merge them by Zip, and then compare strings with Jaro-Winkler to identify entities that could be the same across the two sets. E.g., is Zip the same? Yes, so move to the next criterion. Is X the same? No, but it's close, and so on; anything that doesn't meet the criteria is thrown out and the matches go into a new data frame. This part takes no time at all in R.

We can't just do a straight merge, since that uses all 128 GB of RAM on our EC2 instance.

So we loop row by row and keep only the rows that fall within our match criteria; however, the speed at which it loops is a huge issue. I'm fairly new to R, so I could be missing something absurdly obvious.
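
For context, the core comparison is the Jaro-Winkler similarity from the stringdist package. A minimal sketch with made-up strings (not our real data):

library(stringdist)

# Jaro-Winkler similarity = 1 - distance; p = 0.1 adds a bonus for a shared
# prefix, so near-identical names score close to 1
1 - stringdist("HAPPY COMPANY", "HAPPY COMPANY LLC", method = "jw", p = 0.1)  # high, close to 1
1 - stringdist("HAPPY COMPANY", "SAD COMPANY",       method = "jw", p = 0.1)  # noticeably lower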

DataSet1

structure(list(PPOppID = c("785041315", "829852094", "854136412", 
"787141118"), BusinessName = c("HAPPY COMPANY", 
"SAD COMPANY", "HORRIBLE COMPANY", "MILDLY UPSET COMPANY"
), StreetName = c("HAPPY TRAIL", "SAD TRAIL", "HORRIBLE TRAIL", 
"MILDLY UPSET TRAIL"), City = c("TOWNA", "TOWNB", "TOWNC", "TOWND"
), State = structure(c(52L, 52L, 52L, 52L), .Label = c("AK", 
"AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", "GA", "HI", 
"IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME", "MI", 
"MN", "MO", "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM", "NV", 
"NY", "OH", "OK", "OR", "PA", "PR", "RI", "SC", "SD", "TN", "TX", 
"UT", "VA", "VT", "WA", "WI", "WV", "WY"), class = "factor"), 
    Zip = c("12345", "12345", "12345", "12345"), street_num = c("1435", 
    "110", "105", "875"), street_pre_direction = c("S", "E", 
    "", ""), street_name = c("HAPPY TRAIL", "SAD TRAIL", "HORRIBLE TRAIL", "MILDLY UPSET TRAIL"
    ), suffix = c("RD", "ST", "", ""), streetdirection = c("", 
    "", "", ""), unit = c("", "STE", "", ""), unit = c("", 
    "202", "", ""), po_box = c("", "", "", ""), phone = c("1234567891", 
    "3231234543", "9876543219", "9087653456"), state_code = c("56029", 
    "56021", "56013", "56043"), name = c("", "", "", 
    "PERSONA"), FirstName = c("", "", "", "K"), contact_middle = c("", 
    "", "", ""), LastName = c("", "", "", "LAST_NAME"), contact_title = c("", 
    "", "", "OWNER"), year_started = c("1996", "2005", 
    "1993", "1981"), number_of_employees = c("3", "5", 
    "5", "5"), size_code = c("A", "B", "B", "B"), type = c("C", 
    "C", "C", "C"), industry_class = c("", "", 
    "A", "A"), code= c(NA, NA, "139", "119"
    ), code2 = c(NA, NA, "111100", "111100"), annual_amount = c("289", 
    "378", "0", "306"), annual_amount2 = c("A", "A", 
    "A", "A"), acc002 = c("3", "1", "1", "0"), brc014 = c("0", 
    "0", "0", "0"), t055 = c("1", "0", "1", "0"), t068 = c("2", 
    "2", "1", "1"), loaddate = c("2018-05-25", "2018-05-25", 
    "2018-05-25", "2018-05-25"), DBAName = c("HAPPY COMPANY", 
    "SAD COMPANY", "HORRIBLE COMPANY", "MILDLY UPSET COMPANY"
    ), Phone = c(NA_character_, NA_character_, NA_character_, 
    NA_character_), PhoneAreaCode = c(NA_character_, NA_character_, 
    NA_character_, NA_character_), StreetNum = c("1435", "110", 
    "105", "875")), class = c("data.table", "data.frame"), row.names = c(NA, -4L)

DataSet2

structure(list(rn = c("1", "2", "3", "4"), Id = c("0000000000abcG7MAI", 
"000C000000abcg9MAA", "000QC000000abcG9MAI", "000C000000abcGaMAI"
), NumId = c("187639087", "237893456", "923785629", "298777656"
), BusinessName = c("HAPPY COMPANY", "K&W PHARMACY INC", 
"SCOTTISH INN", "CORY SMITH STUDIOS"), DBABusinessName = c("", 
"", "", ""), Phone = c("123456789", "987654321", "9999999999", 
"6086892577"), PhoneAreaCode = c("999", "123", "456", "678"), 
    FirstName = c("SAM", "KYLE", "TONY", "MIKE"), LastName = c("SAM", 
    "SMITH", "TRAVIS", "JOHNSON"), StreetNum = c("7585", "170", 
    "2457", ""), StreetName = c("Avoderm WAY", "Blue Buffalo U", "Farmina BLVD", 
    "PO BOX 0"), State = c("NJ", "NY", "AK", "PR"), City = c("P", 
    "X", "X", ""), Zip = c("19425", "08765", 
    "37355", "54632")), class = c("data.table", "data.frame"), row.names = c(NA,-4L) 

Code

library(dplyr)       # inner_join
library(stringdist)  # Jaro-Winkler string distances

for (row in seq_len(nrow(df1))) {

  # Join one row of df1 against df2 on Zip so the full cross product never
  # has to be held in memory at once
  joined <- inner_join(df1[row, ], df2, by = "Zip", suffix = c(".x", ".y"))
  joined[] <- lapply(joined, as.character)

  # Jaro-Winkler similarity (1 - distance) between the two sources; columns
  # shared by both datasets carry dplyr's default .x (df1) / .y (df2) suffixes
  joined$MatchBizName <- pmax(1 - stringdist(joined$BusinessName.x, joined$BusinessName.y,  method = "jw", p = 0.1),
                              1 - stringdist(joined$DBAName,        joined$DBABusinessName, method = "jw", p = 0.1),
                              1 - stringdist(joined$BusinessName.x, joined$DBABusinessName, method = "jw", p = 0.1),
                              1 - stringdist(joined$DBAName,        joined$BusinessName.y,  method = "jw", p = 0.1))
  joined$MatchPhone      <- ifelse(1 - stringdist(joined$Phone.x, joined$Phone.y, method = "jw", p = 0.1) >= 1, 1, 0)
  joined$MatchFirstName  <- 1 - stringdist(joined$FirstName.x,  joined$FirstName.y,  method = "jw", p = 0.1)
  joined$MatchLastName   <- 1 - stringdist(joined$LastName.x,   joined$LastName.y,   method = "jw", p = 0.1)
  joined$MatchStreetNum  <- 1 - stringdist(joined$StreetNum.x,  joined$StreetNum.y,  method = "jw", p = 0.1)
  joined$MatchStreetName <- 1 - stringdist(joined$StreetName.x, joined$StreetName.y, method = "jw", p = 0.1)
  joined$MatchCity       <- 1 - stringdist(joined$City.x,       joined$City.y,       method = "jw", p = 0.1)

  # Keep only the pairs that satisfy at least one of the match rules
  df1matches <- subset(joined, (MatchBizName >= 0.9 & MatchCity == 1 & MatchStreetName >= 0.7 & MatchStreetNum >= 0.7) |
                         (MatchPhone == 1 & MatchFirstName == 1 & MatchLastName == 1 & MatchStreetNum == 1 & MatchStreetName == 1 & MatchCity == 1) |
                         (MatchBizName >= 0.9 & MatchStreetNum == 1 & MatchStreetName >= 0.9 & MatchCity == 1) |
                         (MatchStreetNum == 1 & MatchPhone == 1 & MatchStreetName == 1 & MatchCity == 1) |
                         (MatchLastName >= 0.9 & MatchPhone == 1 & MatchBizName >= 0.9 & MatchCity == 1) |
                         (MatchPhone == 1 & MatchCity == 1 & MatchStreetNum == 1 & MatchStreetName >= 0.9 & MatchBizName >= 0.6))

  # df1matches is written out to disk in the next block, then dropped to free RAM
  rm(joined)
  rm(df1matches)
}
DataDog
  • A few observations: 1) For me, the code as provided is somewhat difficult to work with; this is partly due to a copying issue (i.e. a missing `)`), partly because required packages aren't listed (i.e. `library(stringdist)`), and partly because the for loop provided returns errors. Is this strictly a performance issue, or are you looking for a functional looping procedure and this is your first swing? 2) There is surely a better way to approach the issue. Providing the desired output (even if it is a mocked data frame) would be helpful. – Peter_Evan Jul 01 '18 at 16:52
  • 4
    Two tips: 1) `ifelse(condition, 1L, 0L)` is 100 **times** slower than `as.integer(condition)`; 2) `subset(df, condition)` is around 12 **percent** slower than `df[condition, ]` (logical index). – Rui Barradas Jul 01 '18 at 17:09
  • @RuiBarradas if we don't remove the df we use all the RAM. The next block of code writes the results out to disk, which isn't the bottleneck. We can't merge or keep everything loaded since we run out of RAM, and to get around the RAM issue we need to loop (which is slow). It's about 900 million rows in total. – DataDog Jul 01 '18 at 17:17
  • @Peter_Evan It's strictly performance. The code "works" and gets the desired output for the model; however, we're going from running the total dataset in 1 hour to not being able to run one full geographic region at all. – DataDog Jul 01 '18 at 17:26
  • I think I am missing something here - I can't reproduce results with the code provided (maybe an issue on my end?). If you haven't already, it might be good to explore packages for memory profiling, such as `library(lineprof)`, to see if that helps you spot the bottleneck. – Peter_Evan Jul 01 '18 at 17:38
  • 3
    I would start by reading [the great answers to this question](https://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r) – p0bs Jul 01 '18 at 17:41
  • Ok, some success from that thread. I calculated the size of the results prior to the for loop and it actually finished the entire geographic region! That wasn't possible before. Sadly, this is only 14 million of the 900 million rows that need to be processed, but it's a step in the right direction. – DataDog Jul 01 '18 at 21:15
  • 1
    @Kyle: Please update your post with your new code so others can continue to help while new users could learn from your experience. Thanks! – Tung Jul 01 '18 at 23:47
  • @Kyle: see also these great posts: [Efficient accumulation in R](http://winvector.github.io/Accumulation/) & [Applying a function over rows of a data frame](https://rpubs.com/wch/200398) – Tung Jul 01 '18 at 23:49
  • 1
    @Tung I'll update when something actually sticks. At this point it's two steps forward and two steps back. I'm basically forced into using for loops to prevent RAM bottlenecks but nothing I've seen so far will actually solve the for loop issue. Most of the examples in the above links are geared around computations of numeric data which isn't a huge help unfortunately. – DataDog Jul 02 '18 at 17:12
  • @Kyle: you can post your question on this forum to get more help https://community.rstudio.com/ – Tung Jul 02 '18 at 17:15
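
For reference, a minimal sketch of the two micro-optimisations Rui Barradas suggests above; the data frame `dat` and the 0.9 threshold are made up purely for illustration:

library(stringdist)

# toy data: two name columns to compare
dat <- data.frame(a = c("HAPPY COMPANY", "SAD COMPANY"),
                  b = c("HAPPY COMPANY LLC", "GLAD COMPANY"),
                  stringsAsFactors = FALSE)

sim <- 1 - stringdist(dat$a, dat$b, method = "jw", p = 0.1)

# 1) as.integer(condition) instead of ifelse(condition, 1L, 0L)
dat$Match <- as.integer(sim >= 0.9)

# 2) logical index instead of subset()
matches <- dat[sim >= 0.9, ]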
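
And an outline of the preallocation idea from the thread p0bs links, along the lines of what DataDog describes trying; `process_row()` is a hypothetical helper standing in for the per-row join-and-match step in the question:

# Preallocate a list sized to the number of rows instead of growing an object,
# then combine once at the end (or write each chunk to disk inside the loop)
results <- vector("list", nrow(df1))

for (row in seq_len(nrow(df1))) {
  results[[row]] <- process_row(df1[row, ], df2)   # hypothetical per-row matcher
}

all_matches <- data.table::rbindlist(results)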

0 Answers