Efficiently add new column to df based upon date range

Question

I have a dataset like so:

> head(featured_products)
   Dept Class     Sku                    Description Code Vehicle/Placement  StartDate    EndDate  Comments(Circulation,Location,etc)
1:  430  4318  401684          ++INDV RAMEKIN WP 9CM  OSM          Facebook 2017-01-01 2017-01-29                   Fancy Brunch Blog
2:  430  4318  401684          ++INDV RAMEKIN WP 9CM  OSM           Twitter 2017-01-01 2017-01-29                   Fancy Brunch Blog
3:  340  3411 1672605            ++ SPHERE WILLOW 4"  OP1         Editorial 2016-02-29 2016-03-27                Spruce up for Spring
4:  230  2311 2114074 ++BOX 30 ISLAND ORCHRD TLIGHTS   EM             Email 2016-02-17 2016-02-17 Island Orchard and Jeweled Lanterns
5:  895  8957 2118072            ++PAPASAN STL TAUPE  OSM         Instagram 2017-08-26 2017-10-01                    by @audriestorme
6:  895  8957 2118072            ++PAPASAN STL TAUPE   EM             Email 2017-11-23 2017-11-23               Day 2 Black Friday AM

And another like this:

SKU ActivityDate OnlineSalesQuantity OnlineDiscountPercent InStoreSalesQuantity InStoreDiscountPercent
1: 401684   2015-12-01                 150                  0.00                  406                   2.72
2: 401684   2015-12-02                   0                  0.00                  556                   3.79
3: 401684   2015-12-03                   0                  0.00                  723                   3.44
4: 401684   2015-12-04                  16                  4.91                  781                   2.46
5: 401684   2015-12-05                  17                  0.00                  982                   3.18
6: 401684   2015-12-06                   0                  0.00                  851                   3.12

I have added a column to the 2nd df called featured which is 1 if the product was listed in the 1st df for the given date and 0 otherwise.

Now, what I would like to do is also add the Vehicle/Placement column to the new, combined df (when featured == 1)...the issue here is that for different dates there could be different vehicles, or multiple...

How can I scan the date for the featured == 1 rows and compare this to df1 and then extract Vehicle/Placement and add it to the combined df?

This also must be done efficiently because df2 is 2.85 million rows...

I am looking for something along the lines of:

# Add vehicle
if(combined$featured == 1) {
  for (n in 1:nrow(featured_products)) {

    for (m in 1:nrow(combined)) {

      combined$vehicle <- ifelse(combined$activitydate[m] %within% interval(featured_products$startdate[n],featured_products$enddate[n]), featured_products$`vehicle/placement`, NA)

    }
  } 
}

But this produces the error:

Warning message: In if (combined$featured == 1) { : the condition has length > 1 and only the first element will be used

I think I found an alternate solution, but it is EXTREMELY slow and will take hours to run:

# Add vehicle
for (n in 1:nrow(combined)) {

  cat("Searching Featured Products...", "\n")

  data <- featured_products[which(featured_products$sku == combined$sku[n]),]

  cat("Matching dates...", "\n")

  for (m in 1:nrow(data)) {

    combined$vehicle <- ifelse(combined$activitydate %within% interval(data$startdate[m],data$enddate[m]), data$`vehicle/placement`, NA)
  }
}

It sounds like you want to do something like `merge(df1, subset(df2, featured == 1), by = "SKU")`. Your description is not entirely clear, though. It would also be helpful if you `dput` both data.frames, so we can play around with it. — ChrKoenig, Jan 22 '18 at 15:10
@Hav0k, I am limited in the information I can share...please see https://stackoverflow.com/questions/48372955/creating-dummy-variable-based-on-event-date-within-certain-time-period/48373321?noredirect=1#comment83755415_48373321 for more information though — zsad512, Jan 22 '18 at 15:20
Sure you are...I was also talking about the toy data.frames from your example. ;) — ChrKoenig, Jan 22 '18 at 15:24
@Hav0k- please see my edit, I added a little more information...hope its helpful — zsad512, Jan 22 '18 at 15:31

score 0 · Answer 1 · answered Jan 22 '18 at 16:16

You want to brake this up into small, more efficient sub-tasks. We know that R can be quite fast when we use vectorized code, but not when using for loops. We capitalize on this by using the following commands:

combined = merge(combined, featured_products) # merge/join both data frames
mismatch = !(combined$ActivityDate %within% interval(combined$StartDate, combined$EndDate) & combined$featured == 1) # Query rows
combined$Placement[mismatch] = NA # Remove Placement in mismatched rows
combined[,c("StartDate", "EndDate")] = NULL # Remove columns

Note that the column/object names might not be identical to yours, so you might have to adjust them.

Efficiently add new column to df based upon date range

1 Answers1