0

that is how to run a condition for each row in the outer data frame for each row in the inner data frame) I have two data frames :

ToTest

   Origin.State Dest.State  Ship.Date Cost
1            IL         NY 2015-03-25   10
2            IL         NY 2015-03-25   10
3            IL         NY 2015-03-24   10
4            IL         NY 2015-03-23   10
5            IL         NY 2015-03-18   10
6            PA         NY 2015-04-29   10
7            PA         NY 2015-04-29   10
8            PA         NY 2015-04-27   10
9            PA         NY 2015-04-24   10
10           PA         NY 2015-03-01   10
11           IL         TX 2015-05-18   10
12           IL         TX 2015-05-18   10
13           IL         TX 2015-05-14   10
14           IL         TX 2015-05-12   10
15           IL         TX 2015-05-13   10

TestShipmentGroup1

   Origin.State Dest.State  Ship.Date
1            IL         NY 2015-03-25
2            IL         NY 2015-03-24
3            IL         NY 2015-03-23
4            IL         NY 2015-03-18
5            PA         NY 2015-04-29
6            PA         NY 2015-04-27
7            PA         NY 2015-04-24
8            PA         NY 2015-03-01
9            IL         TX 2015-05-18
10           IL         TX 2015-05-14
11           IL         TX 2015-05-12
12           IL         TX 2015-05-13

I am trying to apply the conditions shown below to every row in the ToTest dataframe using each row of TestShipmentGroup1 dataframe at a time.

for (i in 1: nrow(TestShipmentGroup1))
{
TestShipmentGroup1%>%
  select(Origin.State,Dest.State,Ship.Date)
ToTest%>%
  select(Origin.State, Dest.State,Ship.Date,Cost) %>% 
  filter (((ToTest$Ship.Date >= (TestShipmentGroup1$Ship.Date-7)) 
           & (ToTest$Ship.Date < TestShipmentGroup1$Ship.Date))
          & (ToTest$Origin.State == TestShipmentGroup1$Origin.State)
          & (ToTest$Dest.State == TestShipmentGroup1$Dest.State))}
lmo
  • 37,904
  • 9
  • 56
  • 69
ash25
  • 113
  • 1
  • 11

1 Answers1

0

Consider a cross join (returning cartesian product M X N from two sets) using merge without join variables, then apply the filter conditions. Alternatively, an inner join merged on States with filter afterwards also works. But first rename columns to avoid conflict:

library(dplyr)

...

names(ToTest) <- paste0(names(ToTest), "1")
names(TestShipmentGroup1) <- paste0(names(TestShipmentGroup1), "2")

# CROSS JOIN WITH FILTER
finaldf <- merge(select(ToTest, Origin.State1, Dest.State1, Ship.Date1),
                 select(TestShipmentGroup1, Origin.State2, Dest.State2, Ship.Date2)),
                 all=TRUE) %>%
                          filter (((Ship.Date1 >= (Ship.Date2-as.difftime(7, unit="days")))
                                  & (Ship.Date1 < Ship.Date2))
                                  & (Origin.State1 == Origin.State2)
                                  & (Dest.State1 == Dest.State2))

# INNER JOIN WITH FILTER
finaldf <- inner_join(select(ToTest, Origin.State1, Dest.State1, Ship.Date1),
                      select(TestShipmentGroup1, Origin.State2, Dest.State2, Ship.Date2), 
                 by = c("Origin.State1"="Origin.State2", "Dest.State1"="Dest.State2")) %>%
                          filter ((Ship.Date1 >= (Ship.Date2-as.difftime(7, unit="days")))
                                  & (Ship.Date1 < Ship.Date2))
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • It worked Parfait. Thanks! I really appreciate your help. :) – ash25 May 22 '16 at 18:22
  • Is there any more efficient way than merge() to do the same thing because it is taking too long for 20,000 records and I need to match each record from TestShipmentGroup1 to around 3 lacs records in ToTest data? – ash25 May 23 '16 at 16:16
  • What is too long? 5 mins? 1 hr? Try adding a new column, `key=1` to both dataframes and merge with key in `by`. Also look into the data.table package's Cross Join. See this: http://stackoverflow.com/questions/10600060/how-to-do-cross-join-in-r. – Parfait May 23 '16 at 17:01
  • It took around 30-35 mins and I added key=1 and then merge it it ran fast but it showed R ran out of memory. I tried increasing memory but it didn't work. – ash25 May 24 '16 at 21:47
  • I did this : res<-setkey(ToTest[,c(k=1,.SD)],k)[TestShipmentGroup[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]%>% filter ((Ship.Date >= (Ship.Date2-as.difftime(7, unit="days"))) & (Ship.Date < Ship.Date2) & (Origin.State == Origin.State2) & (Dest.State == Dest.State2)) – ash25 May 24 '16 at 21:50
  • Now I am running the same query against a very large data set and the because of the Cartesian product it is running out of memory. Is there any way that we can avoid the Cartesian product and still achieve the same results? Any help is appreciated. – ash25 Jul 14 '16 at 14:23
  • Hmmm...I wonder why I suggested Cross Join as Inner Join or regular merge on the States works, too. See updated code. – Parfait Jul 14 '16 at 15:32
  • When I run it, it gave me this error: Error in select(ToTest, Origin.State1, Dest.State1, Ship.Date1) : unused arguments (Origin.State1, Dest.State1, Ship.Date1) – ash25 Jul 14 '16 at 16:06
  • Did you rename columns in `ToTest` and `TestShipmentGroup1` per the first two lines? – Parfait Jul 14 '16 at 16:11