1

I have been trying to do this for like 2-3 days but still could not find the answer. What I want to do is I have two dataframes x,y (given below a sample of them)

X
     Response.No Tab.No Survey.Date AC.Name Mandal.Name Village.Name
1         9530      1  2015-05-26      NA          NA           NA
2         6702      1  2015-05-30      NA          NA           NA
3        26744      1  2015-05-31      NA          NA           NA
4         8925      1  2015-06-03      NA          NA           NA
5        20242      1  2015-06-04      NA          NA           NA
6        21316      1  2015-06-04      NA          NA           NA
7        28056      1  2015-06-04      NA          NA           NA
8        12661      1  2015-06-05      NA          NA           NA
9        17187      1  2015-06-05      NA          NA           NA
10       28795      1  2015-06-05      NA          NA           NA

Y
     AC.Name   Mandal.Name      Village.Name Tab.No Survey.Start.Date Survey.End.Date
1  Nandigama Chanderlapadu        Punnavalli      1        2015-05-23      2015-05-27
2  Nandigama Chanderlapadu        Kasarabada      1        2015-05-30      2015-06-07
3  Nandigama Chanderlapadu     Kodavatikallu      1        2015-06-09      2015-06-28
4  Nandigama Chanderlapadu        Thurlapadu      1        2015-06-29      2015-07-13
5  Nandigama Chanderlapadu     Chanderlapadu      1        2015-07-14      2015-07-25
6  Nandigama Chanderlapadu            Popuru      2        2015-05-23      2015-05-27
7  Nandigama Chanderlapadu        Kandrapadu      2        2015-05-30      2015-06-08
8  Nandigama Chanderlapadu Vibhareethalapadu      3        2015-05-30      2015-06-04
9  Nandigama Chanderlapadu             Eturu      3        2015-06-10      2015-06-23
10 Nandigama Chanderlapadu      Bobbillapadu      3        2015-06-26      2015-07-03

i.e I want to match the x and y by column Tab.No but also make sure that x$Survey.Date lies between y$Survey.Start.Date and y$Survey.End.Date. And if both conditions are not satisfied the row must have N.A values . I've tried and searched google stackoverflow and R-Studio help but was unable to get the desired result.

Z
     Response.No Tab.No Survey.Date AC.Name      Mandal.Name   Village.Name
1         9530      1  2015-05-26      Nandigama Chanderlapadu Punnavalli
2         6702      1  2015-05-30      Nandigama Chanderlapadu Kasarabada
3        26744      1  2015-05-31      Nandigama Chanderlapadu Kasarabada
4         8925      1  2015-06-03      Nandigama Chanderlapadu Kasarabada
5        20242      1  2015-06-04      Nandigama Chanderlapadu Kasarabada
6        21316      1  2015-06-04      Nandigama Chanderlapadu Kasarabada
7        28056      1  2015-06-04      Nandigama Chanderlapadu Kasarabada
8        12661      1  2015-06-05      Nandigama Chanderlapadu Kasarabada
9        17187      1  2015-06-05      Nandigama Chanderlapadu Kasarabada
10       28795      1  2015-06-05      Nandigama Chanderlapadu Kasarabada

I've already checked : 1. How to merge two dataframes in R based on two conditions, matching column and within a range? 2. roll join with start/end window 3. Conditional merge/replacement in R

I've been trying to solve this using merge(), cbind() and match() to no avail. I am able to merge using only serial but without the date condition.

Thank you for the help

Community
  • 1
  • 1
user5544183
  • 75
  • 2
  • 6
  • You'll probably get more help if you paste (as an edit) the result of `dput(head(X))` along with the same thing for `Y`. That way we can use your data and not have to make some up that might turn out to be different in a material way. – Bryan Hanson Nov 09 '15 at 22:11
  • In `X`, are columns `add1` and `add2` always `N.A`? – Bryan Hanson Nov 09 '15 at 22:14
  • yes there is no data for add1, add2 in dataframe x hence they are N.A – user5544183 Nov 09 '15 at 22:15
  • 1
    Then why not drop columns `add1` and `add2` from `X`, merge on `serial`, and then do your date comparison and update `add1` and `add2` to NA if the date doesn't meet your criteria? Is there something I don't understand? – Bryan Hanson Nov 09 '15 at 22:29
  • 1
    wait let me put the actual data in this question that might be really helpfull – user5544183 Nov 09 '15 at 22:33
  • You changed the data and now I can't figure out which columns are the same in both tables. Is it Tab.No? Why did you choose something that is only 1 in the first table? – Pierre Lapointe Nov 09 '15 at 22:46
  • Using `dput` will put the data into the question in such a way that we can simply copy and paste it. – Bryan Hanson Nov 09 '15 at 22:46
  • Yes it is Tab.No it is a long dataframe so first 10 rows have Tab.No =1. – user5544183 Nov 09 '15 at 22:52
  • dput(head(X)) structure(list(Response.No = c(9530, 6702, 26744, 8925, 20242, 21316), Tab.No = c(1, 1, 1, 1, 1, 1), Survey.Date = structure(c(16581, 16585, 16586, 16589, 16590, 16590), class = "Date"), AC.Name = c(NA, NA, NA, NA, NA, NA), Mandal.Name = c(NA, NA, NA, NA, NA, NA), Village.Name = c(NA, NA, NA, NA, NA, NA)), .Names = c("Response.No", "Tab.No", "Survey.Date", "AC.Name", "Mandal.Name", "Village.Name" ), row.names = c(NA, 6L), class = "data.frame") – user5544183 Nov 09 '15 at 22:53
  • dput(head(y)) is too long to post in a comment. – user5544183 Nov 09 '15 at 22:54
  • 1
    You posted a good reproducible example at the beginning, it showed different scenarios in your data and what you were trying to accomplish. The heading of your data does not, as it is limited to only one scenario. The best solution would be to post the dput of the first example that you posted in the body of your answer. dput is much better than the column notation, as we can copy and past directly. Not bad for your first question though, these things come with time! – Chris Nov 09 '15 at 22:57
  • Don't forget you can thank one of the answerers by clicking the green arrow only you can see under the vote count for the answer you feel best fits your question. – Bryan Hanson Nov 09 '15 at 23:27
  • I still din't get my answer :( – user5544183 Nov 09 '15 at 23:35

2 Answers2

4

Data:

x <- data.table(Tab.No = c(1,1,2), Survey.Date = as.Date(c("2015-5-26","2015-6-15","2015-4-03")))
y <- data.table(AC.Name = c("abc","xyz","qwe"),
                Mandal.Name = c("def","pqr","rty"),
                Village.Name = c("def","pqr","rty"),
                Tab.No = c(1,1,2), 
                Survey.Start.Date = as.Date(c("2015-5-30","2015-5-01","2015-5-05")), 
                Survey.End.Date = as.Date(c("2015-6-30","2015-5-29","2015-6-30")))

I would merge X on Y first, test the condition, then left join back on the x array:

using data table:

merge(x,merge(y,x,by = "Tab.No")[Survey.Date >= Survey.Start.Date & Survey.Date <= Survey.End.Date, list(Tab.No,AC.Name,Mandal.Name,Village.Name,Survey.Date)], by = c("Tab.No","Survey.Date"), all.x = T)
   serial       date add1 add2
1:      1 2015-05-26  xyz  pqr
2:      1 2015-06-15  abc  def
3:      2 2015-04-03   NA   NA

Broken out a little more clearly if you are unfamiliar with data.table:

z <- merge(y,x,by = "Tab.No")[Survey.Date >= Survey.Start.Date & Survey.Date <= Survey.End.Date, list(Tab.No,AC.Name,Mandal.Name,Village.Name,Survey.Date)]
merge(x,z, by = c("Tab.No","Survey.Date"), all.x = T)

Note that I ignored the NA columns in your x frame, they are unnecessary at the beginning

Chris
  • 6,302
  • 1
  • 27
  • 54
  • Chris, I tried your solution using data table and i get error object 'Tab.No' not found. – user5544183 Nov 09 '15 at 23:36
  • @user5544183 are you using the `data.table` package? Make sure that you have called `library(data.table)`, and then `setDT(x)`, `setDT(y)` – Chris Nov 09 '15 at 23:40
  • yes forgot to load the data.table package. BTW i get the following error when i run your solution. Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x), : Join results in 150 rows; more than 30 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. – user5544183 Nov 09 '15 at 23:53
  • @user5544183 for the 2 line solution, do you get the error on the `z<-` line or the `merge(x,z ...)` line – Chris Nov 10 '15 at 17:23
3

Here's how to do it with dplyr.

inner_join(X[,1:3],Y, by=c("Tab.No"))%>%
mutate(AC.Name = ifelse(Survey.Date>=Survey.Start.Date & Survey.Date<=Survey.End.Date, AC.Name ,NA),
Mandal.Name = ifelse(Survey.Date>=Survey.Start.Date & Survey.Date<=Survey.End.Date, Mandal.Name ,NA),
Village.Name = ifelse(Survey.Date>=Survey.Start.Date & Survey.Date<=Survey.End.Date, Village.Name ,NA))%>%
group_by(Tab.No)%>%
filter(!is.na(AC.Name)|n()==1)%>%
select(Response.No,Tab.No,Survey.Date,AC.Name,Mandal.Name,Village.Name)

result

   Response.No Tab.No Survey.Date   AC.Name   Mandal.Name Village.Name
         (int)  (int)      (date)     (chr)         (chr)        (chr)
1         9530      1  2015-05-26 Nandigama Chanderlapadu   Punnavalli
2         6702      1  2015-05-30 Nandigama Chanderlapadu   Kasarabada
3        26744      1  2015-05-31 Nandigama Chanderlapadu   Kasarabada
4         8925      1  2015-06-03 Nandigama Chanderlapadu   Kasarabada
5        20242      1  2015-06-04 Nandigama Chanderlapadu   Kasarabada
6        21316      1  2015-06-04 Nandigama Chanderlapadu   Kasarabada
7        28056      1  2015-06-04 Nandigama Chanderlapadu   Kasarabada
8        12661      1  2015-06-05 Nandigama Chanderlapadu   Kasarabada
9        17187      1  2015-06-05 Nandigama Chanderlapadu   Kasarabada
10       28795      1  2015-06-05 Nandigama Chanderlapadu   Kasarabada

data

X<-read.table(text="     Response.No Tab.No Survey.Date AC.Name Mandal.Name Village.Name
9530      1  2015-05-26      NA          NA           NA
6702      1  2015-05-30      NA          NA           NA
26744      1  2015-05-31      NA          NA           NA
8925      1  2015-06-03      NA          NA           NA
20242      1  2015-06-04      NA          NA           NA
21316      1  2015-06-04      NA          NA           NA
28056      1  2015-06-04      NA          NA           NA
12661      1  2015-06-05      NA          NA           NA
17187      1  2015-06-05      NA          NA           NA
28795      1  2015-06-05      NA          NA           NA
", header=T,stringsAsFactors =F)

Y<-read.table(text="AC.Name   Mandal.Name      Village.Name Tab.No Survey.Start.Date Survey.End.Date
Nandigama Chanderlapadu        Punnavalli      1        2015-05-23      2015-05-27
Nandigama Chanderlapadu        Kasarabada      1        2015-05-30      2015-06-07
Nandigama Chanderlapadu     Kodavatikallu      1        2015-06-09      2015-06-28
Nandigama Chanderlapadu        Thurlapadu      1        2015-06-29      2015-07-13
Nandigama Chanderlapadu     Chanderlapadu      1        2015-07-14      2015-07-25
Nandigama Chanderlapadu            Popuru      2        2015-05-23      2015-05-27
Nandigama Chanderlapadu        Kandrapadu      2        2015-05-30      2015-06-08
Nandigama Chanderlapadu Vibhareethalapadu      3        2015-05-30      2015-06-04
Nandigama Chanderlapadu             Eturu      3        2015-06-10      2015-06-23
Nandigama Chanderlapadu      Bobbillapadu      3        2015-06-26      2015-07-03
", header=T,stringsAsFactors =F)

X$Survey.Date <-as.Date(X$Survey.Date)
Y$Survey.Start.Date <-as.Date(Y$Survey.Start.Date)
Y$Survey.End.Date <-as.Date(Y$Survey.End.Date)
Pierre Lapointe
  • 16,017
  • 2
  • 43
  • 56
  • I have dplyr package but when I run your solution instead of getting AC.Name Mandal.Name Village.Name as (chr) I am getting it as int and numbers in the dataframe in the said columns . What seems to be the problem ? – user5544183 Nov 09 '15 at 23:45
  • Thank you so much, I got it working the problem was it was taking strings as factors and changed it back to char. – user5544183 Nov 10 '15 at 00:12