0

I wish to merge two data sets. The first data set contains information CO2 concentrations in an instrument at specific times. The second data set contains information on when a measurement was started and ended.

I would like to combine the two dataframes so that times from data set 1 that are within the time interval (start_date_time to end_date_time) in data set 2 end up on the same row. That is not very well explained, so purhaps it is easier to look at the data below.

I'm new to R so, any help is much appreciated.

The data looks similar to what is shown below.

Data set 1 
 co2               date_time
 0.2               2012-05-01 10:23:20
 0.2               2012-05-01 10:23:23
 0.1               2012-05-01 10:23:26 
 0.6               2012-05-01 10:23:29
 0.1               2012-05-01 10:23:32
 0.1               2012-05-01 10:23:35
 0.6               2012-05-01 10:23:38
 0.6               2012-05-01 10:23:41
 0.2               2012-05-01 10:23:44
 0.2               2012-05-01 10:23:47
 0.3               2012-05-01 10:23:50 
 0.3               2012-05-01 10:23:53
 0.4               2012-05-01 10:23:56
 0.4               2012-05-01 10:23:59
 0.3               2012-05-01 10:24:02
 0.3               2012-05-01 10:24:05
 0.4               2012-05-01 10:24:08
 0.4               2012-05-01 10:24:11
 0.6               2012-05-01 10:24:14
 0.6               2012-05-01 10:24:17
 0.7               2012-05-01 10:24:20
 0.7               2012-05-01 10:24:23
 0.3               2012-05-01 10:24:26
 0.7               2012-05-01 10:24:29
 0.3               2012-05-01 10:24:32
 0.4               2012-05-01 10:24:35
 0.2               2012-05-01 10:24:38
 0.3               2012-05-01 10:24:41
 0.3               2012-05-01 10:24:44
 0.3               2012-05-01 10:24:47
 0.4               2012-05-01 10:24:50
 0.1               2012-05-01 10:24:53
 0.1               2012-05-01 10:24:56

Data set 2
 jar          start_date_time        end_date_time
 1               2012-05-01 10:23:18    2012-05-01 10:23:30
 2               2012-05-01 10:23:31    2012-05-01 10:23:40    
 3               2012-05-01 10:23:41    2012-05-01 10:23:52
 4               2012-05-01 10:23:53    2012-05-01 10:24:01
 5               2012-05-01 10:24:02    2012-05-01 10:24:12
 6               2012-05-01 10:24:13    2012-05-01 10:24:20
 7               2012-05-01 10:24:30    2012-05-01 10:23:40
 8               2012-05-01 10:24:41    2012-05-01 10:24:52

Expected data set
jar start_date_time    end_date_time      co2   date_time
1   05-01-12 10:23:18  05-01-12 10:23:30  0.2   05-01-12 10:23:20
1   05-01-12 10:23:18  05-01-12 10:23:30  0.2   05-01-12 10:23:23
1   05-01-12 10:23:18  05-01-12 10:23:30  0.1   05-01-12 10:23:26
1   05-01-12 10:23:18  05-01-12 10:23:30  0.6   05-01-12 10:23:29
2   05-01-12 10:23:31  05-01-12 10:23:40  0.1   05-01-12 10:23:32
2   05-01-12 10:23:31  05-01-12 10:23:40  0.1   05-01-12 10:23:35
2   05-01-12 10:23:31  05-01-12 10:23:40  0.6   05-01-12 10:23:38
3   05-01-12 10:23:41  05-01-12 10:23:52  0.6   05-01-12 10:23:41
3   05-01-12 10:23:41  05-01-12 10:23:52  0.2   05-01-12 10:23:44
3   05-01-12 10:23:41  05-01-12 10:23:52  0.2   05-01-12 10:23:47
3   05-01-12 10:23:41  05-01-12 10:23:52  0.3   05-01-12 10:23:50
4   05-01-12 10:23:53  05-01-12 10:24:01  0.3   05-01-12 10:23:53
4   05-01-12 10:23:53  05-01-12 10:24:01  0.4   05-01-12 10:23:56
4   05-01-12 10:23:53  05-01-12 10:24:01  0.4   05-01-12 10:23:59
5   05-01-12 10:24:02  05-01-12 10:24:12  0.3   05-01-12 10:24:02
5   05-01-12 10:24:02  05-01-12 10:24:12  0.3   05-01-12 10:24:05
5   05-01-12 10:24:02  05-01-12 10:24:12  0.4   05-01-12 10:24:08
5   05-01-12 10:24:02  05-01-12 10:24:12  0.4   05-01-12 10:24:11
6   05-01-12 10:24:13  05-01-12 10:24:20  0.6   05-01-12 10:24:14
6   05-01-12 10:24:13  05-01-12 10:24:20  0.6   05-01-12 10:24:17
6   05-01-12 10:24:13  05-01-12 10:24:20  0.7   05-01-12 10:24:20
7   05-01-12 10:24:30  05-01-12 10:24:40  0.3   05-01-12 10:24:32
7   05-01-12 10:24:30  05-01-12 10:24:40  0.4   05-01-12 10:24:35
7   05-01-12 10:24:30  05-01-12 10:24:40  0.2   05-01-12 10:24:38
8   05-01-12 10:24:41  05-01-12 10:24:50  0.3   05-01-12 10:24:41
8   05-01-12 10:24:41  05-01-12 10:24:50  0.3   05-01-12 10:24:44
8   05-01-12 10:24:41  05-01-12 10:24:50  0.3   05-01-12 10:24:47
8   05-01-12 10:24:41  05-01-12 10:24:50  0.4   05-01-12 10:24:50

So far I have tried the following:

  sqldf("select * from df1 inner join df2 on (df1.date_time between df2.start_date_time and df2.end_date_time)")

R message: Error in match.fun(asfn) : 'c("as.hms", "as.difftime")' is not a function, character or symbol

and

sqldf("select * from df1 date_time inner join df2 start_date_time on (df1.date_time > df2.start_date_time and d1.date_time <= df2.end_date_time)")

R message: Error in result_create(conn@ptr, statement) : no such column: ch4.new_date_time

and

do.call(rbind, apply(df1, 1, function(x) {
  if(length(idx <- which(x["date_time"] >= df2$start_date_time & x["date_time"] <= df2$end_date_time)) > 0) {
    cbind(rbind(x), df2[idx,])
  }
}))

and as suggested by AEF

crossing(df1, df2) %>% 
filter(date_time >= start_date_time, date_time < end_date_time)

R message: [1] campaign date_time ch4_ppm co2_ppm jar_camp
[6] jar campaign1 year month day
[11] date bvoc start date_time bvoc end date_time bvoc start time bvoc end time
[16] start_date_time end_date_time ch4 start time ch4 end time o2 start date_time
[21] o2 end date_time o2 start time o2 end time comments
<0 rows> (or 0-lengt row.names)

Tiptop
  • 533
  • 5
  • 19
  • Have a look at Merge by Range in R - https://stackoverflow.com/q/11892241/10488504 or Matched Range Merge in R - https://stackoverflow.com/q/11851240/10488504 . – GKi Jul 18 '19 at 11:13

1 Answers1

2

I think you can do the following:

Read your data:

library(tidyverse)
library(lubridate)


ds1 <- 
str_replace_all(" co2_conc.         date_time
 0.2               2012-05-01 10:23:20
 0.2               2012-05-01 10:23:23
 0.1               2012-05-01 10:23:26 
 0.6               2012-05-01 10:23:29
 0.1               2012-05-01 10:23:32
 0.1               2012-05-01 10:23:35
 0.6               2012-05-01 10:23:38
 0.6               2012-05-01 10:23:41
 0.2               2012-05-01 10:23:44
 0.2               2012-05-01 10:23:47
 0.3               2012-05-01 10:23:50 
 0.3               2012-05-01 10:23:53
 0.4               2012-05-01 10:23:56
 0.4               2012-05-01 10:23:59
 0.3               2012-05-01 10:24:02
 0.3               2012-05-01 10:24:05
 0.4               2012-05-01 10:24:08
 0.4               2012-05-01 10:24:11
 0.6               2012-05-01 10:24:14
 0.6               2012-05-01 10:24:17
 0.7               2012-05-01 10:24:20
 0.7               2012-05-01 10:24:23
 0.3               2012-05-01 10:24:26
 0.7               2012-05-01 10:24:29
 0.3               2012-05-01 10:24:32
 0.4               2012-05-01 10:24:35
 0.2               2012-05-01 10:24:38
 0.3               2012-05-01 10:24:41
 0.3               2012-05-01 10:24:44
 0.3               2012-05-01 10:24:47
 0.4               2012-05-01 10:24:50
 0.1               2012-05-01 10:24:53
 0.1               2012-05-01 10:24:56", " {2,}", ";") %>% 
  read_delim(delim = ";", trim_ws = TRUE)

ds2 <-
str_replace_all("jar_no          start_date_time        end_date_time
 1               2012-05-01 10:23:18    2012-05-01 10:23:30
 2               2012-05-01 10:23:31    2012-05-01 10:23:40
 3               2012-05-01 10:23:41    2012-05-01 10:23:52
 4               2012-05-01 10:23:53    2012-05-01 10:24:01
 5               2012-05-01 10:24:02    2012-05-01 10:24:12
 6               2012-05-01 10:24:13    2012-05-01 10:24:20
 7               2012-05-01 10:24:30    2012-05-01 10:23:40
 8               2012-05-01 10:24:41    2012-05-01 10:24:52", " {2,}", ";") %>% 
  read_delim(delim = ";", trim_ws = TRUE)

The actual computation: Compute the cartesian product of the rows and filter

crossing(ds1, ds2) %>% 
  filter(date_time >= start_date_time,
         date_time < end_date_time)

However, the result is not exactly the same you posted as "expected result". I'm not sure if I misunderstood or if your example is incorrect. E.g. I cannot understand where the 4th line of the expected result comes from. There seem to be no corresponding start/end dates in dataset 2.

AEF
  • 5,408
  • 1
  • 16
  • 30
  • Thanks for your comment! You're right, there were some mistakes in the numbers I posted - that should be corrected now. I tried to use your code, however it did not give me the expected result - see above – Tiptop Jul 23 '19 at 09:47
  • Where is the difference? I just re-run it and it seems to be exatcly what you posted now as expected result. (Granted, the column order is different, but this can be easily adapted). – AEF Jul 23 '19 at 10:56
  • There were some date_times that ended up in the wrong interval. Anyway, I ended up doing the calculation in several steps instead. First I used setkey and then merged the two data tables by roll= "nearest". Having the data in one data table, I was able to do an indx based on time criterias. Not pretty, but working. Thanks a lot for your help though, I'm sure your code works, I just got stuck using it. – Tiptop Jul 24 '19 at 11:23