4

I have a table of about 50 000 rows, with four columns.

ID     Arrival             Departure             Gender

1   10/04/2015 23:14    11/04/2015 00:21           F
1   11/04/2015 07:59    11/04/2015 08:08           F
3   10/04/2017 21:53    30/03/2017 23:37           M
3   31/03/2017 07:09    31/03/2017 07:57           M
3   01/04/2017 01:32    01/04/2017 01:35           M
3   01/04/2017 13:09    01/04/2017 14:23           M
6   10/04/2015 21:31    10/04/2015 23:17           F
6   10/04/2015 23:48    11/04/2015 00:05           F
6   01/04/2016 21:45    01/04/2016 22:48           F
6   02/04/2016 04:54    02/04/2016 07:38           F
6   04/04/2016 18:41    04/04/2016 22:48           F
10  10/04/2015 22:39    11/04/2015 00:42           M
10  13/04/2015 02:57    13/04/2015 03:07           M
10  31/03/2016 22:29    01/04/2016 08:39           M
10  01/04/2016 18:49    01/04/2016 19:44           M
10  01/04/2016 22:28    02/04/2016 00:31           M
10  05/04/2017 09:27    05/04/2017 09:28           M 
10  06/04/2017 15:12    06/04/2017 15:43           M

This is a very small representation of the table. What I want to find out is, at the same time as each entry, how many others were present and then separate them by gender. So, say for example that at the time as the first presence of person with ID 1, person with ID 6 was present and person with ID 10 was present twice in the same interval. That would mean that at the same time, 2 other overlaps occurred. This also means that person with ID 1 has overlapped with 1 Male and 1 Female.

So its result should look like:

ID           Arrival            Departure         Males encountered        Females encountered
1       10/04/2015 23:14    11/04/2015 00:21             1                          1

How would I be able to calculate this? I have tried to work with foverlaps and have managed to solve this with Excel, but I would want to do it in R.

James Z
  • 12,209
  • 10
  • 24
  • 44
  • I cannot link your output to your sample data.. where would the `Males encountered == 1` come from? Your sample data seems not to contain any overlapping periods... – Wimpel Oct 05 '19 at 11:39
  • I have now made a small change so it would reflect the output. Thank you for pointing it out and sorry for the inconvenience. – Marco Badici Oct 05 '19 at 12:03

2 Answers2

1

Here's one possibility. This uses lubridate's interval and the int_overlaps function that finds date overlaps. That has a drawback though: Interval doesn't work with dplyr. So this version is just doing all the work manually in a for loop.

It starts by making a 1000 row random dataset that matches yours: each person arrives in a two year period and departs one or two days later.

It's taking about 24 seconds for 1000 to run so you can expect it to take a while for 50K! The for loop outputs the row number so you can see where it is though.

Any questions about the code, lemme know.

There must be a faster vectorised way but interval didn't seem to play nice with apply either. Someone else might have something quicker...

Final output looks like this

library(tidyverse)
library(lubridate)

#Sample data:
#(Date sampling code: https://stackoverflow.com/questions/21502332/generating-random-dates)
#Random dates between 2017 and 2019
x <- data.frame(
  ID = c(1:1000),
  Arrival = sample(seq(as.Date('2017/01/01'), as.Date('2019/01/01'), by="day"), 1000, replace = T),
  Gender = ifelse(rbinom(1000,1,0.5),'Male','Female')#Random Male female 50% probabiliity
)

#Make departure one or two days after arrival
x$Departure = x$Arrival + sample(1:2,1000, replace=T)


#Lubridate has a function for checking whether date intervals overlap
#https://lubridate.tidyverse.org/reference/interval.html

#So first, let's make the arrival and departure dates into intervals
x$interval <- interval(x$Arrival,x$Departure)


#Then for every person / row
#We want to know if their interval overlaps with the rest

#At the moment, dplyr doesn't play nice with interval
#https://github.com/tidyverse/dplyr/issues/3206

#So let's go through each row and do this manually
#Keep each person's result in list initially
gendercounts <- list()

#Check timing
t <- proc.time()

#Go through every row manually (sigh!
for(i in 1:nrow(x)){

  print(paste0("Row ",i))

  #exclude self (don't want to check date overlap with myself)
  overlapcheck <- x[x$ID != x$ID[i],]

  #Find out what dates this person overlaps with - can do all other intervals in one command
  overlapcheck$overlaps <- int_overlaps(x$interval[i],overlapcheck$interval)

  #Eyeball check that is finding the overlaps we want
  #Is this ID date overlapping? Tick
  #View(overlapcheck[overlapcheck$overlaps,])

  #Use dplyr to find out the number of overlaps for male and female
  #Keep only columns where the overlap is TRUE
  #Also drop the interval column first tho as dplyr doesn't like it... (not tidy!)
  gendercount <- overlapcheck %>% 
    select(-interval) %>% 
    filter(overlaps) %>% 
    group_by(Gender) %>%
    summarise(count = n()) %>% #Get count of observations for each overlap for each sex
    complete(Gender, fill = list(count = 0))#Need this to keep zero counts: summarise drops them otherwise


  #We want count for each gender in their own column, so make wide
  gendercount <- gendercount %>% 
    spread(key = Gender, value = count)

  #Store for turning into dataframe shortly
  gendercounts[[length(gendercounts)+1]] <- gendercount

}

#Dlyr command: turn list into dataframe
gendercounts <- bind_rows(gendercounts)

#End result. Drop interval column, order columns
final <- cbind(x,gendercounts) %>% 
  select(ID,Arrival,Departure,Gender,Male,Female)

#~24 seconds per thousand
proc.time()-t

Dan Olner
  • 11
  • 3
  • Do you think this will work for date times as well? Or just for date, containting year, month and day. – Marco Badici Oct 05 '19 at 13:02
  • Yup, should work for time too. This little example of an overlapping time on the same day works fine. You'll just need to make sure your date time columns are converted to date time objects. If you load in using the tidyverse's read_csv I think it should do that automatically. `one <- interval(ymd_hms("2010-08-01 00:50:50"),ymd_hms("2010-08-01 00:55:50")) two <- interval(ymd_hms("2010-08-01 00:53:50"),ymd_hms("2010-08-01 00:57:50")) int_overlaps(one,two) ` That's using [lubridates date time parsing](https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html). – Dan Olner Oct 05 '19 at 13:09
1

Here is a data.table solution using foverlaps.

First, notice that there's an error in your data:

ID           Arrival           Departure      Gender
3   10/04/2017 21:53    30/03/2017 23:37           M

The user arrived almost one month after he actually left. I needed to get rid of that data in order for foverlaps to run.

library(data.table)

dt <- data.table(df)
dt <- dt[Departure > Arrival, ]  # filter wrong cases

setkey(dt, "Arrival", "Departure")  # prepare for foverlaps
dt2 <- copy(dt)  # use a different dt, inherits the key

run foverlaps and then

  • filter (leave only) the cases where arrival of second person is before than ID and same user-cases.
  • Add a variable where we count the male simultaneous guests and
  • a variable where we count the female simultaneous guests, all grouped by ID and arrival

.

simultaneous <- foverlaps(dt, dt2)[i.Arrival <= Arrival & ID != i.ID,
                                       .(malesEncountered = sum(i.Gender == "M"),
                                         femalesEncountered = sum(i.Gender == "F")), 
                                       by = .(ID, Arrival)]

Join the findings of the previous command with our original table on ID and arrival

result <- simultaneous[dt, on = .(ID, Arrival)]

<EDIT>: Convert to zero the NAs in malesEncountered and femalesEncountered: </EDIT>

result[is.na(malesEncountered), malesEncountered := 0][
                 is.na(femalesEncountered), femalesEncountered := o]

set the column order to something nicer

setcolorder(result, c(1, 2, 5, 6, 3, 4))[]
PavoDive
  • 6,322
  • 2
  • 29
  • 55
  • The last two columns in by.x should correspond to the 'start' and 'end' intervals in data.table 'x' and must be integer/numeric type. – Marco Badici Oct 05 '19 at 20:24
  • I get this error when I try to run the command for simultaneous. – Marco Badici Oct 05 '19 at 20:25
  • That means that your `Arrival` (and likely your `Departure`) columns are still text, not date. Try and study `lubridate::ymd_hm` and similar functions to convert your character columns into dates **before** `setkey` – PavoDive Oct 05 '19 at 22:24
  • Edit your question and paste the result of `dput(head(df))` and I'll be happy to address the problem you're having – PavoDive Oct 05 '19 at 23:24
  • Ok, so I have edited the code and have converted the Arrival and Departure into dates. The code runs without issues, but some rows remain as NAs instead of a count. – Marco Badici Oct 06 '19 at 09:47
  • @MarcoBadici I have edited my answer. If it works you may consider ticking the check mark to indicate it is an accepted answer – PavoDive Oct 06 '19 at 10:12