Okay, I have gone through a couple of questions and I think I see a pattern in my data that makes this solvable (that is, I can differentiate between DMY and MDY).

Let me give a bit of background on the dataset. It contains information on people travelling during the current quarantine. As I scroll down the data, I see that most dates are entered in MDY format while some are entered in DMY format.

So for this dataset (which contains information up to August), the ambiguity can be resolved in the following way:

  1. Dates cannot be in the future. This resolves the ambiguity between 08-12-2020 and 12-08-2020. Also, I only have data up to June.
  2. Days change more quickly than months. If I see a number changing every few rows (say, every 20 rows), regardless of whether the format is DMY or MDY, then I know that the changing number is a day and not a month. Example (see the image and the rough sketch below):

[Image: changing numbers in a date]
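
To make point 2 (and the future-date rule from point 1) concrete, here is a rough sketch of the checks I have in mind. The column name `date_raw` and the toy values are made up and only stand in for the real raw character column:

library(dplyr)
library(tidyr)
library(lubridate)

# toy example: a short run of rows that are, in fact, all M-D-Y
raw <- tibble(date_raw = c("06-10-2020", "06-11-2020", "06-12-2020", "06-13-2020"))

# check 2: split each date into its blocks and see which block changes row to row;
# the block that keeps changing should be the day, the stable one the month
blocks <- raw %>%
  separate(date_raw, into = c("block1", "block2", "year"),
           sep = "-", convert = TRUE, remove = FALSE)
summarise(blocks,
          block1_changes = sum(diff(block1) != 0),  # 0 changes -> looks like the month
          block2_changes = sum(diff(block2) != 0))  # 3 changes -> looks like the day

# check 1: rule out any reading that lands after the known data window
data_ends <- ymd("2020-08-31")   # I only have data up to August
mdy("08-12-2020") <= data_ends   # TRUE  -> Aug 12 is possible
dmy("08-12-2020") <= data_ends   # FALSE -> Dec 8 would be in the future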

How would I correctly assign dates in this case?

Pss
    Using `as.Date` or `as.POSIXct` and a [previous answer](https://stackoverflow.com/a/60748268/3358272), can you get it to work? – r2evans Sep 03 '20 at 20:15
  • I believe that even if the parser can make sense of it with the first tried format, that does not necessarily mean it gets the correct date. If MDY is the expected default, any one of the dates in the OP's DMY-formatted example above would be parsed just fine, but would factually be wrong. It sounds like the only way to solve this is to rely on the observation that rows seem to be in chronological order, and you can sanity-check the observations in between "clear" cases such as 05/31/2020 and 06/13/2020 – alex_jwb90 Sep 03 '20 at 22:46
  • I was thinking that there would be some code where I first see which number gets changed per some rows. That number gets assigned as a "day" and then do the date parsing. I just don't know if there is a solution like that. – Pss Sep 03 '20 at 22:51
  • I think that this approach is a little tricky. Not knowing which block (second or third) of the date *is* the day, looking at the numeric difference to the previous row will not give much away. If the format changes from '2020/08/01' to '2020/01/08', you'd see a change of -7 on the second block and a change of +7 on the third. That won't solve your puzzle - not unless you find some anchor dates that can really only be interpreted one way, which then allows you the other logical deduction I've tried to lay out in my answer below. – alex_jwb90 Sep 03 '20 at 23:58

1 Answer


So, if I understand correctly, your data is in chronological order and you have some knowledge about the time-frame in which it was measured.

That given, here's my stab at creating scrambled data and then reparsing it:

library(dplyr)
library(lubridate)
library(zoo) # for na.locf function

# create some scrambled data to work with
df <- tibble(
    date_ground_truth = rep(seq(from = ymd('20190801'), to = ymd('20200730'), by = 1), each = 5)
  ) %>%
  mutate(
    date_inconsistent_chr = format(date_ground_truth, ifelse(runif(nrow(.), 0, 1) > 0.3, '%Y/%m/%d', '%Y/%d/%m'))
  )

# provide here the date range in which your observations lie. I only know that it runs until the end of July 2020, so my end result has some remaining unknowns at the start
daterange_known_min <- NA_Date_
daterange_known_max <- ymd('20200730')

# initiate a cleaned df - for any date where we have a day > 12, we know that it can only be one format of YMD/YDM
df_recleaned <- df %>%
  mutate(
    date_parsed_ymd = as.Date(date_inconsistent_chr, '%Y/%m/%d'), # try YMD
    date_parsed_ydm = as.Date(date_inconsistent_chr, '%Y/%d/%m'), # try YDM
    
    date_parsed_deducted = case_when( # write out the clear-cut cases
      day(date_parsed_ymd) > 12 ~ date_parsed_ymd,
      day(date_parsed_ydm) > 12 ~ date_parsed_ydm,
      date_parsed_ymd == date_parsed_ydm ~ date_parsed_ymd,
      TRUE ~ NA_Date_
    )
  )

# we will run over the data until we cannot deduce any more new dates from what we've learnt so far:

new_guesses_possible <- TRUE
while(new_guesses_possible) {
  # how many dates did we already deduce?
  num_deducted_dates_before <- df_recleaned %>% filter(!is.na(date_parsed_deducted)) %>% nrow()
  
  # deduce more, based on our knowledge that the dates are chronological and within a certain time-frame
  df_recleaned <- df_recleaned %>%
    mutate(
      earliest_possible_date = coalesce(na.locf(date_parsed_deducted, na.rm = FALSE), daterange_known_min),
      last_possible_date = coalesce(na.locf(date_parsed_deducted, fromLast = TRUE, na.rm = FALSE), daterange_known_max),
      
      ymd_guess_in_range = coalesce(date_parsed_ymd >= earliest_possible_date & date_parsed_ymd <= last_possible_date, FALSE),
      ydm_guess_in_range = coalesce(date_parsed_ydm >= earliest_possible_date & date_parsed_ydm <= last_possible_date, FALSE),
      
      date_parsed_deducted = case_when(
        # keep clear cases
        !is.na(date_parsed_deducted) ~ date_parsed_deducted,
        
        # if the ymd-guess falls between the last clear case and next clear case, take ymd
        ymd_guess_in_range & !ydm_guess_in_range ~ date_parsed_ymd,
        # same approach for ydm
        ydm_guess_in_range & !ymd_guess_in_range ~ date_parsed_ydm,
        
        # cover the cases where we don't know either the last or next clear parsed date
        # if one of the parsed dates falls before the "first possible date", take the other one.
        #   (if no daterange_known_min is given, these rows will result in NA and not do anything...)
        date_parsed_ymd < daterange_known_min ~ date_parsed_ydm,
        date_parsed_ydm < daterange_known_min ~ date_parsed_ymd,
        # inversely, if one parsed option falls after the "last possible date", ignore it.
        date_parsed_ymd > daterange_known_max ~ date_parsed_ydm,
        date_parsed_ydm > daterange_known_max ~ date_parsed_ymd
      )
    )
  
  # how many dates did we now deduce?
  num_deducted_dates_after <- df_recleaned %>% filter(!is.na(date_parsed_deducted)) %>% nrow()
  
  # do we need to go on?
  new_guesses_possible <- num_deducted_dates_after > 0 && num_deducted_dates_before != num_deducted_dates_after
}

# kick out all those extra columns :)
df_recleaned_final <- df_recleaned %>%
  select(
    -date_parsed_ymd, -date_parsed_ydm,
    -earliest_possible_date, -last_possible_date,
    -ymd_guess_in_range, -ydm_guess_in_range
  )

In my example, this fixes all dates after the first week of August 2019.
It might give you different results if there are longer gaps in your data.
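
If you want to see how well this worked on the scrambled toy data, one quick check is to compare the deduced column against `date_ground_truth`, which is still in the frame; something along these lines:

df_recleaned_final %>%
  summarise(
    n_rows      = n(),
    n_recovered = sum(!is.na(date_parsed_deducted)),
    n_correct   = sum(date_parsed_deducted == date_ground_truth, na.rm = TRUE)
  )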

alex_jwb90
  • Thank you for this! I'm going to run this in some time and get back to you. – Pss Sep 03 '20 at 23:50
  • Hope it helps. If you know the first possible date, please make sure to edit that line and set `daterange_known_min <- ymd('yyyymmdd')` :) – alex_jwb90 Sep 03 '20 at 23:59