-1

I have a dataset as follows:

19/9/1997 22/9/1997 23/9/1997 24/9/1997 25/9/1997 26/9/1997 29/9/1997 30/9/1997

35440 35471 35499 35591 35621 35652 35683 35713

13/10/1997 14/10/1997 15/10/1997 16/10/1997 17/10/1997 20/10/1997 21/10/1997 22/10/1997 23/10/1997 24/10/1997 27/10/1997 28/10/1997 29/10/1997 30/10/1997 31/10/1997

35500 35531 35561 35592 35622 35714 35745 35775

13/11/1997 14/11/1997 17/11/1997 18/11/1997 19/11/1997 20/11/1997 21/11/1997 24/11/1997 ...

The Data that should be here are (for reproduction as requested) 19/9/1997 22/9/1997 23/9/1997 24/9/1997 25/9/1997 26/9/1997 29/9/1997 30/9/1997 10/01/1997 10/02/1997 10/03/1997 10/06/1997 10/07/1997 10/08/1997 10/09/1997 10/10/1997 13/10/1997 14/10/1997 15/10/1997 16/10/1997 17/10/1997 20/10/1997 21/10/1997 22/10/1997 23/10/1997 24/10/1997 27/10/1997 28/10/1997 29/10/1997 30/10/1997 31/10/1997 11/03/1997 11/04/1997 11/05/1997 11/06/1997 11/07/1997 11/10/1997 11/11/1997 11/12/1997 13/11/1997 14/11/1997 17/11/1997 18/11/1997 19/11/1997 20/11/1997 21/11/1997 24/11/1997

I have 5,149 rows of dates where there are numbers in places of dates. I tried fixing the missing dates with this: ATTEMPT 1 BEFORE REVISION:

  rm (list = ls(all=TRUE))
    graphics.off()
    library(readxl)
    Dates <- read_excel("F:/OneDrive - University of Tasmania/Mardi Meetings/Dataset/Dates.xlsx")
    x<-Dates[,1]
    library(date)
    library(datetime)
    ans <- Reduce(function(prev, curr) {
      f1 <- as.Date(curr, "%d/%m/%Y")
      f2 <- as.Date(curr, "%m/%d/%Y")
      if (is.na(f1)) return(f2)
      if (is.na(f2)) return(f1)
      if (prev < f1 && prev < f2) return(min(f1, f2))
      if (prev < f1) return(f1)
      if (prev < f2) return(f2)
    }, x[-1], init=as.Date(x[1], "%d/%m/%Y"), accumulate=TRUE)

    as.Date(ans, origin="1970-01-01")

But I am getting the following error:

+ }, x[-1], init=as.Date(x[1], "%d/%m/%Y"), accumulate=TRUE)
Error in Reduce(function(prev, curr) { : object 'x' not found
> 
> as.Date(ans, origin="1970-01-01")
Error in as.Date(ans, origin = "1970-01-01") : object 'ans' not found

Any suggestions will be highly appreciated.

OK AS PER ADVICE I REVISED THE CODE ATTEMPT 2 AFTER REVISION

    > rm (list = ls(all=TRUE))
    > graphics.off()
    > library(readxl)
    > Dates <- read_excel("F:/OneDrive - University of Tasmania/Mardi Meetings/Dataset/Dates.xlsx")
    > dput(head(Dates))
    structure(list(Date = c("33274", "33302", "33394", "33424", "33455", 
    "33486")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", 
    "data.frame"))
    > x<-Dates[[1]] 
    > library(date) 
    > library(datetime) 

    Attaching package: ‘datetime’

    The following object is masked from ‘package:date’:

        as.date

    > dates <- as.Date(x, format="%d/%m/%Y")
    > dput(head(dates))
    structure(c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), class = "Date")
    > head(dates,10)
     [1] NA           NA           NA           NA           NA           NA           NA          
     [8] "1991-05-13" "1991-05-14" "1991-05-15"
As you can see I have lost the corrupted dates completely

Today on 28th I tried again

> rm (list = ls(all=TRUE))
> graphics.off()
> library(readxl)
> Dates <- read_excel("F:/OneDrive - University of Tasmania/Mardi Meetings/Dataset/Dates.xlsx")
> x<-Dates[[1]] 
> 
> library(date) 
> library(datetime) 

Attaching package: ‘datetime’

The following object is masked from ‘package:date’:

    as.date

> formats <- c("%m/%d/%Y", "%d/%m/%Y", "%Y/%m/%d")
> dates <- as.Date(rep(NA, length(x)))
> for (fmt in formats) {
+   nas <- is.na(dates)
+   dates[nas] <- as.Date(as.integer(x[nas], format=fmt))
+ }
Error in as.Date.numeric(as.integer(x[nas], format = fmt)) : 
  'origin' must be supplied
In addition: Warning message:
In as.Date(as.integer(x[nas], format = fmt)) : NAs introduced by coercion
> dates <- as.Date(x, format="%d/%m/%Y")
> head(dates)
[1] NA NA NA NA NA NA
> head(dates, 10)
 [1] NA           NA           NA           NA           NA           NA           NA          
 [8] "1991-05-13" "1991-05-14" "1991-05-15"
Mihan
  • 189
  • 2
  • 15
  • 2
    Is this question about "corrupted dates" or is it about `'x' not found`? Since we don't have your data, it's rather difficult to troubleshoot that side of things. Perhaps you can provide the output from `dput(head(x))` before the call to `Reduce`? – r2evans Sep 12 '18 at 03:17
  • 1
    Hi,thank you for your response. X is a vector of dates. Just before the code, on top, you will see listed dates. They are rows in one column? As you can see, after 30/09/97 the following row is 3559. See that is why they are corrupted. The date information is collected from datastream. You have a few rows in dates format and some dates just turn to numbers. I need all the rows in dates, in order. The dates are not chronological. They represent some days in a week then some days not there. Again some days in next week. – Mihan Sep 13 '18 at 07:25
  • 4
    Please don't post code like `rm(list = ls())` in your question. We want to be able to copy and run your code without worrying that it will destroy work in progress on our computers. – Gregor Thomas Sep 27 '18 at 21:28
  • 2
    @Gregor, sometimes I wonder if I should be trying code like this in an R-sandbox ... is there an easy way to temporarily shift `.GlobalEnv` out of the way? Not foolproof, but closer ... :-) – r2evans Sep 27 '18 at 22:41
  • 2
    Not an easy way I know of... I sometimes keep two RStudio instances open, one for work and one for my stack overflow addiction. – Gregor Thomas Sep 27 '18 at 23:34

1 Answers1

4

You need none of the packages you've loaded, nor do you need to use Reduce, as functions we're using here are naturally "vectorized".

Here's a sample of your data. (A good question includes data in an easily copied format such as this.)

x <- c("19/9/1997", "22/9/1997", "23/9/1997", "24/9/1997", "25/9/1997",
       "26/9/1997", "29/9/1997", "30/9/1997",
       "35440", "35471", "35499", "35591", "35621",
       "35652", "35683", "35713")

dates <- as.Date(x, format="%d/%m/%Y")
dates
#  [1] "1997-09-19" "1997-09-22" "1997-09-23" "1997-09-24" "1997-09-25"
#  [6] "1997-09-26" "1997-09-29" "1997-09-30" NA           NA          
# [11] NA           NA           NA           NA           NA          
# [16] NA          

Not surprisingly, the second-half of the dates are not recognized given format="%d/%m/%Y". You mentioned the use of "%m/%d/%Y" in your question, so we can (1) do a literal second-pass for this format (un-utilized with this example, but still relevant for your work?):

dates[is.na(dates)] <- as.Date(x[is.na(dates)], format="%m/%d/%Y")

where [is.na(dates)] only works on the un-converted elements.

(2) If we have more than one other format, you can always use a vector of them and loop over them. (For this, I'll start over, since this loop would replace/augment the first steps above.)

formats <- c("%m/%d/%Y", "%d/%m/%Y", "%Y/%m/%d")
dates <- as.Date(rep(NA, length(x)))
for (fmt in formats) {
  nas <- is.na(dates)
  dates[nas] <- as.Date(x[nas], format=fmt)
}
dates
#  [1] "1997-09-19" "1997-09-22" "1997-09-23" "1997-09-24" "1997-09-25"
#  [6] "1997-09-26" "1997-09-29" "1997-09-30" NA           NA          
# [11] NA           NA           NA           NA           NA          
# [16] NA          

This still leaves us with NAs for the integer-looking ones. For these you need to specify the origin= to be able to figure it out (as well as converting to an integer). R typically works with an origin of "1970-01-01", which you can confirm with

as.integer(Sys.Date())
# [1] 17787
Sys.Date() - 17787
# [1] "1970-01-01"

but it appears that your dates have an origin of "1900-01-01", I think that's Excel's default storage of dates (but it doesn't matter here):

x[9]    # the first integer-looking element
# [1] "35440"
dates[1] - as.integer(x[9])
# [1] "1900-09-08"

(I'm assuming that your dates are from the same relative period of time.)

From here:

nas <- is.na(dates)
dates[nas] <- as.Date(as.integer(x[nas]), origin="1900-01-01")
dates
#  [1] "1997-09-19" "1997-09-22" "1997-09-23" "1997-09-24" "1997-09-25"
#  [6] "1997-09-26" "1997-09-29" "1997-09-30" "1997-01-12" "1997-02-12"
# [11] "1997-03-12" "1997-06-12" "1997-07-12" "1997-08-12" "1997-09-12"
# [16] "1997-10-12"

(Working on the indices of only NA elements is relatively efficient in that it only works on and replaces the not-yet-matched entries. If there is nothing left when it gets to another call to as.Date, it does still call it but with an argument of length 0, with which the function works rather efficiently. I don't think adding a conditional of if (any(nas)) ... would help, but if there are further methods you need that might be more "expensive", you can consider it.)

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • `Dates <- read_excel("F:/Dates.xlsx") x<-Dates[,1] library(date) library(datetime) dates <- as.Date(x, format="%d/%m/%Y")` gives me the error `Error in as.Date.default(x, format = "%d/%m/%Y") do not know how to convert 'x' to class “Date”` . So I could not proceed further. Please advise how to move to the net block and get past this error. – Mihan Sep 14 '18 at 18:59
  • 4
    R.Islam, I now see, it's obvious to me. `readxl::read_excel` returns a tibble, which is slightly different than a `data.frame`: when indexed with `[,1]`, a frame will (almost) always return a vector, a tibble will always return a (single-column) frame. Try `mtcars[,1]` versus `tbl_df(mtcars)[,1]`. In circumstances like this, *you need to look at your object* to know what is going on, where `class(x)` likely says `data.frame`, when `as.Date` only accepts vectors. (I'll say again: ***reproducible data sample***. The output from `dput(head(x))` would have made it obvious to many of us.) – r2evans Sep 14 '18 at 21:10
  • But I should have seen that in the example you did provide, so this is just as much on me for not reading beyond your literal question. – r2evans Sep 14 '18 at 21:13
  • 1
    So one answer is to replace `x<-Dates[,1]` with `x<-Dates[[1]]`. – r2evans Sep 18 '18 at 04:04
  • > head(dates,10) [1] NA NA NA NA NA NA [7] NA "1991-05-13" "1991-05-14" "1991-05-15" ......As you can see all the corrupted dates now become NAs. So how can i get the dates in place of NA. I guess this is progressing. – Mihan Sep 18 '18 at 20:12
  • 1
    R.Islam, I'm not really certain any more what the underlying problem is. Is it that your function doesn't know how to deal with `data.frame`/`tibble`? Or is it that you don't know how to use `Reduce` (which is the wrong function for this, btw) to convert your column to a `Date` type? Are you still using `Date[,1]` on a `tibble`? – r2evans Sep 18 '18 at 20:38
  • Indeed, I am new in r. However I tried your solution that turns corrupted dates into NAs – Mihan Sep 20 '18 at 00:12
  • 1
    Look, there really is nothing more I can do until you provide reproducible data. I know you *say* that you've provided it, but you haven't. Please provide the output from `dput(head(Dates))` (by editing your question and adding it), and I'll take a look. – r2evans Sep 20 '18 at 03:53
  • Edited up as you suggested. Please advise now for Block 2 – Mihan Sep 21 '18 at 18:46
  • 1
    @r2evans I never realized that we could use vector of formats, eye opener! – rahul Sep 21 '18 at 19:06
  • 1
    R.Islam, my code works perfectly for the sample `Dates` you included. Note that they are still `NA` at the end of the `for` loop, but the trailing cleanup (that looks for the integer-looking date strings) picks them up. – r2evans Sep 21 '18 at 20:59
  • 1
    I see where there might be confusion in my answer. There are eight lines of code to use: from `formats <- c(...); dates <- as.Date(rep(NA,...)); for (fmt in formats) { ...}; nas <- is.na(dates); dates[nas] <- as.Date(as.integer(x[nas]),...)`. Don't forget those last two, they are the "clean-up" for the integer-looking strings. – r2evans Sep 21 '18 at 21:37
  • Hi, what should i put in place of c(...)...what goes in place of the dots? thank you for your help – Mihan Sep 25 '18 at 20:59
  • 1
    I was shortening code that is already in the answer. That first line is `formats <- c("%m/%d/%Y", "%d/%m/%Y", "%Y/%m/%d")`. – r2evans Sep 25 '18 at 21:20
  • Hi, @r2evans, Please see the edited post above as per your suggestion....I am stuck – Mihan Sep 27 '18 at 21:13
  • 3
    @R.Islam The main suggestion in these comments seems to be that you should post reproducible sample data. With `dput()`. So it's copy/pasteable and reflects your actual data structure. **He's asked for it at least 3 times.** If I were r2evans, I would feel pretty stuck too. If you don't know how to provide reproducible data, please [see this FAQ on making reproducible examples in R](https://stackoverflow.com/a/5963610/903061). The one time you did use `dput`, it only captured integer-style dates, `"33274", "33302", "33394", "33424", "33455", "33486"`. Please find a representative subset. – Gregor Thomas Sep 27 '18 at 21:35
  • Hi, I am sorry if I could not reproduce the data as you wanted. As you can see even after using dput some dates are shown in integers. My data is collected from datastream and in between dates i have these integers. These have been my problem to begin with and I do not know how else to handle this. – Mihan Jan 01 '19 at 04:32
  • 1
    The code I provided correctly handles the data as you have provided it, but the more I think about it, the more I think the symptom you show (dates and integers) illustrates some fundamental problem in your spreadsheet: if you have some cells formatted as dates and some not, it will give inconsistent results. Try highlighting everything relevant within the spreadsheet and remove all formatting (in win10/excel, *Home>Editing>Clear>Clear Formats*). You'll see what Excel thought were dates shift to integers, everything remaining is unrecognized, meaning you need to normalize things. – r2evans Jan 02 '19 at 18:42
  • 1
    One possibility is that you learn what format Excel expects for dates and change all values to this format. For mine, it expects `yyyy-mm-dd`, anything else is a verbatim string. Another possibility is that you tell Excel to never try to do anything by converting the format from "Generic" to "Text", and then export it. You may need to back-convert your integers back to "Date", them copy/paste as strings to force it to be Text ... and then use R to read the data again. Bottom line, you have mixed formats that are being stored in different ways. (Excel thinks it's helping you ...) – r2evans Jan 02 '19 at 18:44
  • @r2evans, thank you so much my friend. This solved the problem. – Mihan Jun 15 '21 at 01:52