3

I have data with a timestamp of the form %m%d%Y with no leading zeroes.

Timestamp sample:

112001
1112001

Desired parsing

January 1 2001
January 11 2001 or November 1 2001 based on context

The timestamps are in sequential order. Is it possible to parse this data?

kilojoules

3 Answers

2

It is possible, but I think some preparatory work is needed first. This follows the same premise as @hrbrmstr's answer: pad the strings with the missing zeroes, which I think is what needs to be done before these dates can be parsed.

> x <- c("112001", "1112001")
> x1 <- ifelse(substring(x, 1, 1) != 0, paste0(0, x), x)
> x2 <- ifelse(nchar(x1) == 7 & substring(x1, 3, 3) != 0, 
               paste0(substring(x1, 1, 2), 0, substring(x1, 3)), x1)
> library(lubridate)
> parse_date_time(x2, "mdy")
[1] "2001-01-01 UTC" "2001-01-11 UTC"
Rich Scriven
  • Haha, almost. But I get the year 2012 for the first element where they should both be 2001. But I can tell you're about to blow my mind. – Rich Scriven Sep 13 '14 at 02:28
  • Ok Ok, the engine won't let me insert at multiple zero-width positions. `parse_date_time(gsub('^(?=.{6,7}$)', '0', perl=T, gsub('^\\d\\K(?!\\d{6})', '0', x, perl=T)), 'mdy')` – hwnd Sep 13 '14 at 02:42
  • Mind officially blown. You should post that as an answer. – Rich Scriven Sep 13 '14 at 02:46
  • Just for fun, of course your answer is more concise for this (+1). By the way, thanks for the feedback on the question I asked, http://stackoverflow.com/questions/25800042/overlapping-matches-in-r. You could post that as an answer and I'll upvote it. – hwnd Sep 13 '14 at 02:48
  • @hwnd, thanks. BTW, your regex skills are top notch. – Rich Scriven Sep 13 '14 at 02:59
  • Thanks, I appreciate the comment. – hwnd Sep 13 '14 at 03:08
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/61154/discussion-between-hwnd-and-richard-scriven). – hwnd Sep 13 '14 at 03:08
  • This doesn't appear to work in all cases (see my answer [here](http://stackoverflow.com/a/25391340/271616)). – Joshua Ulrich Sep 13 '14 at 14:51
  • @JoshuaUlrich - I saw that answer. Not really sure what to do, as if I change it the answer will be just like yours and I don't want to take your work. – Rich Scriven Sep 13 '14 at 17:18
  • Yeah, I think this question is a duplicate, but I don't want to vote as such, since my reputation level will automatically close it. I was hoping for some feedback on my comment on the question. – Joshua Ulrich Sep 13 '14 at 17:44
1

This is the basic logic for handling those date strings by length. You'll need to add logic for the "context" yourself, since we have no idea how the data are structured. I'm putting the samples in a vector as an example:

dates <- c(112001, 1112001)

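lapply(dates, function(x) {

  x <- as.character(x)

  if (nchar(x) == 6) {
    # 6 characters: 1-digit month, 1-digit day, 4-digit year; pad both
    as.Date(sprintf("0%s0%s%s", substr(x,1,1), substr(x,2,2), substr(x,3,6)), format="%m%d%Y")
  } else if (nchar(x) == 7) {
    # 7 characters: ambiguous; this assumes a 1-digit month and a 2-digit day
    as.Date(sprintf("0%s%s%s", substr(x,1,1), substr(x,2,3), substr(x,4,7)), format="%m%d%Y")
  } else {
    # 8 characters: already fully padded
    as.Date(x, format="%m%d%Y")
  }

})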

## [[1]]
## [1] "2001-01-01"
## 
## [[2]]
## [1] "2001-01-11"
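
One way the "context" logic could be filled in, as a minimal sketch: it assumes the only context available is that the timestamps are in non-decreasing order, as stated in the question. The helper names `candidate_dates` and `parse_sequential` are hypothetical. Each string is expanded into every valid month/day split, and the earliest reading that is not before the previously parsed date is kept. This is a greedy heuristic and can pick the wrong reading when the data have long gaps.

```r
# Hypothetical helper: every valid date a zero-stripped %m%d%Y string could mean
candidate_dates <- function(s) {
  n <- nchar(s)
  year <- substr(s, n - 3, n)              # last four characters are the year
  md <- substr(s, 1, n - 4)                # what is left holds month + day
  k <- nchar(md)
  # month widths (1 or 2 digits) that leave 1 or 2 digits for the day
  widths <- (1:2)[(k - 1:2) %in% 1:2]
  if (length(widths) == 0) return(as.Date(character(0)))
  month <- substring(md, 1, widths)
  day <- substring(md, widths + 1, k)
  # a component with a leading zero cannot occur in zero-stripped data
  ok <- !startsWith(month, "0") & !startsWith(day, "0")
  d <- as.Date(paste(year, month, day, sep = "-")[ok], format = "%Y-%m-%d")
  sort(d[!is.na(d)])                       # invalid splits (e.g. month 13) become NA
}

# Hypothetical helper: resolve ambiguity using the sequential order of the data
parse_sequential <- function(stamps) {
  out <- rep(as.Date(NA), length(stamps))
  prev <- NULL                             # last successfully parsed date
  for (i in seq_along(stamps)) {
    cands <- candidate_dates(as.character(stamps[i]))
    if (!is.null(prev)) cands <- cands[cands >= prev]
    if (length(cands) > 0) {
      out[i] <- cands[1]                   # earliest reading still consistent
      prev <- cands[1]
    }
  }
  out
}

parse_sequential(c("112001", "1112001", "10312001", "1112001"))
## [1] "2001-01-01" "2001-01-11" "2001-10-31" "2001-11-01"
```

Here the same string "1112001" is read as January 11 the first time and as November 1 once an October date has been seen. Strings with no consistent reading are left as NA.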
hrbrmstr
  • I used `lapply` so you'd see they are actual "date" objects. I'd (personally) probably use `sapply` but the output would have looked numeric and I didn't want to confuse matters. – hrbrmstr Sep 13 '14 at 01:26
0

You can parse a date from a string representation in a fixed format using strptime. You can then convert the result to a different representation using strftime.
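
As a small illustration of that round trip, here is a sketch that assumes the timestamp has already been zero-padded to eight digits, which the data in the question is not. The time zone is fixed to UTC only to keep the example reproducible.

```r
# A fully zero-padded timestamp parses unambiguously with a fixed format
parsed <- strptime("01112001", format = "%m%d%Y", tz = "UTC")

# Re-express the parsed value in a different representation
strftime(parsed, format = "%Y-%m-%d", tz = "UTC")
## [1] "2001-01-11"
```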

Supporting formats that cannot be parsed uniquely, and deciding "based on context", is not simple to implement, and you probably want to avoid going down that road.

KT.
  • I must parse this data. strptime isn't parsing the data correctly because of the timestamp format. – kilojoules Sep 13 '14 at 01:30
  • Well, then prepare for some annoying coding. If a dirty hack will do for you, then I'd go for something explicit like "if (length(data) == 6) {handle 1-1-4 split} else if (length(data) == 7) {handle 1-2-4/2-1-4 and "decide based on the context"} else {handle 2-2-4 split}". – KT. Sep 13 '14 at 01:37
  • Even the very flexible `parse_date_time` or `parse_date_time2` from `lubridate` (with learning) would have a hard time with this format/requirements. – hrbrmstr Sep 13 '14 at 01:39