19

I have a column with dates as character in the format 10/17/2017 12:00:00 AM. I want parse the string and keep only the date part as class Date, i.e. 2017-10-17. I am using -

df$ReportDate = as.Date(df$ReportDate, format = "%m/%d/%Y %I:%M:%S %p") 
df$ReportDate = as.Date(format(df$ReportDate, "%Y-%m-%d"))

this works, but the dataframe has over 5 million rows so this takes close to two mins.

  user  system elapsed 
104.73    0.55  105.46 

Is there a faster and more efficient way to do this?

Henrik
  • 65,555
  • 14
  • 143
  • 159
phil_t
  • 851
  • 2
  • 7
  • 17
  • An excel file has over 5M rows? I am guessing its a csv file. try using `fread` from `datatable` package – Dinesh.hmn Oct 17 '17 at 15:29
  • Sorry I think my question was unclear. It is not the reading process that takes time but just the date conversion process. Will edit. – phil_t Oct 17 '17 at 15:32
  • 1
    Is this for computations or for reporting. Your two lines of code return equivalent vectors, at least for me. `df <- rep("8/31/2017 12:30:00 pm", 5e6); df <- as.Date(df, format = "%m/%d/%Y %I:%M:%S %p"); df1 <- as.Date(format(df, "%Y-%m-%d")); all(df == df1)` – emilliman5 Oct 17 '17 at 15:44
  • You could use `substr` to extract only the date part - leave the time behind - and then use `as.Date(df$ReportDate, format = "%m/%d/%Y)` might be faster without parsing the time as well - I'm not sure. I'm also not sure why you are using `as.Date` twice. Isn't once enough? – Gregor Thomas Oct 17 '17 at 15:44
  • Sorry, why dont you try using `strptime` – Dinesh.hmn Oct 17 '17 at 15:50
  • I'd also suggest a dupe: [Is there a fast parser for Date?](https://stackoverflow.com/q/35247063/903061). The answer there suggests a roughly 2x speedup using the `fasttime` package. – Gregor Thomas Oct 17 '17 at 15:51
  • 3
    `as.Date("10/17/2017 12:00:00 AM", format = "%m/%d/%Y")` is enough. "_Each input string is processed as far as necessary for the format specified: any trailing characters are ignored._" – Henrik Oct 17 '17 at 15:59
  • @Gregor I use as.Date twice because the dates are originally stored in character type (the dataframe is imported using fread). So the first conversion converts to date object. I then extract the date using `format`, which converts the type back to character. Since I need it as a date object for further operations, I use as.Date a second time to convert to date object. – phil_t Oct 17 '17 at 16:00
  • @Henrik That works! And it takes 4 seconds, so about 25x faster. If you post as answer, I will accept. – phil_t Oct 17 '17 at 16:03
  • The second conversion is redundant. As you say, the first conversion converts to a Date object. The exact same Date object that your second line converts to. – Gregor Thomas Oct 17 '17 at 16:03

2 Answers2

20

Note that as.Date will ignore junk after the date so this takes less than 10 seconds on my not particularly fast laptop:

xx <- rep("10/17/2017 12:00:00 AM", 5000000) # test input
system.time(as.Date(xx, "%m/%d/%Y"))
## user  system elapsed 
## 9.57    0.20    9.82 
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
0

It might be more efficient to convert from Date to character:

# Create dummy data
date_from <- as.Date('01/01/1999  12:00:00 AM',
    format = "%m/%d/%Y %I:%M:%S %p")

date_to <- as.Date('01/01/2017  12:00:00 AM',
    format = "%m/%d/%Y %I:%M:%S %p")

df <- data.frame(ReportDate = sample(seq(from = date_from,
    to = date_to, by = "day"), 5000000, T))

# Convert to char
start <- Sys.time()
df$ReportDate <- as.character(df$ReportDate)

Sys.time() - start
Time difference of 12.37254 secs

head(df)
  ReportDate
1 2011-08-04
2 2013-11-15
3 2002-09-08
4 2011-07-01
5 2011-01-22
6 2001-01-04
tobiasegli_te
  • 1,413
  • 1
  • 12
  • 18
  • I need to keep it as `Date` for further operations hence the final conversion using `as.Date()` – phil_t Oct 17 '17 at 16:05
  • I see, that was not entirely clear from the question. I had overlooked that you initially have the data stored as `character` and not `Date`, so it makes sense to use @Henriks solution. – tobiasegli_te Oct 17 '17 at 16:09