0

I want to format several columns in datatable/dataframe using lubridate and column indexing.

Suppose that there is a very large data set which has several unformatted date columns. The question is how can I identify those columns (most likely through indexing) and then format them at the same time in one script using lubridate.

library(data.table)
library (lubridate) 

> dt <- data.frame(date1 = c("14.01.2009", "9/2/2005",  "24/1/2010", "28.01.2014"),var1 = rnorm(4,2,1), date2 = c("09.01.2009", "23/8/2005","17.01.2000", "04.01.2005"))
> dt
       date1     var1      date2
1 14.01.2009 2.919293 09.01.2009
2   9/2/2005 2.390123  23/8/2005
3  24/1/2010 0.878209 17.01.2000
4 28.01.2014 2.224461 04.01.2005

dt <- setDT(dt)

I tried these :

> dmy(dt$date1,dt$date2)# his dose not generate two columns
[1] "2009-01-14" "2005-02-09" "2010-01-24" "2014-01-28" "2009-01-09" "2005-08-23"
[7] "2000-01-17" "2005-01-04"

> as.data.frame(dmy(dt$date1,dt$date2)) 
  dmy(dt$date1, dt$date2) # this dose not generate two columns either  
1              2009-01-14
2              2005-02-09
3              2010-01-24
4              2014-01-28
5              2009-01-09
6              2005-08-23
7              2000-01-17
8              2005-01-04


dmy(dt[,.SD, .SD =c(1,3)])
[1] NA NA

> sapply(dmy(dt$date1,dt$date2),dmy)
[1] NA NA NA NA NA NA NA NA
Warning messages:
1: All formats failed to parse. No formats found. 

Any help is highly appreciated.

vestland
  • 55,229
  • 37
  • 187
  • 305
Daniel
  • 1,202
  • 2
  • 16
  • 25

3 Answers3

2

How about:

dt <- data.frame(date1 = c("14.01.2009", "9/2/2005",  "24/1/2010", "28.01.2014"),var1 = rnorm(4,2,1), date2 = c("09.01.2009", "23/8/2005","17.01.2000", "04.01.2005"))

for(i in c(1,3)){
     dt[,i] <- dmy(dt[,i])
}
Luís Telles
  • 694
  • 3
  • 13
  • 1
    Not really a dynamic solution. Only works if you have exactly the example data – Val Jul 12 '17 at 14:00
  • How about the data.table format? Do you know how I can do it if it would be the data.table . because your answer only works for data.frame. – Daniel Jul 12 '17 at 14:01
  • 1
    I'm not a `data.table`user. Maybe the answer from @juan fits you better! – Luís Telles Jul 12 '17 at 14:02
  • I found @LuisTelles solution worked fine on a set of about 60 contiguous factor columns in my dataframe that needed conversion to date. – Ben Oct 10 '18 at 18:04
1

Here's a data.table way. Suppose you have k columns named dateX:

k = 2
date_cols = paste0('date', 1:k)
for (col in date_cols) {
    set(dt, j=col, value=dmy(dt[[col]])
}

You can avoid the loop, but apparently the loop may be faster; see this answer

dt[,(date_cols) := lapply(.SD, dmy), .SDcols=date_cols]

EDIT

If you have aribitray column names, assuming data looks as in OP

date_cols = names(dt)[grep("^\\d{4}(\\.|/)", names(dt))]
date_cols = c(date_cols, names(dt)[grep("(\\.|/)\\d{4}", names(dt))])

You can add regular expressions if there are more delimiters than . or /, and you can combine this into a single grep but this is clearer to me.

juan
  • 398
  • 1
  • 14
  • Also not really dynamic. OP said column names could be _anything_ – Val Jul 12 '17 at 14:07
  • So? OP can collect all of the relevant column names and the solution still applies. Also, that wasn't in OP's original post. – juan Jul 12 '17 at 14:09
  • Well, it says it's a very large dataset. You're proposing a solution that won't work. The post stays "The question is that how I can identified those columns" ... something you don't explain. – Val Jul 12 '17 at 14:12
  • Yes, you are correct that I don't explain how to identify columns with *arbitrary* names, which was not clear from the original post. – juan Jul 12 '17 at 14:19
  • You should read the comments, they tend to contain valuable information. – Val Jul 12 '17 at 14:24
  • There's no need for snark. I posted before seeing that comment. Also, I just edited my answer. Thank you, though. – juan Jul 12 '17 at 14:25
1

Far from perfect, this is a solution which should be more general:

The only assumption here is, that the date columns contain digits separated by either . , / or -. If there's other separators, they may be added. But if you have another variable which is similar, but not a date, this won't work well.

for (j in seq_along(dt)) if (all(grepl('\\d+(\\.|/|-)\\d+(\\.|/|-)\\d+',dt[,j]))) dt[,j] <- dmy(dt[,j])

This loops through the columns and checks if a date could be present using regular expressions. If so, it will convert it to a date and overwrite the column.

Using data.table:

for (j in seg_along(dt)) if (all(grepl('\\d+(\\.|/|-)\\d+(\\.|/|-)\\d+',dt[,j]))) set(dt,j = j, value = dmy(dt[[j]]))

You could also replace all with any with the idea that if you have any match in the column, you could assume all of the values in that column are dates which can be read by dmy.

Val
  • 6,585
  • 5
  • 22
  • 52
  • I think you should add the `-` character to your regex in `grep`, he has dates in the format `yyyy-mm-dd`. This is really a more generalistic approach and I think it's better fitted to the situation. – Luís Telles Jul 12 '17 at 14:30
  • Yeah right, should be added. Didn't see it in the example. – Val Jul 12 '17 at 14:33
  • @Val Thanks a lot for such a comprehensive response ! – Daniel Jul 12 '17 at 14:38
  • Hello, I tried the script: for (j in seg_along(dt)) if (all(grepl('\\d+(\\.|/|-)\\d+(\\.|/|-)\\d+',dt[,j]))) set(dt,j = j, value = dmy(dt[[j]])) works for a dataframe but not for data.table – Mathieu L Jun 02 '20 at 15:42