1

I know this question is asked a lot, but I only come to you because I tried everything (including the tips from similar questions that I managed to understand).

I have a rather big CSV file (> 16 000 rows), with, among others, a "Date" column, containing dates in the following format "01/01/1999". However, when loading the file, the column is not recognised as a date, but as a Factor with read.csv2, or a character with fread (data.table package). I loaded the lubridate library. In both cases, I tried to convert the column to a date format, using all methods I knew (column = Date, data = test):

as.Date(test$Date, format = "%d/%m/%Y", tz = "")

Or

strptime(test$Date, format = "%d/%m/%y", tz = "")

Or

as_date(test$Date)

And also the function dmy from lubridate, and

as.POSIXct(test$Date, "%d/%m/%y", tz = "").

I also tried changing the format: ymd instead of dmy, "-" instead of "/". I even tried to change the character class to numeric (when loaded with fread), and the factor class to numeric (when loaded with read.csv2).

Despite all of this, the columns stay in their factor / character classes.

Does someone know what I missed?

zx8754
  • 52,746
  • 12
  • 114
  • 209
  • How can you tell it is "%d/%m/%Y" as opposed to "%m/%d/%Y"? – akash87 Jan 21 '20 at 14:31
  • 1
    Just show us an excerpt of your data using `dput`, that we know what you're talking about. Please read: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610 – jay.sf Jan 21 '20 at 14:34
  • Just out of curiosity: you assigned your result, right? (`<-`) It sounds kind of suspicious that you write "*the columsn stay in their factor / character classes*" instead of "*this was the error I got*" – Georgery Jan 21 '20 at 14:43
  • akash87 : it is European data, the days are first, I made the mistake of not showing the data, so you could not see it. – Maître Cheminade Jan 21 '20 at 14:51
  • You don't need to add explanation what has solved your problem. Accepting the answer is enough. Hope to see you around the community. Cheers. – M-- Jan 21 '20 at 15:12

2 Answers2

1

Just use the anydate() function from the anytime package:

R> library(anytime)
R> var <- as.factor(c("01/01/1999", "01/02/1999"))
R> anydate(var)
[1] "1999-01-01" "1999-01-02"
R> 
R> class(anydate(var))
[1] "Date"
R> 
R> class(var)
[1] "factor"
R> 
R> 

It will read just about any input time, and convert it without requiring a format and this works as long as the represented is somewhat standard (i.e. we do not work with two-digit years etc).

(Otherwise you can of course also use the base R functions after first converting from factor back to character via as.character(). But anytime() and anydate() do that, and much more, for you too.)

Dirk Eddelbuettel
  • 360,940
  • 56
  • 644
  • 725
  • You are most welcome. Now, the way this works at StackOverflow, is that you can also "upvote" my answer (by clicking on the up-triangle) and "accept" my answer (by clicking on the tickmark -- only you as the question-asker see that). This is a good system to reward good questions, and good answers. By accepting you also get some karma points. (And as very new user you may not yet be able to upvote. That will come once you interacted more with the site.) – Dirk Eddelbuettel Jan 21 '20 at 15:11
  • Hello again, I think I identified why there was a problem in the first place, something that also created another problem with the "anydate" function : my data is in european date format (%d/%m/%Y), however anydate identified it as an american date format, returning all days after the 12th of a month as NA, since it reads them as months. Is there any way to work around this ? I tried to manually change the format in the .CSV file (by changing cell format), but then anydate only gives "NA". Thank you very much for your time – Maître Cheminade Jan 21 '20 at 19:21
  • I was worried about that too, and yes you can. In short, between Europe and the US the d.m and m/d formats can lead to confusion. So we heuristically guess on hyphen and slashes which "preferred by geography". But you can override the formats and your own at the beginning of the search list---see `help(addFormats)` for details, and `addFormats("%d/%m/%Y")` should do what you need. – Dirk Eddelbuettel Jan 21 '20 at 20:10
0

If you are using read.csv2, try

read.csv2(..., stringsAsFactors=F)

and then continue with as.Date

SebSta
  • 476
  • 2
  • 12