-2

I have a large dataset (17 million respondents) across 165 variables. I have a date variable that is formatted like this:

9/16/2009 12:35:00 PM 
9/16/2009 12:54:43 PM 
9/16/2009 1:35:00 PM 
... 

I am trying to extract the year. I have tried splitstr, but can't seem to actually get the new matrix or data frame to split into columns. Any help is appreciated!

This is unique from the other extracting date questions in that I need to extract from an entire dataframe of 17million responses.

KKing
  • 9
  • 1

2 Answers2

1

It is convenient to use separate from tidyr if we need to split by 'Date' and 'Time'

library(tidyr)
df2 %>% 
   separate(DateTime, into = c("Date", "Time"), sep=" ")

If we need to extract the year, use the gsub

df2$Year <- gsub(".*/|\\s+.*", "", df2$DateTime)

Or with as.POSIXct as it is 'DateTime'

df2$Year <- format(as.POSIXct(df2$DateTime, format = "%m/%d/%Y %H:%M:%S"), "%Y")

Or as @Jaap suggested in the comments

strptime(df2$DateTime, format = '%m/%d/%Y %I:%M:%S %p')$year + 1900

The difference between as.POSIXct and strptime is the class of the output object. With as.POSIXct, it has POSIXct class, while strptime is POSIXlt and POSIXct. In addition to the above, there are convenience in using either one. If somebody wants to extract different pieces like 'month', 'year' etc, strptime is more convenient. With as.POSIXct, we can wrap format to extract the 'month', 'year'. Also, note that strptime class have some clash in using with dplyr functions


Or use lubridate

library(lubridate)
df2$Year <- year(mdy_hms(df2$DateTime))

NOTE: It is better to use Date Time functions for extracting than with string manipulation

akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    I will try this thanks! – KKing Jan 19 '17 at 06:28
  • Every attempt I previously made to use date and time functions were not returning anything, I think because of the AM/PM designation at the end of the "string". The original dataset was a csv originating from a datacollection file. – KKing Jan 19 '17 at 06:56
  • 6
    An alternative: `strptime('9/16/2009 12:35:00 PM', format = '%m/%d/%Y %I:%M:%S %p')$year + 1900` – Jaap Jan 19 '17 at 07:00
  • 1
    @Jaap nice. For beginners like me this would be woth to be posted as answer (if it does not already exist). – Christoph Jan 19 '17 at 08:49
  • @Christoph I showed much more direct methods than something that requires summing with 1900. – akrun Jan 19 '17 at 08:50
  • Ok. But I like that it is base R and only one line. Are strptime and as.POSIXct connected? – Christoph Jan 19 '17 at 08:57
  • @Christoph `strptime` returns `POSiXlt` class, which may be good for extracting the elements, but it can fail if you work on `dplyr` functions, while `POSIXct` works for all cases – akrun Jan 19 '17 at 08:59
  • Thanks. I would appreciate, if you include the `strptime` and your comment in your answer! I didn't know that and I think it is important;-) – Christoph Jan 19 '17 at 09:02
0

If you want to extract the years, you can try this:

df <- read.csv(text='Date
9/16/2009 12:35:00 PM 
9/16/2009 12:54:43 PM 
9/16/2009 1:35:00 PM ')

df$year <- as.numeric(substring(as.character(df$Date), 6,9))
df$year
#[1] 2009 2009 2009
Sandipan Dey
  • 21,482
  • 2
  • 51
  • 63
  • Thanks, I need to keep the date in rows, like the original dataset. It has 17million rows, across 165 variables (my list of 3 was just an example of how that variables is formatted). Ultimately, I need to be able to run crosstabs by year on the other variables. – KKing Jan 19 '17 at 06:27
  • ok you can then add a column as updated. – Sandipan Dey Jan 19 '17 at 06:30