1

I acknowledge that there are multiple similar questions, but I did not find so far an answer, which works for me, maybe because of AM/PM. I would like not to remove the latter.

I have a column Trip Start Timestamp, which looks like

12/01/2019 12:30:00 AM
12/01/2019 12:31:00 AM
12/01/2019 12:32:00 AM 

I am trying to remove AM/PM and split into two variables Start date and Start time.

Ideal output:

Start date Start time
12/01/2019 12:30:00 AM

I would like to read it as time series.

My best guess so far

Date <- format(as.POSIXct(strptime(taxi_2020$`Trip Start Timestamp`, "%d/%m/%Y %H:%M:S")), format = "%m/%d/%Y")
Time <- format(as.POSIXct(strptime(taxi_2020$`Trip Start Timestamp`, "%d/%m/%Y %H:%M:S")), format = "%H:%M:S")
head(Date)
head(Time)

Gives me

[1] NA NA NA NA NA NA
[1] NA NA NA NA NA NA

Update

It looks like columns with time and date has some problems with format.

dput can be found here

So far the solution by @Ronak Shah works. Technically I finally separated the data, but probably identified another problem with UTF-8

Anakin Skywalker
  • 2,400
  • 5
  • 35
  • 63

4 Answers4

3

Since you tagged this with tidyverse, here's a simple approach with lubridate:

library(dplyr)
library(lubridate)
data %>%
  mutate(Date = as.Date(mdy_hms(`Trip Start Timestamp`)), 
         Time = format(mdy_hms(`Trip Start Timestamp`), "%I:%M:%S %p"))
#    Trip Start Timestamp       Date        Time
#1 12/01/2019 12:30:00 AM 2019-12-01 12:30:00 AM
#2 12/01/2019 12:31:00 AM 2019-12-01 12:31:00 AM
#3 12/01/2019 12:32:00 AM 2019-12-01 12:32:00 AM

Example Data

data <- structure(list(`Trip Start Timestamp` = c("12/01/2019 12:30:00 AM", 
"12/01/2019 12:31:00 AM", "12/01/2019 12:32:00 AM")), class = "data.frame", row.names = c(NA, 
-3L))
Ian Campbell
  • 23,484
  • 14
  • 36
  • 57
  • Thanks a lot, you were the first and in tidyverse, so I will accept your answer. – Anakin Skywalker Jun 16 '20 at 02:50
  • Error: Column `Time` must be length 4137294 (the number of rows) or one, not 3. – Anakin Skywalker Jun 16 '20 at 03:08
  • the code now works, but the initial column `Trip Start Timestamp` is NA now and new columns also NA. Date as NA and Time as character NA – Anakin Skywalker Jun 16 '20 at 03:13
  • 1
    I wonder if there is some sort of factor problem. Can you [edit] your post with the output of `dput(taxi_2020[1:20,])`? It’s really late, so I’ll take a look tomorrow if someone else can’t help you. – Ian Campbell Jun 16 '20 at 03:27
  • Done, it is too big. I put on google drive. Definitely some formatting problem, I cannot read it in Python too, gives me an error with some characters. Thanks! – Anakin Skywalker Jun 16 '20 at 03:50
  • Now I have this error - Error in gsub(reg$alpha_exact[["A"]], "%A", x, ignore.case = T, perl = T) : input string 1 is invalid UTF-8 – Anakin Skywalker Jun 16 '20 at 04:28
2

tidyverse

dat %>%
  mutate(
    ## option 1
    psx = as.POSIXct(V1, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC"), 
    Date = format(psx, format = "%m/%d/%Y"), 
    Time = format(psx, format = "%I:%M:%S %p"), 
    ## option 2
    Date2 = sub("\\s.*", "", V1), 
    Time2 = sub("^\\S*\\s", "", V1)
  )
#                        V1                 psx       Date        Time      Date2        Time2
# 1  12/01/2019 12:30:00 AM 2019-12-01 12:30:00 12/01/2019 12:30:00 PM 12/01/2019  12:30:00 AM
# 2  12/01/2019 12:31:00 AM 2019-12-01 12:31:00 12/01/2019 12:31:00 PM 12/01/2019  12:31:00 AM
# 3 12/01/2019 12:32:00 AM  2019-12-01 12:32:00 12/01/2019 12:32:00 PM 12/01/2019 12:32:00 AM 

Data

dat <- read.table(header = FALSE, text = "
12/01/2019 12:30:00 AM
12/01/2019 12:31:00 AM
12/01/2019 12:32:00 AM ", sep = "|")
r2evans
  • 141,215
  • 6
  • 77
  • 149
2

Note that keeping the output in the format that you have shown would return columns as characters/factors.

You can split the data on whitespace using tidyr::separate

tidyr::separate(data, `Trip Start Timestamp`, c('Start Date', 'Start Time'), 
                       sep = ' ', extra = 'merge')

#  Start Date  Start Time
#1 12/01/2019 12:30:00 AM
#2 12/01/2019 12:31:00 AM
#3 12/01/2019 12:32:00 AM

Similarly, you can use extract :

tidyr::extract(data, `Trip Start Timestamp`, c('Start Date', 'Start Time'), 
               regex = '(.*?)\\s(.*)')
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • But the first code returns columns as characters will all NAs. Any idea why? – Anakin Skywalker Jun 16 '20 at 03:02
  • 1
    @Rookie Your data must be different then. I am using Ian's data. Can you test it on that data? – Ronak Shah Jun 16 '20 at 03:05
  • Does not work, unfortunately. Error: Column `Time` must be length 4137294 (the number of rows) or one, not 3 – Anakin Skywalker Jun 16 '20 at 03:07
  • 1
    @Rookie In your `dput` I think the relevant part is only `structure(list(\`Trip Start Timestamp\` = c("12/01/2019 12:15:00 AM", "12/01/2019 12:15:00 AM", "12/01/2019 12:15:00 AM"), \`Trip End Timestamp\` = c("12/01/2019 12:30:00 AM", "12/01/2019 12:15:00 AM", "12/01/2019 12:30:00 AM")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))` . Rest all doesn't get picked up ? And for this data my answer works fine for me. – Ronak Shah Jun 16 '20 at 04:05
  • apologize! Update to 20 rows – Anakin Skywalker Jun 16 '20 at 04:26
  • I get this error, using Ian's code now - Error in gsub(reg$alpha_exact[["A"]], "%A", x, ignore.case = T, perl = T) : input string 1 is invalid UTF-8 – Anakin Skywalker Jun 16 '20 at 04:28
  • 1
    @Rookie My answer still works fine on your updated data for me. Does it work for you? – Ronak Shah Jun 16 '20 at 04:30
  • Yes, it did, I finally managed to split the data! But it looks like I have another problem with UTF-8. See the updated question. I will dig into it! Appreciate your help. Not sure why tidyverse solutions failed. – Anakin Skywalker Jun 16 '20 at 04:39
2

We can use base R to split

out <- do.call(rbind.data.frame, strsplit(data[[1]],
        "(?<=[0-9]) (?=[0-9])", perl = TRUE))
names(out) <- c('Start Date', 'Start Time')
out
#  Start Date  Start Time
#1 12/01/2019 12:30:00 AM
#2 12/01/2019 12:31:00 AM
#3 12/01/2019 12:32:00 AM

data

data <- structure(list(`Trip Start Timestamp` = c("12/01/2019 12:30:00 AM", 
"12/01/2019 12:31:00 AM", "12/01/2019 12:32:00 AM")), class = "data.frame", row.names = c(NA, 
-3L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • But it gives me Error in data[[1]] : object of type 'closure' is not subsettable – Anakin Skywalker Jun 16 '20 at 03:04
  • 1
    @Rookie Can you check the `str(yourdata)`. I assume it is a `data.frame` or `tibble` as it should work with those structures – akrun Jun 16 '20 at 18:29
  • tibble [4,137,294 x 21] (S3: tbl_df/tbl/data.frame) $ Start Date : chr [1:4137294] "12/01/2019" "12/01/2019" "12/01/2019" "12/01/2019" ... $ Start Time : chr [1:4137294] "12:15:00 AM" "12:15:00 AM" "12:15:00 AM" "12:15:00 AM" ... $ End Date : chr [1:4137294] "12/01/2019" "12/01/2019" "12/01/2019" "12/01/2019" ... $ End Time : chr [1:4137294] "12:30:00 AM" "12:15:00 AM" "12:30:00 AM" "12:30:00 AM" ... – Anakin Skywalker Jun 16 '20 at 18:38
  • @Rookie Are you splitting on "Start Time"? In that case use `strsplit(data[["Start Time"]], ...` – akrun Jun 16 '20 at 18:41
  • @Rookie Based on the image, The "Trip End Timestamp" is 3rd column. So, you may need `strsplit(taxi_2020_test[["Trip End Timestamp"]], ...` – akrun Jun 16 '20 at 18:42
  • Sorry, did not get what you mean, I separated into Start Date and Start Time already as Ronak suggested above – Anakin Skywalker Jun 16 '20 at 18:43
  • @Rookie I meant that the `strsplit(data[[1]]` is based on the assumption that it is the first column, if it is not, then change it to the index of the column or the actual column name – akrun Jun 16 '20 at 18:44