0

I'm using R to extract data from multiple pdf files. After extracting this data, I need to format a field of type date. Dates came from pdf files in various formats


data = c( "1/10/2022 2:36:00 pm",
  "1/11/2022 12:47:00 pm",
  "1/12/2022 9:47:00 am",
  "1/13/2022 9:21:00 am",
  "1/14/2022 12:59:00 pm",
  "1/10/2022 2:39:00 pm",
  "1/11/2022 12:46:00 pm",
  "1/12/2022 9:48:00 am",
  "1/13/2022 9:22:00 am",
  "1/14/2022 1:00:00 pm",
  "1/10/2022 2:40:00 pm",
  "1/11/2022 12:45:00 pm",
  "1/12/2022 9:49:00 am",
  "1/13/2022 9:23:00 am",
  "1/14/2022 1:01:00 pm",
  "1/10/2022 2:42:00 pm",
  "1/11/2022 12:44:00 pm",
  "1/12/2022 9:50:00 am",
  "1/13/2022 9:24:00 am",
  "1/14/2022 1:02:00 pm",
  "1/10/2022 2:44:00 pm",
  "1/11/2022 12:43:00 pm",
  "1/12/2022 9:51:00 am",
  "1/13/2022 9:25:00 am",
  "1/14/2022 1:03:00 pm",
  "10/01/2022 14:36:00",
  "11/01/2022 12:47:00",
  "12/01/2022 09:47:00",
  "13/01/2022 09:21:00",
  "14/01/2022 12:59:00",
  "10/01/2022 14:39:00",
  "11/01/2022 12:46:00",
  "12/01/2022 09:48:00",
  "13/01/2022 09:22:00",
  "14/01/2022 13:00:00",
  "10/01/2022 14:40:00",
  "11/01/2022 12:45:00",
  "12/01/2022 09:49:00",
  "13/01/2022 09:23:00",
  "14/01/2022 13:01:00",
  "10/01/2022 14:42:00")
df <- data.frame(data)

I tried as follows


date_parser <- function(d) {
  if (endsWith(d, "m")) {
    as.POSIXct(d, format = "%m/%d/%Y %I:%M %p")
  } else {
    as.POSIXct(d, format = "%d/%m/%Y %H:%M")
  }
}

df = df%>%
  rowwise %>%
  mutate(data = date_parser(data)) 

I get this error message

Error in mutate:In argument: data_hora_leitura_coc08 = date_parser(data_hora_leitura_coc08).In row 1732.Caused by error in `if (endsWith(d, "m")) missing value where TRUE/FALSE needed

Also, some dates come out as "NA" How to solve it?

3 Answers3

0

Your sample data has seconds, so you need to add %S to each of your formats.

date_parser <- function(d) {
  if (endsWith(d, "m")) {
    as.POSIXct(d, format = "%m/%d/%Y %I:%M:%S %p")
  } else {
    as.POSIXct(d, format = "%d/%m/%Y %H:%M:%S")
  }
}
df %>%
  rowwise() %>%
  mutate(data = date_parser(data))
# # A tibble: 41 × 1
# # Rowwise: 
#    data               
#    <dttm>             
#  1 2022-01-10 14:36:00
#  2 2022-01-11 12:47:00
#  3 2022-01-12 09:47:00
#  4 2022-01-13 09:21:00
#  5 2022-01-14 12:59:00
#  6 2022-01-10 14:39:00
#  7 2022-01-11 12:46:00
#  8 2022-01-12 09:48:00
#  9 2022-01-13 09:22:00
# 10 2022-01-14 13:00:00
# # ℹ 31 more rows
# # ℹ Use `print(n = ...)` to see more rows
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Some dates are still coming out in NA format. How do I share all date records with you? There are more than 6,000 lines – Francisco Ribeiro Aug 03 '23 at 10:56
  • Find the ones that are `NA` in the output and share some samples of the originals of those. (You may want to do `df %>% rowwise() %>% mutate(data2 = date_parser(data)) %>% filter(is.na(data2))` and take some samples from that. You may also want to look at https://stackoverflow.com/a/52319606/3358272 (dates), https://stackoverflow.com/a/70304571/3358272 (POSIXct) for ways to handle multiple "candidate formats" of your timestamps. – r2evans Aug 03 '23 at 14:00
0

lubridate does exactly what you want. Try

df %>% 
   mutate(parsed_data = if_else(endsWith(data, "m"), mdy_hms(data), dmy_hms(data)))
konvas
  • 14,126
  • 2
  • 40
  • 46
0

With parse_date_time

library(lubridate)

df %>% mutate(new = parse_date_time(data, 
  orders=c("mdY IMS p", "dmY HMS", "mdY IM p", "dmY HM")))
                    data                 new
1   1/10/2022 2:36:00 pm 2022-01-10 14:36:00
2  1/11/2022 12:47:00 pm 2022-01-11 12:47:00
3   1/12/2022 9:47:00 am 2022-01-12 09:47:00
4   1/13/2022 9:21:00 am 2022-01-13 09:21:00
5  1/14/2022 12:59:00 pm 2022-01-14 12:59:00
6    10/01/2022 14:36:00 2022-01-10 14:36:00
7    11/01/2022 12:47:00 2022-11-01 12:47:00
8    12/01/2022 09:47:00 2022-12-01 09:47:00
9    13/01/2022 09:21:00 2022-01-13 09:21:00
10   14/01/2022 12:59:00 2022-01-14 12:59:00
11    12/2/2022 10:16 am 2022-12-02 10:16:00
12   12/11/2022 12:28 pm 2022-12-11 12:28:00
13      07/12/2022 10:49 2022-07-12 10:49:00
14      2/1/2023 5:10 pm 2023-02-01 17:10:00
15      28/01/2023 10:39 2023-01-28 10:39:00
16      2/1/2023 9:58 am 2023-02-01 09:58:00
17      06/12/2022 09:20 2022-06-12 09:20:00
18      07/12/2022 10:08 2022-07-12 10:08:00
19      08/12/2022 07:48 2022-08-12 07:48:00

Data

df <- structure(list(data = c("1/10/2022 2:36:00 pm", "1/11/2022 12:47:00 pm", 
"1/12/2022 9:47:00 am", "1/13/2022 9:21:00 am", "1/14/2022 12:59:00 pm", 
"10/01/2022 14:36:00", "11/01/2022 12:47:00", "12/01/2022 09:47:00", 
"13/01/2022 09:21:00", "14/01/2022 12:59:00", "12/2/2022 10:16 am", 
"12/11/2022 12:28 pm", "07/12/2022 10:49", "2/1/2023 5:10 pm", 
"28/01/2023 10:39", "2/1/2023 9:58 am", "06/12/2022 09:20", "07/12/2022 10:08", 
"08/12/2022 07:48")), class = "data.frame", row.names = c(NA, 
-19L))
Andre Wildberg
  • 12,344
  • 3
  • 12
  • 29
  • Warning message: There was 1 warning in `mutate()`. ℹ In argument: `new = parse_date_time(...)`. Caused by warning: ! 2979 failed to parse. – Francisco Ribeiro Aug 03 '23 at 11:00
  • @FranciscoRibeiro ... means you have dates other than the ones you showed us. Try to find rows that deviate from the pattern. May even be that you have rows that can't be interpreted as a proper date. – Andre Wildberg Aug 03 '23 at 11:16
  • Some dates are still coming out in NA format. How do I share all date records with you? There are more than 6,000 lines – Francisco Ribeiro Aug 03 '23 at 11:27
  • @FranciscoRibeiro You can try to search for differing patterns, e.g. `df$data[!grepl(".*/.*/.* .*:.*:", df$data)]` and post those with `dput(head(df$data[!grepl(".*/.*/.* .*:.*:", df$data)], 10))` – Andre Wildberg Aug 03 '23 at 11:37
  • "12/2/2022 10:16 am", "12/11/2022 12:28 pm", "07/12/2022 10:49", "2/1/2023 5:10 pm", "28/01/2023 10:39", "2/1/2023 9:58 am" , "06/12/2022 09:20", "07/12/2022 10:08", "08/12/2022 07:48 – Francisco Ribeiro Aug 03 '23 at 11:50
  • @FranciscoRibeiro These are without seconds, should work now with edit. – Andre Wildberg Aug 03 '23 at 12:10
  • Separated in a dataframe, the times that contain am | pm. 10/1/2022 2:36:00 pm, 11/1/2022 12:47:00 pm, 12/1/2022 9:47:00 am This am|pm pattern in some cases I get NA return – Francisco Ribeiro Aug 03 '23 at 12:20
  • @FranciscoRibeiro These am/pm don't differ from the ones already present and should not pose a problem. – Andre Wildberg Aug 03 '23 at 12:48