I have tried the following methods beforehand, without success:

Changing date format in R

t1$date <- dmy(t1$date_admission)

I have been trying to calculate the time difference between two date columns. R does not seem to recognize the Y-m-d format in one of them and returns wrong values, as follows:

> print(df$data_int_uti)
 [1] "2020-06-07" "2020-09-07" "2020-02-08" "2020-08-15" "2020-08-15" "2020-08-18" "2020-08-25" "2020-08-29" "2020-06-30"
[10] "2020-05-07" "2020-07-15" "2020-08-14" "2020-01-09" "2020-09-09" "2020-12-09" "2020-02-07" "2020-09-07" "2020-02-08"
[19] "2020-08-15" "2020-02-09" "2020-06-07" "2020-06-07" "2020-07-29" "2020-08-16" "2020-08-21" "2020-08-22" "2020-01-07"
[28] "2020-04-07" "2020-02-07" "2020-01-09" "2020-06-07" "2020-09-08" "2020-10-08" "2020-08-14" "2020-08-27" "2020-08-30"
[37] "2020-07-16" "2020-07-23" "2020-09-14" "2020-01-07" "2020-04-07" "2020-07-07" "2020-07-07" "2020-10-07" "2020-07-25"
[46] "2020-03-08" "2020-08-31" "2020-02-07" "2020-06-07" "2020-08-13" "2020-08-24" "2020-01-07" "2020-07-18" "2020-09-15"
[55] "2020-01-07" "2020-07-07" "2020-07-17" "2020-07-27" "2020-08-14" "2020-10-09" "2020-09-14" "2020-04-08" "2020-01-07"
[64] "2020-01-07" "2020-12-07" "2020-07-27" "2020-04-08" "2020-08-16" "2020-02-07" "2020-07-07" "2020-07-20" "2020-08-19"
[73] "2020-03-09" "2020-05-09"

> print(df$data_inicio_sint)
 [1] "2020-06-27" NA           "2020-07-29" NA           "2020-07-31" "2020-08-19" "2020-08-22" "2020-08-18" "2020-06-29"
[10] "2020-06-25" "2020-07-14" "2020-05-09" "2020-01-10" "2020-08-31" "2020-08-30" "2020-06-28" "2020-09-08" "2020-07-23"
[19] "2020-12-09" "2020-08-22" "2020-04-08" "2020-06-25" "2020-07-20" "2020-08-16" "2020-12-09" "2020-08-23" "2020-06-30"
[28] "2020-06-26" "2020-03-31" "2020-08-23" "2020-06-21" "2020-07-29" "2020-07-29" "2020-08-01" "2020-08-19" "2020-08-14"
[37] "2020-06-30" "2020-07-22" "2020-09-10" "2020-07-01" "2020-02-08" "2020-06-08" "2020-06-23" "2020-06-27" "2020-07-17"
[46] "2020-07-29" "2020-08-31" "2020-06-20" "2020-03-08" "2020-02-09" "2020-08-24" "2020-01-08" "2020-06-08" "2020-10-10"
[55] "2020-06-23" "2020-05-08" "2020-10-08" "2020-07-24" "2020-07-09" "2020-08-29" "2020-10-10" "2020-02-09" "2020-06-23"
[64] "2020-06-22" "2020-08-08" "2020-07-21" "2020-07-28" "2020-05-09" "2020-06-19" "2020-07-08" "2020-07-14" "2020-10-09"
[73] "2020-01-10" "2020-12-09"

> diff(df$data_int_uti - df$data_inicio_sint)
Time differences in days
 [1]   NA   NA   NA   NA  -16    4    8  -10  -50   50   96  -98   10   92 -243  141 -165   50  -79  255  -78   27   -9
[24] -110  109 -174   95   27 -174  213   55   30  -58   -5    8    0  -15    3 -180  235  -30  -15   88  -94 -151  143
[47] -134  225   95 -186   -1   41  -65 -143  228 -143   86   33    5  -67   85 -227    1  288 -115 -117  210 -232  132
[70]    7  -57  110 -273

Expected outcome: the time interval, in days, between the date of symptom onset and the date of hospital admission, e.g.

(2020-06-27) - (2020-06-07) = 20 days

So the output would look like [1] 20 and so on.

Any light would be greatly appreciated.

Here's the dput:

dput(t1)
structure(list(data_int_uti = structure(c(18420, 18512, 18300, 18489, 18489, 18492, 18499, 18503, 18443, 18389, 18458, 18488, 18270, 18514, 18605, 18299, 18512, 18300, 18489, 18301, 18420, 18420, 18472, 18490, 18495, 18496, 18268, 18359, 18299, 18270, 18420, 18513, 18543, 18488, 18501, 18504, 18459, 18466, 18519, 18268, 18359, 18450, 18450, 18542, 18468, 18329, 18505, 18299, 18420, 18487, 18498, 18268, 18461, 18520, 18268, 18450, 18460, 18470, 18488, 18544, 18519, 18360, 18268, 18268, 18603, 18470, 18360, 18490, 18299, 18450, 18463, 18493, 18330, 18391), class = "Date"), data_inicio_sint = structure(c(18440, NA, 18472, NA, 18474, 18493, 18496, 18492, 18442, 18438, 18457, 18391, 18271, 18505, 18504, 18441, 18513, 18466, 18605, 18496, 18360, 18438, 18463, 18490, 18605, 18497, 18443, 18439, 18352, 18497, 18434, 18472, 18472, 18475, 18493, 18488, 18443, 18465, 18515, 18444, 18300, 18421, 18436, 18440, 18460, 18472, 18505, 18433, 18329, 18301, 18498, 18269, 18421, 18545, 18436, 18390, 18543, 18467, 18452, 18503, 18545, 18301, 18436, 18435, 18482, 18464, 18471, 18391, 18432, 18451, 18457, 18544, 18271, 18605), class = "Date")), row.names = c(NA, -74L), class = c("tbl_df", "tbl", "data.frame"))

dairelix

2 Answers

diff is the wrong function for calculating the difference between two date columns. You can subtract the dates directly.

t1$date_admission - t1$date_symptoms
#Time differences in days
# [1]  -20   NA -172   NA   15   -1    3   11    1  -49    1   97   -1    9  101
#[16] -142   -1 -166 -116 -195   60  -18    9    0 -110   -1 -175  -80  -53 -227
#[31]  -14   41   71   13    8   16   16    1    4 -176   59   29   14  102    8
#[46] -143    0 -134   91  186    0   -1   40  -25 -168   60  -83    3   36   41
#[61]  -26   59 -168 -167  121    6 -111   99 -133   -1    6  -51   59 -214

You may have meant to use difftime:

difftime(t1$date_admission, t1$date_symptoms, units = "days")
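As a minimal sketch of how the argument order and the units argument affect the result, here is difftime applied to the single pair of dates from the question's example. The variable names admission and symptoms are illustrative only and do not appear in the original data.

```r
# Illustrative values taken from the first row of the question's data
admission <- as.Date("2020-06-07")
symptoms  <- as.Date("2020-06-27")

# difftime(time1, time2) computes time1 - time2, so the order sets the sign
admission_minus_symptoms <- difftime(admission, symptoms, units = "days")
symptoms_minus_admission <- difftime(symptoms, admission, units = "days")

# as.numeric() drops the "difftime" class and leaves a plain number
as.numeric(admission_minus_symptoms)  # -20
as.numeric(symptoms_minus_admission)  # 20

# The same interval expressed in weeks (20 / 7)
in_weeks <- as.numeric(difftime(symptoms, admission, units = "weeks"))
```

If only positive gaps are wanted regardless of order, wrapping the numeric result in abs() is one option.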

The diff function subtracts consecutive values. For example:

diff(c(5, 9, 4, 5))
#[1]  4 -5  1

where the calculation is (9 - 5 = 4), (4 - 9 = -5) and (5 - 4 = 1). In your case you are first subtracting the dates and then applying diff to the result, which gives the difference between consecutive elements of that result.
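To make the contrast concrete, the sketch below uses the first seven rows of the dput data from the question, written out as date strings. The vector names admission and symptoms are placeholders for data_int_uti and data_inicio_sint.

```r
# First seven rows of the question's data (data_int_uti, data_inicio_sint)
admission <- as.Date(c("2020-06-07", "2020-09-07", "2020-02-08", "2020-08-15",
                       "2020-08-15", "2020-08-18", "2020-08-25"))
symptoms  <- as.Date(c("2020-06-27", NA, "2020-07-29", NA,
                       "2020-07-31", "2020-08-19", "2020-08-22"))

# Element-wise subtraction: one value per row, NA where a date is missing
elementwise <- as.numeric(admission - symptoms)
# -20   NA -172   NA   15   -1    3

# diff() on that result: differences between consecutive rows, one fewer value
consecutive <- as.numeric(diff(admission - symptoms))
# NA NA NA NA -16   4
```

The second vector starts with the same NA NA NA NA -16 4 pattern as the output shown in the question, which is consistent with diff having been applied on top of the subtraction.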

Ronak Shah
  • Hey. I have tried using `difftime` and `t1$date_admission - t1$date_symptoms`, but neither worked. Both return exactly the same values as the ones in the post (-20, NA, -172 and so on) – dairelix Nov 03 '20 at 12:59
  • @dairelix Can you update your post with the expected output? – Ronak Shah Nov 03 '20 at 13:00
  • The `dput` which you have shared is different from the data that you have shown. `t1$date_admission[1]` is `"2020-06-07"` in your data whereas you have shown it as `(2020-08-07)` – Ronak Shah Nov 03 '20 at 13:25
  • My bad. Can you please check once again? I believe I have mistakenly used `select()`. – dairelix Nov 03 '20 at 13:32
  • Can you check your data again as well? `t1$data_int_uti[1]` is still `"2020-06-07"`. Nothing has changed in the data. – Ronak Shah Nov 03 '20 at 14:17
  • I finally found my mistake. I'm so sorry, there are multiple similar columns and I've filtered the wrong ones. – dairelix Nov 03 '20 at 14:48
  • @dairelix For me `t1$data_inicio_sint - t1$data_int_uti` gives `20 NA 172 NA -15 ....` Isn't that what you want? I get the same numbers with `difftime`. – Ronak Shah Nov 03 '20 at 14:50
  • I see, I was using the wrong column. That's such a stupid mistake of mine. I'm sorry. Thank you very much. – dairelix Nov 03 '20 at 15:02

One solution is to use dplyr and convert the columns to Date:

library(dplyr)
# example data
t1 <- data.frame(date_admission = c("2020-08-07","2020-07-31","2020-02-08","2020-08-15","2020-08-17","2020-08-24","2020-08-27","2020-10-09","2020-01-07"),
             date_symptoms = c( "2020-06-27", NA           ,"2020-07-29", NA          , "2020-07-31", "2020-08-19", "2020-08-22", "2020-08-18", "2020-06-29"))

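# calculation (convert all columns to Date and subtract according to your example)
# across() requires dplyr >= 1.0.0; it replaces the superseded mutate_all()
t1 %>% 
   dplyr::mutate(dplyr::across(dplyr::everything(), as.Date)) %>% 
   dplyr::mutate(DIF = date_admission - date_symptoms)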
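For comparison, the same calculation can be sketched in base R without dplyr. This uses the first three rows of the example data above; dif_days is an extra illustrative variable that is not part of the original answer.

```r
# First three rows of the example data above, stored as character strings
t1 <- data.frame(
  date_admission = c("2020-08-07", "2020-07-31", "2020-02-08"),
  date_symptoms  = c("2020-06-27", NA, "2020-07-29")
)

# Convert each column to Date; NA entries stay NA
t1$date_admission <- as.Date(t1$date_admission)
t1$date_symptoms  <- as.Date(t1$date_symptoms)

# Row-wise difference in days, stored as a difftime column
t1$DIF <- t1$date_admission - t1$date_symptoms

# Plain numeric version of the same differences
dif_days <- as.numeric(t1$DIF)
# 41   NA -172
```

Keeping DIF as a difftime preserves the unit label when printing, while the numeric version is easier to feed into summaries such as mean(dif_days, na.rm = TRUE).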
DPH