My R is rusty and I am struggling to find the answer to this rather simple issue. I wish to create a new column based on whether a date entry in the date column is present in another vector.

To illustrate the issue, I count the number of matching rows as follows (this approach works):

sum(as.numeric(block$date == "2019-10-11  06:30:00"))

and it correctly gives me 1.

Should I do this however:

sum(as.numeric(block$date %in% c("2019-10-11  06:30:00")))

I get 0, which is a problem since I need to check against more than one date-time value.

Sample of data frame as follows:

                  date Efficiency    PowAC    PowDC  TempCPU TempIGBT failures
1: 2019-10-11 06:30:00   97.77433 488.0686 593.1467 32.04367 49.16300        0
2: 2019-03-18 15:25:00   97.79300 485.2857 590.2600 32.29633 50.02533        0
3: 2019-03-18 15:30:00   97.78000 484.7714 589.6767 32.02700 49.22233        0
4: 2019-03-18 15:35:00   97.78233 482.2714 586.6633 32.26733 49.56700        0
5: 2019-03-18 15:40:00   97.75700 480.3343 585.2167 32.02000 49.18667        0
6: 2019-03-18 15:45:00   97.80400 477.5114 580.5467 32.21833 49.30067        0
7: 2019-03-18 15:50:00   97.79633 474.8886 578.0433 32.02833 48.86067        0
8: 2019-03-18 15:55:00   97.79400 477.2629 581.0667 32.29933 49.45333        0

and dput(head(block, 10)) as follows:

library(data.table)
block <- setDT(structure(list(date = structure(c(1546300800, 1546301100, 1546301400, 
1546301700, 1546302000, 1546302300, 1546302600, 1546302900, 1546303200, 
1546303500), class = c("POSIXct", "POSIXt")), Efficiency = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0), PowAC = c(NaN, NaN, NaN, NaN, NaN, 
NaN, NaN, NaN, NaN, NaN), PowDC = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 
0), TempCPU = c(NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, 
NaN), TempIGBT = c(NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, 
NaN), failures = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 
-10L), class = c("data.table", "data.frame"), sorted = "date"))

The vector I am testing against is as follows:

dput(failures)
c("2019-10-11 06:30:00", "2019-10-12 06:30:00", "2019-10-12 17:45:00", 
"2019-10-13 06:30:00")

Paul
  • Can you show the `dput(block)` so that it can be tested – akrun Jul 19 '21 at 18:02
  • @Paul the reason we are asking for `dput` is to get the structure of the data, i.e. we don't know whether your column class is POSIXct or simply character etc. – akrun Jul 19 '21 at 18:10
  • Paul, the request for `dput(.)` is specific and for a good reason: there are a lot of easy hacks for us to use data pasted into a code block on SO, but most of them fail when there are embedded spaces (as there are in your `date` column) or the class is ambiguous (same column). Please copy the output from `dput(head(x,10))` and enter it into that (or a new) code block. Thanks. – r2evans Jul 19 '21 at 18:10
  • Further, your sample code has two mid-spaces whereas your data only has one, which is confusing. I'm guessing that your first code sample was just typed in (and not copied from working code), but ... that is actually part of a bigger problem, a question that is not reproducible. Please help us help you! – r2evans Jul 19 '21 at 18:12
  • FYI, today `"2021-07-19" == Sys.Date()` is true and `"2021-07-19" %in% c(Sys.Date())` (and `.. %in% Sys.Date()`) is false. We just need to know more about your data to advise how to get it done best. – r2evans Jul 19 '21 at 18:13
  • What is your other vector? Can you `dput(.)` that as well? (And please make sure it contains at least one and not all of the `date` values in this sample data.) Thanks! – r2evans Jul 19 '21 at 18:15
  • FYI, `DT$date %in% "2018-12-31 19:30:00"` returns all false, whereas `DT$date %in% as.POSIXct("2018-12-31 19:30:00")` returns a true. Are you certain that the vector you're using for set-membership is `POSIXt`-class? – r2evans Jul 19 '21 at 18:21
  • And there it is ... your `failures` vector is `character` and not `POSIXt`. Try `DT$date %in% as.POSIXct(failures)`. I can't know for sure, because your vector has nothing in the sample data, but my guess is that you are inadvertently mixing classes in your comparisons. `==` seems to be a little more permissive with that, `%in%` less-so. – r2evans Jul 19 '21 at 18:23
  • Perfect. Thanks r2evans. Problem solved. – Paul Jul 19 '21 at 18:25
  • (Now do you see the value of `dput(.)`? :-) – r2evans Jul 19 '21 at 18:31

1 Answer

Your classes must match up.

I'll start by assigning a slightly-modified failures to include a relevant date

failures <- c("2018-12-31 19:30:00", "2019-10-12 06:30:00", "2019-10-12 17:45:00", "2019-10-13 06:30:00")

(though it is still character), and use your block from the structure(.) output.

block$date %in% failures
#  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
block$date %in% as.POSIXct(failures)
#  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
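To tie this back to the original goal (a new column flagging membership), here is a minimal sketch in base R. The three-row `block`, the column name `is_failure`, and the choice of `tz = "UTC"` are assumptions for illustration only; use whatever zone your timestamps were actually recorded in.

```r
# Hypothetical mini data set; dates are parsed explicitly as UTC (an assumption).
block <- data.frame(
  date  = as.POSIXct(c("2019-10-11 06:30:00",
                       "2019-03-18 15:25:00",
                       "2019-10-12 17:45:00"), tz = "UTC"),
  PowAC = c(488.0686, 485.2857, 484.7714)
)

failures <- c("2019-10-11 06:30:00", "2019-10-12 06:30:00",
              "2019-10-12 17:45:00", "2019-10-13 06:30:00")

# Convert the character vector once, in the same zone as block$date,
# so both sides of %in% are POSIXct.
failure_times <- as.POSIXct(failures, tz = "UTC")

# New logical column: TRUE where the row's date-time is one of the failures.
block$is_failure <- block$date %in% failure_times

# Count of matching rows (logicals sum as 0/1, so as.numeric() is not needed).
n_failures <- sum(block$is_failure)

# With a data.table, the equivalent assignment by reference would be:
#   block[, is_failure := date %in% failure_times]
```

Passing `tz` explicitly avoids depending on the session's local time zone when the strings are parsed.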

There could be other problems with strict equality and set-membership:

  • Time zones: even if the displayed times are the same, if the zones are different then the times are different. One can have two times of different zones be equal (despite the representation of it on the console looking different), but I don't think that's what you have here.

  • POSIXt and Date are actually numeric underneath, which means that they are floating-point. R tends to be "good enough" to determine equality when the times and/or dates are nearly integral, but even floating-point equality can be a problem, and hard to find since it sometimes works, sometimes doesn't. A common comment I add to answers when I see this as the culprit is this:

    Computers have limitations when it comes to floating-point numbers (aka double, numeric, float). This is a fundamental limitation of computers in general, in how they deal with non-integer numbers. This is not specific to any one programming language. There are some add-on libraries or packages that are much better at arbitrary-precision math, but I believe most main-stream languages (this is relative/subjective, I admit) do not use these by default. Refs: Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754

    While it does not appear to be the case here, it could be. Keep it in mind :-)
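Both caveats can be reproduced in a few lines. This is only a sketch under stated assumptions: the zone names, the 1e-4 second offset, and the variable names are made up for illustration and are not taken from the question's data.

```r
# --- Time zones -------------------------------------------------------------
# Same instant shown in two zones: equal, even though they print differently.
t_utc <- as.POSIXct("2019-10-11 06:30:00", tz = "UTC")
t_ny  <- as.POSIXct("2019-10-11 02:30:00", tz = "America/New_York")
same_instant <- t_utc == t_ny

# Same clock text parsed in two zones: different instants.
clock_text <- "2019-10-11 06:30:00"
differs <- as.POSIXct(clock_text, tz = "UTC") !=
  as.POSIXct(clock_text, tz = "America/New_York")

# --- Floating point ---------------------------------------------------------
# A tiny sub-second offset is invisible in the default printout
# but breaks strict equality and set-membership.
t_a <- as.POSIXct("2019-10-11 06:30:00", tz = "UTC")
t_b <- t_a + 1e-4                      # hypothetical measurement jitter

prints_alike   <- format(t_a) == format(t_b)
strictly_equal <- t_a == t_b
is_member      <- t_a %in% t_b

# One possible workaround: compare whole seconds.
matches_after_rounding <- round(as.numeric(t_a)) %in% round(as.numeric(t_b))
```

Rounding to whole seconds is just one option; comparing the absolute difference against a small tolerance would work as well.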

r2evans