
Good evening everyone! I want to compare dates in R. I have two datasets, maint.csv and failures.csv:

> str(maint)
'data.frame':   3286 obs. of  3 variables:
 $ datetime : POSIXct, format: "2014-06-01 06:00:00" "2014-07-16 06:00:00" "2014-07-31 06:00:00" ...
 $ machineID: int  1 1 1 1 1 1 1 1 1 1 ...
 $ comp     : Factor w/ 4 levels "comp1","comp2",..: 2 4 3 1 4 1 3 1 4 3 ...

and

> str(failures)
'data.frame':   761 obs. of  3 variables:
 $ datetime : POSIXct, format: "2015-01-05 06:00:00" "2015-03-06 06:00:00" "2015-04-20 06:00:00" ...
 $ machineID: int  1 1 1 1 1 1 1 2 2 2 ...
 $ failure  : Factor w/ 4 levels "comp1","comp2",..: 4 1 2 4 4 2 4 1 2 2 ...

As you can see, both datetime columns are in POSIXct format. The 761 rows of failures.csv are more or less a subset of the 3286 rows of maint.csv: nearly all of the failure observations also appear in the maint file, but some rows are not present in maint.csv. I want to build a for loop that prints only the rows that are present in failures but not in maint. How can I do this? I have never used for or if-else in R, and in particular I don't know how to compare dates. Thank you.

CasellaJr
  • (1) A `for` loop is not what you need (in R, it rarely is). (2) I think what you're talking about can be resolved with `dplyr::anti_join` or similar functionality. Can't really know without a better view of your data; can you add the output of `dput(x)`, where `x` is a representative sample of each frame? It's important that these samples: (a) do not have 100s/1000s of rows; (b) have some rows in common; and (c) have some rows not in common, so that we can properly test the corner cases of the comparison. – r2evans Oct 30 '20 at 21:52
  • can you share your data? you can do it using `dput(maint)` and `dput(failures)`, and copy and paste it on your question post. – rodolfoksveiga Oct 30 '20 at 22:20
  • Try : `dplyr::anti_join(failures, maint)` to get rows present in `failures` but not in `maint`. – Ronak Shah Oct 31 '20 at 04:25
  • @RonakShah thank you, your solution is good too! And it is also very simple :) – CasellaJr Oct 31 '20 at 08:45
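
For readers following the `anti_join` suggestion in the comments, here is a minimal sketch of how it could look. It is not taken from the thread: the toy rows are made up, the component columns are plain character vectors rather than factors, and the explicit `by` mapping of `failure` to `comp` is an assumption about which columns should count as "the same row". Without `by`, `anti_join(failures, maint)` matches only on the columns the two frames share by name, which here would be `datetime` and `machineID`.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from the real CSV files)
maint <- data.frame(
  datetime  = as.POSIXct(c("2015-01-05 06:00:00", "2015-03-06 06:00:00"), tz = "UTC"),
  machineID = c(1L, 1L),
  comp      = c("comp4", "comp1")
)
failures <- data.frame(
  datetime  = as.POSIXct(c("2015-01-05 06:00:00", "2015-03-06 06:00:00",
                           "2015-04-20 06:00:00"), tz = "UTC"),
  machineID = c(1L, 1L, 1L),
  failure   = c("comp4", "comp2", "comp2")
)

# Keep the rows of failures that have no match in maint,
# comparing datetime, machineID and the component (failure vs. comp)
only_in_failures <- anti_join(
  failures, maint,
  by = c("datetime", "machineID", "failure" = "comp")
)
print(only_in_failures)
```

In this toy example the first failure row has an exact counterpart in `maint` and is dropped; the other two rows are kept.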

1 Answer


I could find the datasets you're working with on the web...

I think you can achieve this simply by using %in%, as follows:

# load dplyr, which provides filter()
library(dplyr)
# keep only the failures whose datetime does not appear anywhere in maint$datetime
filter_failures <- filter(failures, !datetime %in% maint$datetime)
# print the first rows of the result
head(filter_failures)

Here is the output:

             datetime machineID failure
1 2015-01-02 03:00:00        16   comp1
2 2015-01-02 03:00:00        16   comp3
3 2015-01-02 03:00:00        17   comp4
4 2015-01-02 03:00:00        22   comp1
5 2015-01-02 03:00:00        35   comp1
6 2015-01-02 03:00:00        45   comp1
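
One caveat worth illustrating: the `%in%` filter compares timestamps only, so a failure is treated as "present in maint" whenever any machine had maintenance at that same datetime. If the machine and component should match as well, one possible approach is to compare a combined key. The sketch below uses base R only; `make_key` is a hypothetical helper and the toy rows are made up, not taken from the real files.

```r
# Toy data (hypothetical values)
maint <- data.frame(
  datetime  = as.POSIXct(c("2015-01-02 03:00:00", "2015-01-05 06:00:00"), tz = "UTC"),
  machineID = c(16L, 1L),
  comp      = c("comp1", "comp4")
)
failures <- data.frame(
  datetime  = as.POSIXct(c("2015-01-02 03:00:00", "2015-01-02 03:00:00",
                           "2015-04-20 06:00:00"), tz = "UTC"),
  machineID = c(16L, 17L, 1L),
  failure   = c("comp1", "comp4", "comp2")
)

# Hypothetical helper: build one string key per row from the three columns
make_key <- function(datetime, machineID, comp) {
  paste(format(datetime, "%Y-%m-%d %H:%M:%S", tz = "UTC"), machineID, comp, sep = "|")
}

# Timestamp-only comparison, as in the answer above
by_time <- failures[!failures$datetime %in% maint$datetime, ]

# Comparison on datetime + machineID + component
maint_keys   <- make_key(maint$datetime, maint$machineID, maint$comp)
failure_keys <- make_key(failures$datetime, failures$machineID, failures$failure)
by_key <- failures[!failure_keys %in% maint_keys, ]

print(by_time)
print(by_key)
```

Here the failure on machine 17 shares its timestamp with a maintenance record for machine 16, so the timestamp-only filter drops it while the key-based filter keeps it. Which behaviour is wanted depends on what "the same row" means for your data.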

Let me know if this is what you're looking for.

rodolfoksveiga
  • Thanks, your solution is good, but I wanted the opposite result, so I used: `filter_failures = filter(failures, !datetime %in% maint$datetime)` – CasellaJr Oct 31 '20 at 08:43
  • I'm sorry, I got it wrong. I just updated my post with the right answer. Good luck! – rodolfoksveiga Oct 31 '20 at 18:24