screenshot from R

I have this set of school start and end dates imported into R from Excel, and I'm having trouble removing duplicates. It isn't as straightforward as some of the other posts on here about the topic.

Essentially, if a school district in the left column has the same start date and end date for every one of its entries, I only need to show one entry. For example, "Dewitt School District" has 5 entries, all of which have a start date of 08/19/2009 and an end date of 6/1/2010, so I need it to show only 1 entry.

I'm not sure if this can even be done in R, but my supervisor said it can be done in STATA.

neilfws
  • Does `unique(INPUT)` work? It returns `data.frame` with duplicate entries removed. – pogibas Apr 17 '18 at 03:42
  • I believe unique() will simply remove all duplicates. At least that's what I've seen in other posts. I need it to remove the duplicates only if the dates all match (for that given district), so I believe there is some additional script to write. In this screenshot the dates do in fact all match, but as you can see there are over 1000 entries and I can't just assume they match throughout the whole set. – Ben Button Apr 17 '18 at 04:05
  • Have you tried it? (you can also try `distinct` from the `tidyverse` suite of packages) - http://dplyr.tidyverse.org/reference/distinct.html – Melissa Key Apr 17 '18 at 04:10

1 Answer

Building on @Melissa Key's comment.

Create a dataset in which each of three distinct rows appears three times:

df <- data.frame(school = rep(c("dewitt", "stuttgart", "crossett"), 3),
                 firstday = rep(c("8/19/2009", "8/12/2009", "8/16/2009"), 3),
                 lastday = rep(c("8/19/2010", "8/12/2010", "8/16/2010"), 3))

df
     school  firstday   lastday
1    dewitt 8/19/2009 8/19/2010
2 stuttgart 8/12/2009 8/12/2010
3  crossett 8/16/2009 8/16/2010
4    dewitt 8/19/2009 8/19/2010
5 stuttgart 8/12/2009 8/12/2010
6  crossett 8/16/2009 8/16/2010
7    dewitt 8/19/2009 8/19/2010
8 stuttgart 8/12/2009 8/12/2010
9  crossett 8/16/2009 8/16/2010

and running the dplyr::distinct() function over the data.frame:

library(dplyr)
distinct(df)
     school  firstday   lastday
1    dewitt 8/19/2009 8/19/2010
2 stuttgart 8/12/2009 8/12/2010
3  crossett 8/16/2009 8/16/2010

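This returns only the three unique rows. distinct() drops a row only when it matches another row in every column, so entries for the same district with different dates are kept. And all that because "everything STATA can do, R can do better" :-)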
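
The concern in the comments was that the dates might not match for every entry of a district. Here is a minimal sketch of how that case behaves, using base R's unique() (the function suggested in the first comment) so that no package is needed. The data below is made up for illustration: one "dewitt" entry is given a different last day, and the names `deduped` and `conflicts` are hypothetical.

```r
# Hypothetical data: one "dewitt" entry has a different lastday.
df2 <- data.frame(
  school   = c("dewitt", "dewitt", "dewitt", "stuttgart", "stuttgart"),
  firstday = c("8/19/2009", "8/19/2009", "8/19/2009", "8/12/2009", "8/12/2009"),
  lastday  = c("6/1/2010", "6/1/2010", "6/4/2010", "6/2/2010", "6/2/2010"),
  stringsAsFactors = FALSE
)

# Drop rows that repeat an earlier row in every column.
deduped <- unique(df2)

# Any district that still has more than one row has entries
# whose dates do not all match.
counts <- table(deduped$school)
conflicts <- names(counts[counts > 1])

print(deduped)
print(conflicts)
```

Here "dewitt" keeps two rows (one per distinct pair of dates) and "stuttgart" collapses to one, so `conflicts` lists the districts whose dates need a closer look. Note that this compares the dates as text, so "08/19/2009" and "8/19/2009" would count as different values unless the columns are converted to a common format first.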

mdag02
DJV