How to determine the longest timeperiod and exclude other rows in Excel or R?

Question

In my dataset I have information of the ZIPCODE of 600K+ ID's. If ID's move to a different addressess, I want to determine at which zipcode they lived the longest and put a '1' for that specific year in that row (no need to combine rows as I want to know if they where they lived in what year). That way an ID only have a '1' for a certain year at one row (if there are multiple rows for that ID). The yellow highlight is what i don't want; in that case there is a '1' in two rows for the same year. In the preferred dataset there is only one '1' per year per ID possible.

For example: ID 4 lived in 2013 in 2 places (NY and LA), therefore there are 2 rows. At this point there is a 1 in each row for 2013 and I only want a 1 in the row the ID lived the longest between 1-1-2013 and 31-12-2018. ID 4 lived in 2013 longer in LA than in NY, and so only a 1 should be at the row for NY (so in this case the row of LA will be removed because only '0's remain).

I can also put this file in RStudio.

Thank you!

structure(v1)
   ID    CITY ZIPCODE DATE_START   DATE_END DATE_END.1 X2013 X2014 X2015 X2016 X2017 X2018
1   1      NY  1234EF  1-12-2003            31-12-2018     1     1     1     1     1     1
2   2      NY  1234CD  1-12-2003  14-1-2019  14-1-2019     1     1     1     1     1     1
3   2      NY  1234AB  15-1-2019            31-12-2018     0     0     0     0     0     0
4   3      NY  1234AB  15-1-2019            31-12-2018     0     0     0     0     0     0
5   3      NY  1234CD  1-12-2003  14-1-2019  14-1-2019     1     1     1     1     1     1
6   4      LA  1111AB   4-5-2013            31-12-2018     1     1     1     1     1     1
7   4      NY  2222AB  1-12-2003   3-5-2013   3-5-2013     1     0     0     0     0     0
8   5   MIAMI  5555CD   6-2-2015  20-6-2016  20-6-2016     0     0     1     1     0     0
9   5   VEGAS  3333AB   1-1-2004            31-12-2018     1     1     1     1     1     1
10  5 ORLANDO  4444AB  26-2-2004   5-2-2015   5-2-2015     1     1     1     0     0     0
11  5   MIAMI  5555AB  21-6-2016 31-12-2018 31-12-2018     0     0     0     1     1     1
12  5   MIAMI  5555AB   1-1-2019            31-12-2018     0     0     0     0     0     0
13  6  AUSTIN  6666AB  28-2-2017  3-11-2017  3-11-2017     0     0     0     0     1     0
14  6  AUSTIN  6666AB  4-11-2017            31-12-2018     0     0     0     0     1     1
15  6  AUSTIN  7777AB  20-1-2017  27-2-2017  27-2-2017     0     0     0     0     1     0
16  6  AUSTIN  8888AB  1-12-2003  19-1-2017  19-1-2017     1     1     1     1     1     0
> 


structure(list(ID = c(1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 
5L, 5L, 6L, 6L, 6L, 6L), CITY = structure(c(4L, 4L, 4L, 4L, 4L, 
2L, 4L, 3L, 6L, 5L, 3L, 3L, 1L, 1L, 1L, 1L), .Label = c("AUSTIN", 
"LA", "MIAMI", "NY", "ORLANDO", "VEGAS"), class = "factor"), 
    ZIPCODE = structure(c(4L, 3L, 2L, 2L, 3L, 1L, 5L, 9L, 6L, 
    7L, 8L, 8L, 10L, 10L, 11L, 12L), .Label = c("1111AB", "1234AB", 
    "1234CD", "1234EF", "2222AB", "3333AB", "4444AB", "5555AB", 
    "5555CD", "6666AB", "7777AB", "8888AB"), class = "factor"), 
    DATE_START = structure(c(3L, 3L, 4L, 4L, 3L, 10L, 3L, 11L, 
    1L, 7L, 6L, 2L, 8L, 9L, 5L, 3L), .Label = c("1-1-2004", "1-1-2019", 
    "1-12-2003", "15-1-2019", "20-1-2017", "21-6-2016", "26-2-2004", 
    "28-2-2017", "4-11-2017", "4-5-2013", "6-2-2015"), class = "factor"), 
    DATE_END = structure(c(1L, 2L, 1L, 1L, 2L, 1L, 7L, 4L, 1L, 
    9L, 8L, 1L, 6L, 1L, 5L, 3L), .Label = c("", "14-1-2019", 
    "19-1-2017", "20-6-2016", "27-2-2017", "3-11-2017", "3-5-2013", 
    "31-12-2018", "5-2-2015"), class = "factor"), DATE_END.1 = structure(c(7L, 
    1L, 7L, 7L, 1L, 7L, 6L, 3L, 7L, 8L, 7L, 7L, 5L, 7L, 4L, 2L
    ), .Label = c("14-1-2019", "19-1-2017", "20-6-2016", "27-2-2017", 
    "3-11-2017", "3-5-2013", "31-12-2018", "5-2-2015"), class = "factor"), 
    X2013 = c(1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 
    0L, 0L, 0L, 1L), X2014 = c(1L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 
    1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L), X2015 = c(1L, 1L, 0L, 0L, 
    1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L), X2016 = c(1L, 
    1L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L
    ), X2017 = c(1L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 
    0L, 1L, 1L, 1L, 1L), X2018 = c(1L, 1L, 0L, 0L, 1L, 1L, 0L, 
    0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L)), class = "data.frame", row.names = c(NA, 
-16L))

HI Student0172, this is an easy operation in R, but if you want to increase the likelihood of a response, you will be more likely to get an answer if you provide minimal reproducible data. See [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for more info. — Ian Campbell, May 06 '20 at 18:24
Hi Ian, thank you for your anwser. I added R structure to my question. — Student0172, May 06 '20 at 18:28
Hi Ian, I tried your previous code from the other command (not sure where it went?). Thank you for your anwser. I've load the dplyr package, but I still get this error: Error in dmy(DATE_END.1) : could not find function "dmy".. Do you have any idea? — Student0172, May 06 '20 at 18:45

Ian Campbell · Answer 1 · 2020-05-06T19:16:10.673

You can use a little help from the lubridate package to calculate how many days are spent at each location. Then we can group_by ID and use case_when to assign 1 when the time is the max or 0 otherwise.

library(lubridate)
library(dplyr)
v1 %>%
  dplyr::select(ID,CITY,ZIPCODE,DATE_START,DATE_END.1) %>%
  rowwise() %>%
  mutate("X2013" = max(0, min(dmy("31-12-2013"),dmy(DATE_END.1)) - max(dmy("1-1-2013"),dmy(DATE_START))),
         "X2014" = max(0, min(dmy("31-12-2014"),dmy(DATE_END.1)) - max(dmy("1-1-2014"),dmy(DATE_START))),
         "X2015" = max(0, min(dmy("31-12-2015"),dmy(DATE_END.1)) - max(dmy("1-1-2015"),dmy(DATE_START))),
         "X2016" = max(0, min(dmy("31-12-2016"),dmy(DATE_END.1)) - max(dmy("1-1-2016"),dmy(DATE_START))),
         "X2017" = max(0, min(dmy("31-12-2017"),dmy(DATE_END.1)) - max(dmy("1-1-2017"),dmy(DATE_START))),
         "X2018" = max(0, min(dmy("31-12-2018"),dmy(DATE_END.1)) - max(dmy("1-1-2018"),dmy(DATE_START)))) %>%
  ungroup %>%
  group_by(ID) %>%
  mutate_at(vars(starts_with("X")),list(~ case_when(. == max(.) ~ 1,
                                                    TRUE ~ 0)))
# A tibble: 16 x 11
# Groups:   ID [6]
      ID CITY    ZIPCODE DATE_START DATE_END.1 X2013 X2014 X2015 X2016 X2017 X2018
   <int> <fct>   <fct>   <fct>      <fct>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1 NY      1234EF  1-12-2003  31-12-2018     1     1     1     1     1     1
 2     2 NY      1234CD  1-12-2003  14-1-2019      1     1     1     1     1     1
 3     2 NY      1234AB  15-1-2019  31-12-2018     0     0     0     0     0     0
 4     3 NY      1234AB  15-1-2019  31-12-2018     0     0     0     0     0     0
 5     3 NY      1234CD  1-12-2003  14-1-2019      1     1     1     1     1     1
 6     4 LA      1111AB  4-5-2013   31-12-2018     1     1     1     1     1     1
 7     4 NY      2222AB  1-12-2003  3-5-2013       0     0     0     0     0     0
 8     5 MIAMI   5555CD  6-2-2015   20-6-2016      0     0     0     0     0     0
 9     5 VEGAS   3333AB  1-1-2004   31-12-2018     1     1     1     1     1     1
10     5 ORLANDO 4444AB  26-2-2004  5-2-2015       1     1     0     0     0     0
11     5 MIAMI   5555AB  21-6-2016  31-12-2018     0     0     0     0     1     1
12     5 MIAMI   5555AB  1-1-2019   31-12-2018     0     0     0     0     0     0
13     6 AUSTIN  6666AB  28-2-2017  3-11-2017      0     0     0     0     1     0
14     6 AUSTIN  6666AB  4-11-2017  31-12-2018     0     0     0     0     0     1
15     6 AUSTIN  7777AB  20-1-2017  27-2-2017      0     0     0     0     0     0
16     6 AUSTIN  8888AB  1-12-2003  19-1-2017      1     1     1     1     0     0

There is certainly a way that one could implement the first mutate call to not require manually writing each year, but would take much more work than just typing it out.

Hi Ian, thank you for your help! Unfortunately I got this error: Grouping rowwise data frame strips rowwise nature. Online I found something about ungrouping de data again. Are you familiar with this error? About mutate: I'm grateful for this anwser because I can understand it better this way! — Student0172, May 06 '20 at 19:15
Sorry about that, I am using a development version of `dplyr` and did not realize that would be a problem. I edited my answer. — Ian Campbell, May 06 '20 at 19:16
Thank you so much Ian, it worked! I added v2 <- v1 at the beginning (otherwise it didn't change my dataframe. Hopefully it will work on my huge dataframe as well. Thank you! — Student0172, May 06 '20 at 19:23
Hi Ian, I triend your code om my original dataset (990K rows). However, I left it running all night and nothing happend. I'm currently running it again. Do you have any idea why this is? Thank you! — Student0172, May 07 '20 at 07:37

How to determine the longest timeperiod and exclude other rows in Excel or R?

1 Answers1