1

In my dataset I have information of the ZIPCODE of 600K+ ID's. If ID's move to a different addressess, I want to determine at which zipcode they lived the longest and put a '1' for that specific year in that row (no need to combine rows as I want to know if they where they lived in what year). That way an ID only have a '1' for a certain year at one row (if there are multiple rows for that ID). The yellow highlight is what i don't want; in that case there is a '1' in two rows for the same year. In the preferred dataset there is only one '1' per year per ID possible.

For example: ID 4 lived in 2013 in 2 places (NY and LA), therefore there are 2 rows. At this point there is a 1 in each row for 2013 and I only want a 1 in the row the ID lived the longest between 1-1-2013 and 31-12-2018. ID 4 lived in 2013 longer in LA than in NY, and so only a 1 should be at the row for NY (so in this case the row of LA will be removed because only '0's remain).

I can also put this file in RStudio.

Thank you!

structure(v1)
   ID    CITY ZIPCODE DATE_START   DATE_END DATE_END.1 X2013 X2014 X2015 X2016 X2017 X2018
1   1      NY  1234EF  1-12-2003            31-12-2018     1     1     1     1     1     1
2   2      NY  1234CD  1-12-2003  14-1-2019  14-1-2019     1     1     1     1     1     1
3   2      NY  1234AB  15-1-2019            31-12-2018     0     0     0     0     0     0
4   3      NY  1234AB  15-1-2019            31-12-2018     0     0     0     0     0     0
5   3      NY  1234CD  1-12-2003  14-1-2019  14-1-2019     1     1     1     1     1     1
6   4      LA  1111AB   4-5-2013            31-12-2018     1     1     1     1     1     1
7   4      NY  2222AB  1-12-2003   3-5-2013   3-5-2013     1     0     0     0     0     0
8   5   MIAMI  5555CD   6-2-2015  20-6-2016  20-6-2016     0     0     1     1     0     0
9   5   VEGAS  3333AB   1-1-2004            31-12-2018     1     1     1     1     1     1
10  5 ORLANDO  4444AB  26-2-2004   5-2-2015   5-2-2015     1     1     1     0     0     0
11  5   MIAMI  5555AB  21-6-2016 31-12-2018 31-12-2018     0     0     0     1     1     1
12  5   MIAMI  5555AB   1-1-2019            31-12-2018     0     0     0     0     0     0
13  6  AUSTIN  6666AB  28-2-2017  3-11-2017  3-11-2017     0     0     0     0     1     0
14  6  AUSTIN  6666AB  4-11-2017            31-12-2018     0     0     0     0     1     1
15  6  AUSTIN  7777AB  20-1-2017  27-2-2017  27-2-2017     0     0     0     0     1     0
16  6  AUSTIN  8888AB  1-12-2003  19-1-2017  19-1-2017     1     1     1     1     1     0
> 


structure(list(ID = c(1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 
5L, 5L, 6L, 6L, 6L, 6L), CITY = structure(c(4L, 4L, 4L, 4L, 4L, 
2L, 4L, 3L, 6L, 5L, 3L, 3L, 1L, 1L, 1L, 1L), .Label = c("AUSTIN", 
"LA", "MIAMI", "NY", "ORLANDO", "VEGAS"), class = "factor"), 
    ZIPCODE = structure(c(4L, 3L, 2L, 2L, 3L, 1L, 5L, 9L, 6L, 
    7L, 8L, 8L, 10L, 10L, 11L, 12L), .Label = c("1111AB", "1234AB", 
    "1234CD", "1234EF", "2222AB", "3333AB", "4444AB", "5555AB", 
    "5555CD", "6666AB", "7777AB", "8888AB"), class = "factor"), 
    DATE_START = structure(c(3L, 3L, 4L, 4L, 3L, 10L, 3L, 11L, 
    1L, 7L, 6L, 2L, 8L, 9L, 5L, 3L), .Label = c("1-1-2004", "1-1-2019", 
    "1-12-2003", "15-1-2019", "20-1-2017", "21-6-2016", "26-2-2004", 
    "28-2-2017", "4-11-2017", "4-5-2013", "6-2-2015"), class = "factor"), 
    DATE_END = structure(c(1L, 2L, 1L, 1L, 2L, 1L, 7L, 4L, 1L, 
    9L, 8L, 1L, 6L, 1L, 5L, 3L), .Label = c("", "14-1-2019", 
    "19-1-2017", "20-6-2016", "27-2-2017", "3-11-2017", "3-5-2013", 
    "31-12-2018", "5-2-2015"), class = "factor"), DATE_END.1 = structure(c(7L, 
    1L, 7L, 7L, 1L, 7L, 6L, 3L, 7L, 8L, 7L, 7L, 5L, 7L, 4L, 2L
    ), .Label = c("14-1-2019", "19-1-2017", "20-6-2016", "27-2-2017", 
    "3-11-2017", "3-5-2013", "31-12-2018", "5-2-2015"), class = "factor"), 
    X2013 = c(1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 
    0L, 0L, 0L, 1L), X2014 = c(1L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 
    1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L), X2015 = c(1L, 1L, 0L, 0L, 
    1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L), X2016 = c(1L, 
    1L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L
    ), X2017 = c(1L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 
    0L, 1L, 1L, 1L, 1L), X2018 = c(1L, 1L, 0L, 0L, 1L, 1L, 0L, 
    0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L)), class = "data.frame", row.names = c(NA, 
-16L))

enter image description here

  • HI Student0172, this is an easy operation in R, but if you want to increase the likelihood of a response, you will be more likely to get an answer if you provide minimal reproducible data. See [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for more info. – Ian Campbell May 06 '20 at 18:24
  • Hi Ian, thank you for your anwser. I added R structure to my question. – Student0172 May 06 '20 at 18:28
  • Hi Ian, I tried your previous code from the other command (not sure where it went?). Thank you for your anwser. I've load the dplyr package, but I still get this error: Error in dmy(DATE_END.1) : could not find function "dmy".. Do you have any idea? – Student0172 May 06 '20 at 18:45

1 Answers1

1

You can use a little help from the lubridate package to calculate how many days are spent at each location. Then we can group_by ID and use case_when to assign 1 when the time is the max or 0 otherwise.

library(lubridate)
library(dplyr)
v1 %>%
  dplyr::select(ID,CITY,ZIPCODE,DATE_START,DATE_END.1) %>%
  rowwise() %>%
  mutate("X2013" = max(0, min(dmy("31-12-2013"),dmy(DATE_END.1)) - max(dmy("1-1-2013"),dmy(DATE_START))),
         "X2014" = max(0, min(dmy("31-12-2014"),dmy(DATE_END.1)) - max(dmy("1-1-2014"),dmy(DATE_START))),
         "X2015" = max(0, min(dmy("31-12-2015"),dmy(DATE_END.1)) - max(dmy("1-1-2015"),dmy(DATE_START))),
         "X2016" = max(0, min(dmy("31-12-2016"),dmy(DATE_END.1)) - max(dmy("1-1-2016"),dmy(DATE_START))),
         "X2017" = max(0, min(dmy("31-12-2017"),dmy(DATE_END.1)) - max(dmy("1-1-2017"),dmy(DATE_START))),
         "X2018" = max(0, min(dmy("31-12-2018"),dmy(DATE_END.1)) - max(dmy("1-1-2018"),dmy(DATE_START)))) %>%
  ungroup %>%
  group_by(ID) %>%
  mutate_at(vars(starts_with("X")),list(~ case_when(. == max(.) ~ 1,
                                                    TRUE ~ 0)))
# A tibble: 16 x 11
# Groups:   ID [6]
      ID CITY    ZIPCODE DATE_START DATE_END.1 X2013 X2014 X2015 X2016 X2017 X2018
   <int> <fct>   <fct>   <fct>      <fct>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1 NY      1234EF  1-12-2003  31-12-2018     1     1     1     1     1     1
 2     2 NY      1234CD  1-12-2003  14-1-2019      1     1     1     1     1     1
 3     2 NY      1234AB  15-1-2019  31-12-2018     0     0     0     0     0     0
 4     3 NY      1234AB  15-1-2019  31-12-2018     0     0     0     0     0     0
 5     3 NY      1234CD  1-12-2003  14-1-2019      1     1     1     1     1     1
 6     4 LA      1111AB  4-5-2013   31-12-2018     1     1     1     1     1     1
 7     4 NY      2222AB  1-12-2003  3-5-2013       0     0     0     0     0     0
 8     5 MIAMI   5555CD  6-2-2015   20-6-2016      0     0     0     0     0     0
 9     5 VEGAS   3333AB  1-1-2004   31-12-2018     1     1     1     1     1     1
10     5 ORLANDO 4444AB  26-2-2004  5-2-2015       1     1     0     0     0     0
11     5 MIAMI   5555AB  21-6-2016  31-12-2018     0     0     0     0     1     1
12     5 MIAMI   5555AB  1-1-2019   31-12-2018     0     0     0     0     0     0
13     6 AUSTIN  6666AB  28-2-2017  3-11-2017      0     0     0     0     1     0
14     6 AUSTIN  6666AB  4-11-2017  31-12-2018     0     0     0     0     0     1
15     6 AUSTIN  7777AB  20-1-2017  27-2-2017      0     0     0     0     0     0
16     6 AUSTIN  8888AB  1-12-2003  19-1-2017      1     1     1     1     0     0

There is certainly a way that one could implement the first mutate call to not require manually writing each year, but would take much more work than just typing it out.

Ian Campbell
  • 23,484
  • 14
  • 36
  • 57
  • Hi Ian, thank you for your help! Unfortunately I got this error: Grouping rowwise data frame strips rowwise nature. Online I found something about ungrouping de data again. Are you familiar with this error? About mutate: I'm grateful for this anwser because I can understand it better this way! – Student0172 May 06 '20 at 19:15
  • Sorry about that, I am using a development version of `dplyr` and did not realize that would be a problem. I edited my answer. – Ian Campbell May 06 '20 at 19:16
  • Thank you so much Ian, it worked! I added v2 <- v1 at the beginning (otherwise it didn't change my dataframe. Hopefully it will work on my huge dataframe as well. Thank you! – Student0172 May 06 '20 at 19:23
  • Hi Ian, I triend your code om my original dataset (990K rows). However, I left it running all night and nothing happend. I'm currently running it again. Do you have any idea why this is? Thank you! – Student0172 May 07 '20 at 07:37