1

I have a df that of 32 columns and just under a million rows. The columns are the POINTID (individual id), First (year that an event first happened), and then 30 columns of years w binary occurrence data. I would like the first occurrence in each row (currently stored as a 1, same as all other occurrences) to be changed to a 2, so that I can differentiate between the first event and repeat events. I've tried doing this with the tidyverse, but even then it is taking forever. I can't tell if my code is just wrong or if it's not computationally efficient enough. I tested it on a smaller dataset and it seemed to work, in the long format but not the wide, so I'm thinking it's an efficiency issue because the pivot_longer table generated is about about 35 million rows long.

Can anyone help me understand why this isn't working or how to do it in a way that computes faster?

classifications %>%
  pivot_longer(-c(1,32),names_to="Years", values_to="Present")%>%
  group_by(POINTID)%>%
  mutate(Present=replace(Present, Years==first, 2))

A reduced version of my DF is below:

> dput(classifications)
structure(list(POINTID = 2:11, first = structure(c(33L, 33L, 
33L, 33L, 1L, 33L, 33L, 1L, 1L, 36L), .Label = c("X1985", "X1986", 
"X1987", "X1988", "X1989", "X1990", "X1991", "X1992", "X1993", 
"X1994", "X1995", "X1996", "X1997", "X1998", "X1999", "X2000", 
"X2001", "X2002", "X2003", "X2004", "X2005", "X2006", "X2007", 
"X2008", "X2009", "X2010", "X2011", "X2012", "X2013", "X2014", 
"X2015", "X2016", "X2017", "X2018", "X2019", "X2020"), class = "factor"), 
    X1990 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), X1991 = c(0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0), X1992 = c(0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0), X1993 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), X1994 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), X1995 = c(0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0), X1996 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), X1997 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), X1998 = c(0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0), X1999 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), X2000 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), X2001 = c(0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0), X2002 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), X2003 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), X2004 = c(0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0), X2005 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), X2006 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), X2007 = c(0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0), X2008 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), X2009 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), X2010 = c(0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0), X2011 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), X2012 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), X2013 = c(0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0), X2014 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), X2015 = c(0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), X2016 = c(0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0), X2017 = c(1, 1, 1, 1, 0, 1, 1, 0, 0, 0), X2018 = c(1, 
    0, 0, 0, 0, 0, 0, 0, 0, 0), X2019 = c(0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0), X2020 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1)), row.names = c(NA, 
10L), class = "data.frame")
Rachel
  • 15
  • 3
  • thank you for your comment! I included screenshots in the post edit. Hopefully that makes it clearer – Rachel Aug 07 '20 at 01:12
  • oh, I see! Sorry, I thought you just meant to see what the DF looks like... o_o clearly brain fried and didn't parse. Thank you for the link, hopefully the mini DF I updated with is sufficient – Rachel Aug 07 '20 at 01:18
  • Some of the `first` values are not present as columns in the example. For example, `X1985`. Is that the case only for this mini DF and will not be the case in original data? – Ronak Shah Aug 07 '20 at 01:29

1 Answers1

0

You can do this keeping the data in wide format with vectorised operations of row/column subsetting. We get the column index using match.

mat <- cbind(1:nrow(classifications), 
             match(classifications$first, names(classifications)))

classifications[mat] <- 2
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213