adding rows with values of '0' for missing years

Question

I have a question that is somewhat similar to others that have been posted, but after looking thoroughly at several posts, I can't get the code to work. Any help would be much appreciated.

My data frame looks like, this:

'data.frame':   501 obs. of  5 variables:
 $ Tattoo.MUM   : Factor w/ 250 levels "1004","1007",..: 76 76 76 81 81 81 85 85 85 85 ...
 $ OffspringMUMs: int  4 4 4 4 4 4 11 11 11 11 ...
 $ YearBIRTH.CUB: int  1988 1990 1991 1988 1991 2007 1989 1991 1992 1993 ...
 $ YearBIRTH.MUM: int  1991 1991 NA NA NA NA 1987 1987 1987 1987 ...
 $ OFFSpYR      : int  2 1 1 1 2 1 1 4 3 3 ...

A few lines here:

structure(list(Tattoo.MUM = structure(c(6L, 6L, 6L, 6L, 7L, 7L, 
7L, 8L, 9L, 11L, 11L, 11L, 11L, 5L, 1L, 4L, 2L, 3L, 3L, 10L, 
10L, 10L, 10L, 10L), .Label = c("10454", "1045A", "1045X", "12392", 
"1601", "22", "27", "29", "41", "424X", "60"), class = "factor"), 
OffspringMUMs = c(11L, 11L, 11L, 11L, 5L, 5L, 5L, 1L, 3L, 
7L, 7L, 7L, 7L, 1L, 2L, 1L, 1L, 4L, 4L, 6L, 6L, 6L, 6L, 6L
), YearBIRTH.CUB = c(1989L, 1991L, 1992L, 1993L, 1990L, 1991L, 
1993L, 1989L, 1988L, 1988L, 1989L, 1991L, 1994L, 2015L, 2012L, 
2015L, 2005L, 2009L, 2010L, 1996L, 1998L, 2000L, 2001L, 2006L
), YearBIRTH.MUM = c(1987L, 1987L, 1987L, 1987L, NA, NA, 
NA, NA, NA, 1987L, 1987L, 1987L, 1987L, NA, NA, NA, NA, 2005L, 
2005L, 1994L, 1994L, 1994L, 1994L, 1994L), OFFSpYR = c(1L, 
4L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 2L, 1L, 1L, 2L, 1L, 
1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L)), .Names = c("Tattoo.MUM", 
"OffspringMUMs", "YearBIRTH.CUB", "YearBIRTH.MUM", "OFFSpYR"), class = "data.frame", row.names = c(NA, 
-24L))

I want to add new rows for all missing years (YearBIRTH.CUB) in Tattoo.MUM keeping the rest of the values the same and adding '0' to OFFSpYR.

Like so:

structure(list(Tattoo.MUM = structure(c(6L, 6L, 6L, 6L, 6L, 7L, 
7L, 7L, 7L, 8L, 9L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 5L, 1L, 
4L, 2L, 3L, 3L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L), .Label = c("10454", "1045A", "1045X", "12392", "1601", 
"22", "27", "29", "41", "424X", "60"), class = "factor"), OffspringMUMs = c(11L, 
11L, 11L, 11L, 11L, 5L, 5L, 5L, 5L, 1L, 3L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 1L, 2L, 1L, 1L, 4L, 4L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L), YearBIRTH.CUB = c(1989L, 1990L, 1991L, 1992L, 1993L, 
1990L, 1991L, 1992L, 1993L, 1989L, 1988L, 1988L, 1989L, 1990L, 
1991L, 1992L, 1993L, 1994L, 2015L, 2012L, 2015L, 2005L, 2009L, 
2010L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 
2004L, 2005L, 2006L), YearBIRTH.MUM = c(1987L, 1987L, 1987L, 
1987L, 1987L, NA, NA, NA, NA, NA, NA, 1987L, 1987L, 1987L, 1987L, 
1987L, 1987L, 1987L, NA, NA, NA, NA, 2005L, 2005L, 1994L, 1994L, 
1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L
), OFFSpYR = c(1L, 0L, 4L, 3L, 3L, 1L, 1L, 0L, 3L, 1L, 3L, 3L, 
1L, 0L, 2L, 0L, 0L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 0L, 1L, 0L, 
1L, 2L, 0L, 0L, 0L, 0L, 1L)), .Names = c("Tattoo.MUM", "OffspringMUMs", 
"YearBIRTH.CUB", "YearBIRTH.MUM", "OFFSpYR"), class = "data.frame", row.names = c(NA, 
-35L))

I've tried:

 library(tidyr)
 library(dplyr)

 df1 <- pedMUM %>% group_by(Tattoo.MUM, OffspringMUMs) %>% complete(YearBIRTH.CUB = full_seq(YearBIRTH.CUB,1)) %>% fill(OFFSpYR=0)

library(data.table)

df1 <- setDT(pedMUM)[CJ(Tattoo.MUM=Tattoo.MUM, OffspringMUMs=OffspringMUMs, YearBIRTH.MUM=YearBIRTH.MUM, YearBIRTH.CUB=seq(min(YearBIRTH.CUB), max(YearBIRTH.CUB)), unique=TRUE),
                 on=.(Tattoo.MUM, OffspringMUMs, YearBIRTH.CUB),  roll=T]

I am obviously using tidyr, dplyr, and data.table wrongly because none have given me the results I want.

I've had a look at the following posts:

Add rows with missing years by group

Adding rows with values of "0" to a dataframe with missing data

Find missing month after grouping with dplyr

And even tried loops:

R code - clever loop to add rows

but I get confused when I try to determine the year sequence for each Tattoo.MUM within the loop.

Would anyone be able to point me in the right direction?

Add the result from dput(head(df1)), instead of typing out a few lines, and check the lines you did write, not sure if they are clear enough. See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Hector Haffenden, Mar 15 '19 at 01:19

Simon Woodward · Accepted Answer · 2019-03-18T20:49:13.237

2

I haven't used complete() before, but the following seems to work. nesting() allows you to keep two variables together, =full_seq() allows you to expand the values of a variable, fill=list() allows you to fill in blanks.

pedMUM <- structure(list(Tattoo.MUM = structure(c(6L, 6L, 6L, 6L, 7L, 7L,
7L, 8L, 9L, 11L, 11L, 11L, 11L, 5L, 1L, 4L, 2L, 3L, 3L, 10L,
10L, 10L, 10L, 10L), .Label = c("10454", "1045A", "1045X", "12392",
"1601", "22", "27", "29", "41", "424X", "60"), class = "factor"),
OffspringMUMs = c(11L, 11L, 11L, 11L, 5L, 5L, 5L, 1L, 3L,
7L, 7L, 7L, 7L, 1L, 2L, 1L, 1L, 4L, 4L, 6L, 6L, 6L, 6L, 6L
), YearBIRTH.CUB = c(1989L, 1991L, 1992L, 1993L, 1990L, 1991L,
1993L, 1989L, 1988L, 1988L, 1989L, 1991L, 1994L, 2015L, 2012L,
2015L, 2005L, 2009L, 2010L, 1996L, 1998L, 2000L, 2001L, 2006L
), YearBIRTH.MUM = c(1987L, 1987L, 1987L, 1987L, NA, NA,
NA, NA, NA, 1987L, 1987L, 1987L, 1987L, NA, NA, NA, NA, 2005L,
2005L, 1994L, 1994L, 1994L, 1994L, 1994L), OFFSpYR = c(1L,
4L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 2L, 1L, 1L, 2L, 1L,
1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L)), .Names = c("Tattoo.MUM",
"OffspringMUMs", "YearBIRTH.CUB", "YearBIRTH.MUM", "OFFSpYR"), class = "data.frame", row.names = c(NA,
-24L))

library(tidyr)
library(dplyr)

df1 <- pedMUM %>%
    group_by(Tattoo.MUM) %>% # find min and max year for each mum
    mutate(
        minyear=min(YearBIRTH.CUB, na.rm=TRUE),
        maxyear=max(YearBIRTH.CUB, na.rm=TRUE)
    ) %>% 
    complete( # complete table
        nesting(Tattoo.MUM, minyear, maxyear, OffspringMUMs, YearBIRTH.MUM),
        YearBIRTH.CUB=full_seq(YearBIRTH.CUB, 1),
        fill=list(OFFSpYR=0)
        ) %>% 
    filter(YearBIRTH.CUB>=minyear & YearBIRTH.CUB<=maxyear) %>% # remove unwanted years
    select(names(pedMUM))  # return original column order

edited Mar 18 '19 at 20:49

answered Mar 15 '19 at 01:53

Simon Woodward

1,946
1
16
24

Thanks for your help, but none of those are working... The first option provides an extra row for each year in the dataset so not really what I wanted, and the second one gives the following error: Error in if (any((x - rng[1])%%period > tol)) { : missing value where TRUE/FALSE needed – ElisaPBadas16 Mar 18 '19 at 15:45
Still doesn't work if I use the complete dataset (no NAs) and just the 3 variables I'm interested in: df1 <- pedMUM %>% complete(YrBIRTH.CUBS,nesting(Tattoo.MUM, YrBIRTH.MUMS), YrBIRTH.CUBS=full_seq(YrBIRTH.CUBS, 1), fill=list(OFFSpYR=0)) – ElisaPBadas16 Mar 18 '19 at 16:05
I've now included the complete script which works for me. You have `YrBIRTH.CUBS` appearing twice in your code, which is probably not allowed. Also check the spelling and case of your variable names. – Simon Woodward Mar 18 '19 at 20:17
Ah so you don't actually want the full sequence of years for each mother. – Simon Woodward Mar 18 '19 at 20:23
Thanks! (apologies about the variables names, I was sloppy and was using a different data frame today). For some reason the filter line of code gave me an error: Error: All select() inputs must resolve to integer column positions. The following do not: * names(pedMUM) – ElisaPBadas16 Mar 18 '19 at 23:34
1

but I filtered out the unwanted years in a separate line and that it worked! – ElisaPBadas16 Mar 18 '19 at 23:35
Is your version of R and all your packages packages up to date? – Simon Woodward Mar 19 '19 at 00:59

adding rows with values of '0' for missing years

1 Answers1