0

I have a question that is somewhat similar to others that have been posted, but after looking thoroughly at several posts, I can't get the code to work. Any help would be much appreciated.

My data frame looks like, this:

'data.frame':   501 obs. of  5 variables:
 $ Tattoo.MUM   : Factor w/ 250 levels "1004","1007",..: 76 76 76 81 81 81 85 85 85 85 ...
 $ OffspringMUMs: int  4 4 4 4 4 4 11 11 11 11 ...
 $ YearBIRTH.CUB: int  1988 1990 1991 1988 1991 2007 1989 1991 1992 1993 ...
 $ YearBIRTH.MUM: int  1991 1991 NA NA NA NA 1987 1987 1987 1987 ...
 $ OFFSpYR      : int  2 1 1 1 2 1 1 4 3 3 ...

A few lines here:

structure(list(Tattoo.MUM = structure(c(6L, 6L, 6L, 6L, 7L, 7L, 
7L, 8L, 9L, 11L, 11L, 11L, 11L, 5L, 1L, 4L, 2L, 3L, 3L, 10L, 
10L, 10L, 10L, 10L), .Label = c("10454", "1045A", "1045X", "12392", 
"1601", "22", "27", "29", "41", "424X", "60"), class = "factor"), 
OffspringMUMs = c(11L, 11L, 11L, 11L, 5L, 5L, 5L, 1L, 3L, 
7L, 7L, 7L, 7L, 1L, 2L, 1L, 1L, 4L, 4L, 6L, 6L, 6L, 6L, 6L
), YearBIRTH.CUB = c(1989L, 1991L, 1992L, 1993L, 1990L, 1991L, 
1993L, 1989L, 1988L, 1988L, 1989L, 1991L, 1994L, 2015L, 2012L, 
2015L, 2005L, 2009L, 2010L, 1996L, 1998L, 2000L, 2001L, 2006L
), YearBIRTH.MUM = c(1987L, 1987L, 1987L, 1987L, NA, NA, 
NA, NA, NA, 1987L, 1987L, 1987L, 1987L, NA, NA, NA, NA, 2005L, 
2005L, 1994L, 1994L, 1994L, 1994L, 1994L), OFFSpYR = c(1L, 
4L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 2L, 1L, 1L, 2L, 1L, 
1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L)), .Names = c("Tattoo.MUM", 
"OffspringMUMs", "YearBIRTH.CUB", "YearBIRTH.MUM", "OFFSpYR"), class = "data.frame", row.names = c(NA, 
-24L))

I want to add new rows for all missing years (YearBIRTH.CUB) in Tattoo.MUM keeping the rest of the values the same and adding '0' to OFFSpYR.

Like so:

structure(list(Tattoo.MUM = structure(c(6L, 6L, 6L, 6L, 6L, 7L, 
7L, 7L, 7L, 8L, 9L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 5L, 1L, 
4L, 2L, 3L, 3L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L), .Label = c("10454", "1045A", "1045X", "12392", "1601", 
"22", "27", "29", "41", "424X", "60"), class = "factor"), OffspringMUMs = c(11L, 
11L, 11L, 11L, 11L, 5L, 5L, 5L, 5L, 1L, 3L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 1L, 2L, 1L, 1L, 4L, 4L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L), YearBIRTH.CUB = c(1989L, 1990L, 1991L, 1992L, 1993L, 
1990L, 1991L, 1992L, 1993L, 1989L, 1988L, 1988L, 1989L, 1990L, 
1991L, 1992L, 1993L, 1994L, 2015L, 2012L, 2015L, 2005L, 2009L, 
2010L, 1996L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 
2004L, 2005L, 2006L), YearBIRTH.MUM = c(1987L, 1987L, 1987L, 
1987L, 1987L, NA, NA, NA, NA, NA, NA, 1987L, 1987L, 1987L, 1987L, 
1987L, 1987L, 1987L, NA, NA, NA, NA, 2005L, 2005L, 1994L, 1994L, 
1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L, 1994L
), OFFSpYR = c(1L, 0L, 4L, 3L, 3L, 1L, 1L, 0L, 3L, 1L, 3L, 3L, 
1L, 0L, 2L, 0L, 0L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 0L, 1L, 0L, 
1L, 2L, 0L, 0L, 0L, 0L, 1L)), .Names = c("Tattoo.MUM", "OffspringMUMs", 
"YearBIRTH.CUB", "YearBIRTH.MUM", "OFFSpYR"), class = "data.frame", row.names = c(NA, 
-35L))

I've tried:

 library(tidyr)
 library(dplyr)

 df1 <- pedMUM %>% group_by(Tattoo.MUM, OffspringMUMs) %>% complete(YearBIRTH.CUB = full_seq(YearBIRTH.CUB,1)) %>% fill(OFFSpYR=0)

library(data.table)

df1 <- setDT(pedMUM)[CJ(Tattoo.MUM=Tattoo.MUM, OffspringMUMs=OffspringMUMs, YearBIRTH.MUM=YearBIRTH.MUM, YearBIRTH.CUB=seq(min(YearBIRTH.CUB), max(YearBIRTH.CUB)), unique=TRUE),
                 on=.(Tattoo.MUM, OffspringMUMs, YearBIRTH.CUB),  roll=T]

I am obviously using tidyr, dplyr, and data.table wrongly because none have given me the results I want.

I've had a look at the following posts:

Add rows with missing years by group

Adding rows with values of "0" to a dataframe with missing data

Find missing month after grouping with dplyr

And even tried loops:

R code - clever loop to add rows

but I get confused when I try to determine the year sequence for each Tattoo.MUM within the loop.

Would anyone be able to point me in the right direction?

  • Add the result from dput(head(df1)), instead of typing out a few lines, and check the lines you did write, not sure if they are clear enough. See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Hector Haffenden Mar 15 '19 at 01:19
  • thank you! I've edited it as you suggested – ElisaPBadas16 Mar 15 '19 at 01:38

1 Answers1

2

I haven't used complete() before, but the following seems to work. nesting() allows you to keep two variables together, =full_seq() allows you to expand the values of a variable, fill=list() allows you to fill in blanks.

pedMUM <- structure(list(Tattoo.MUM = structure(c(6L, 6L, 6L, 6L, 7L, 7L,
7L, 8L, 9L, 11L, 11L, 11L, 11L, 5L, 1L, 4L, 2L, 3L, 3L, 10L,
10L, 10L, 10L, 10L), .Label = c("10454", "1045A", "1045X", "12392",
"1601", "22", "27", "29", "41", "424X", "60"), class = "factor"),
OffspringMUMs = c(11L, 11L, 11L, 11L, 5L, 5L, 5L, 1L, 3L,
7L, 7L, 7L, 7L, 1L, 2L, 1L, 1L, 4L, 4L, 6L, 6L, 6L, 6L, 6L
), YearBIRTH.CUB = c(1989L, 1991L, 1992L, 1993L, 1990L, 1991L,
1993L, 1989L, 1988L, 1988L, 1989L, 1991L, 1994L, 2015L, 2012L,
2015L, 2005L, 2009L, 2010L, 1996L, 1998L, 2000L, 2001L, 2006L
), YearBIRTH.MUM = c(1987L, 1987L, 1987L, 1987L, NA, NA,
NA, NA, NA, 1987L, 1987L, 1987L, 1987L, NA, NA, NA, NA, 2005L,
2005L, 1994L, 1994L, 1994L, 1994L, 1994L), OFFSpYR = c(1L,
4L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 2L, 1L, 1L, 2L, 1L,
1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L)), .Names = c("Tattoo.MUM",
"OffspringMUMs", "YearBIRTH.CUB", "YearBIRTH.MUM", "OFFSpYR"), class = "data.frame", row.names = c(NA,
-24L))

library(tidyr)
library(dplyr)

df1 <- pedMUM %>%
    group_by(Tattoo.MUM) %>% # find min and max year for each mum
    mutate(
        minyear=min(YearBIRTH.CUB, na.rm=TRUE),
        maxyear=max(YearBIRTH.CUB, na.rm=TRUE)
    ) %>% 
    complete( # complete table
        nesting(Tattoo.MUM, minyear, maxyear, OffspringMUMs, YearBIRTH.MUM),
        YearBIRTH.CUB=full_seq(YearBIRTH.CUB, 1),
        fill=list(OFFSpYR=0)
        ) %>% 
    filter(YearBIRTH.CUB>=minyear & YearBIRTH.CUB<=maxyear) %>% # remove unwanted years
    select(names(pedMUM))  # return original column order
Simon Woodward
  • 1,946
  • 1
  • 16
  • 24
  • Thanks for your help, but none of those are working... The first option provides an extra row for each year in the dataset so not really what I wanted, and the second one gives the following error: Error in if (any((x - rng[1])%%period > tol)) { : missing value where TRUE/FALSE needed – ElisaPBadas16 Mar 18 '19 at 15:45
  • Still doesn't work if I use the complete dataset (no NAs) and just the 3 variables I'm interested in: df1 <- pedMUM %>% complete(YrBIRTH.CUBS,nesting(Tattoo.MUM, YrBIRTH.MUMS), YrBIRTH.CUBS=full_seq(YrBIRTH.CUBS, 1), fill=list(OFFSpYR=0)) – ElisaPBadas16 Mar 18 '19 at 16:05
  • I've now included the complete script which works for me. You have `YrBIRTH.CUBS` appearing twice in your code, which is probably not allowed. Also check the spelling and case of your variable names. – Simon Woodward Mar 18 '19 at 20:17
  • Ah so you don't actually want the full sequence of years for each mother. – Simon Woodward Mar 18 '19 at 20:23
  • Thanks! (apologies about the variables names, I was sloppy and was using a different data frame today). For some reason the filter line of code gave me an error: Error: All select() inputs must resolve to integer column positions. The following do not: * names(pedMUM) – ElisaPBadas16 Mar 18 '19 at 23:34
  • 1
    but I filtered out the unwanted years in a separate line and that it worked! – ElisaPBadas16 Mar 18 '19 at 23:35
  • Is your version of R and all your packages packages up to date? – Simon Woodward Mar 19 '19 at 00:59