0

I have the following data:

library(data.table)  
df <- fread(
    "A   B  C  D  E  F  iso   year   
    0   A   NA  1  NA  NA  NLD   2009   
    1   Y   NA  2  NA  NA  NLD   2010   
    0   Q   NA  3  NA  NA  AUS   2011   
    1   NA  NA  4  NA  NA  AUS   2012 
    0   0   NA  7  NA  NA  NLD   2011  
    0   NA  1   NA  1  NA  ECU   2009   
    1   NA  0   NA  2  0   ECU   2010    
    1   NA  0   NA  2  0   ECU   2011    
    0   NA  0   NA  3  2   BRA   2011   
    1   NA  0   NA  4  NA  BRA   2012",
    header = TRUE
)

I want to extrapolate for each country, the entry of the last year in the dataset,up to a particular year. For example, I want to use the last year for every country in df (NLD=2011, AUS=2012, BRA=2012, ECU=2011), to create entries up to 2014. In other words, I would like to make duplicates of those entries, but with the years changed, of the last available year up to 2014.

Desired output:

library(data.table)  
df <- fread(
    "A   B  C  D  E  F  iso   year   
    0   A   NA  1  NA  NA  NLD   2009   
    1   Y   NA  2  NA  NA  NLD   2010   
    0   Q   NA  3  NA  NA  AUS   2011   
    1   NA  NA  4  NA  NA  AUS   2012
    1   NA  NA  4  NA  NA  AUS   2013 
    1   NA  NA  4  NA  NA  AUS   2014    
    0   0   NA  7  NA  NA  NLD   2011  
    0   0   NA  7  NA  NA  NLD   2012 
    0   0   NA  7  NA  NA  NLD   2013 
    0   0   NA  7  NA  NA  NLD   2014 
    0   NA  1   NA  1  NA  ECU   2009   
    1   NA  0   NA  2  0   ECU   2010   
    0   NA  0   NA  3  0   BRA   2011   
    1   NA  0   NA  4  0   BRA   2011   
    1   NA  0   NA  2  0   ECU   2011   
    1   NA  0   NA  2  0   ECU   2012
    1   NA  0   NA  2  0   ECU   2013 
    1   NA  0   NA  2  0   ECU   2014    
    0   NA  0   NA  3  2   BRA   2011   
    1   NA  0   NA  4  NA  BRA   2012
    1   NA  0   NA  4  NA  BRA   2013
    1   NA  0   NA  4  NA  BRA   2014",
    header = TRUE
)

But I do not really know where to start. What would be the easiest way to do this?

EDIT: When testing on the actual data, I noticed I overlooked one aspect of the dataset when phrasing the question. In the acutal dataset there is one more unique aspects (column A) in the row apart from the year.

library(data.table)  
df <- fread(
    "A   B  C  D  E  F  iso   year   
    1   A   NA  1  NA  NA  NLD   2009   
    1   Y   NA  2  NA  NA  NLD   2010   
    1   Q   NA  3  NA  NA  AUS   2011   
    1   NA  NA  4  NA  NA  AUS   2012 
    1   0   NA  7  NA  NA  NLD   2011  
    1   NA  1   NA  1  NA  ECU   2009   
    1   NA  0   NA  2  0   ECU   2010    
    1   NA  0   NA  2  0   ECU   2011    
    1   NA  0   NA  3  2   BRA   2011   
    1   NA  0   NA  4  NA  BRA   2012
    0   A   NA  1  NA  NA  NLD   2009   
    0   Y   NA  2  NA  NA  NLD   2010   
    0   Q   NA  3  NA  NA  AUS   2011   
    0   NA  NA  4  NA  NA  AUS   2012 
    0   0   NA  7  NA  NA  NLD   2011  
    0   NA  1   NA  1  NA  ECU   2009   
    0   NA  0   NA  2  0   ECU   2010    
    0   NA  0   NA  2  0   ECU   2011    
    0   NA  0   NA  3  2   BRA   2011   
    0   NA  0   NA  4  NA  BRA   2012",
    header = TRUE
)

I have tried to adapt Ronak's result in this respect:

df <- df %>%
  group_by(iso, A) %>%
  slice(c(1:n(), rep(n(), 2014 - last(year)))) %>%
  mutate(year = first(year):2014)
###
df <- df %>%
  #If not arranged by year
  arrange(iso, year, A)
  group_by(iso) %>%
  complete(year = seq(min(year), 2014)) %>%
  fill(B:F)

The top code works, but on my actual data gives the error:

Error: Column `year` must be length 18 (the group size) or one, not 21

The bottom option gives the following error when applied to the example:

Error in group_by(iso) : object 'iso' not found
Tom
  • 2,173
  • 1
  • 17
  • 44
  • Possible duplicate of [R - Fill missing dates by group](https://stackoverflow.com/questions/48633460/r-fill-missing-dates-by-group) – CPak Oct 16 '19 at 11:19
  • Yes, I did :Z. Your post is really asking two questions: how to complete sequence of contiguous values, and how to fill missing values with last value. Both of these questions have been asked and answered before. https://stackoverflow.com/questions/42570024/r-fill-missing-value-with-prior-values – CPak Oct 16 '19 at 12:08

1 Answers1

2

To replicate the last row, we can group_by iso and repeat the last row in each group

df %>%
  group_by(iso) %>%
  slice(c(1:n(), rep(n(), 2014 - last(year)))) %>%
  mutate(year = first(year):2014)

For the updated data, where year values are repeated

df %>%
  group_by(iso) %>%
  mutate(row = row_number()) %>%
  slice(row, rep(n(), 2014 - last(year))) %>%
  mutate(year = c(year[1:max(row)], year[max(row)] + 1:(2014 - year[max(row)]))) %>%
  select(-row)

#       A B         C     D     E     F iso    year
#   <int> <chr> <int> <int> <int> <int> <chr> <int>
# 1     1 Q        NA     3    NA    NA AUS    2011
# 2     1 NA       NA     4    NA    NA AUS    2012
# 3     0 Q        NA     3    NA    NA AUS    2011
# 4     0 NA       NA     4    NA    NA AUS    2012
# 5     0 NA       NA     4    NA    NA AUS    2013
# 6     0 NA       NA     4    NA    NA AUS    2014
# 7     1 NA        0    NA     3     2 BRA    2011
# 8     1 NA        0    NA     4    NA BRA    2012
# 9     0 NA        0    NA     3     2 BRA    2011
#10     0 NA        0    NA     4    NA BRA    2012
# … with 20 more rows
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thank you for your answer Ronak! When applying your solution to my actual data, I noticed I made a small mistake in my data example. I adapated your answer to it, but i am not completely succesful (see EDIT). Any idea what could be the issue? – Tom Oct 16 '19 at 12:01
  • @Tom The issue is `year` column has duplicated values hence the values can't be replicated properly. This part works correctly `df %>% group_by(iso) %>% slice(c(1:n(), rep(n(), 2014 - last(year))))` but assigning new years gives an error. – Ronak Shah Oct 16 '19 at 12:26
  • I cannot really figure out how to work around it.. Should I post a new question? – Tom Oct 16 '19 at 12:46
  • @Tom i have updated the answer. See if it works now ? – Ronak Shah Oct 16 '19 at 12:57
  • Hey Ronak, I have a small question about your answer. Should the part `group_by(iso) %>%` not be `group_by(iso, A) %>%`? In my actual data it otherwise just stops at the first value it encounters ( instead of doing the operation for each value of A). I have added the column, to the grouping and it seems to work. – Tom Nov 07 '19 at 15:48
  • Maybe one more thing haha.. I am trying to reverse your solution for dates before the first date until 2008 for example. I think the slice part should then become `slice(row, rep(n(), first(year)-2008 )) %>%` But I am not sure how the `mutate` part should look.. – Tom Nov 07 '19 at 16:01
  • @Tom Do you want to repeat first line or the last line till 2008 ? To repeat first line you can do `df %>% group_by(iso) %>% slice(c(1:n(), rep(1, first(year) - 2008))) %>% arrange(iso, year) %>% mutate(year = 2008:last(year))` – Ronak Shah Nov 08 '19 at 00:30
  • The first one indeed, thanks! I think I am encountering a problem with the last line, because of different group sizes (in the actual data): `Error: Column "year" must be length 36 (the group size) or one, not 39`. Any idea how to solve that? – Tom Nov 08 '19 at 09:58
  • Do you have some missing years ? Say 2010, 2011, 2014 2015 etc ? – Ronak Shah Nov 08 '19 at 10:55
  • Most likely yes.. I first checked for double years (and removed them), but that did not solve the problem. It is quite a large dataset.. – Tom Nov 08 '19 at 10:56
  • Try replacing the las `mutate` line with `mutate(year = c(2008:first(year), setdiff(year, first(year))))` – Ronak Shah Nov 08 '19 at 11:09