How to find the latest data by variable

Question

I would like to know how to extract data by latest date by coding.

cname       year   x1 x2 x3 x4
Afghanistan 2015   3  2  6  3
Afghanistan 2016   4  7  NA 9
Afghanistan 2017   5  NA NA NA  
Albania     2015   2  3  4  3
Albania     2016   2  4  NA NA
Albania     2017   4  NA 8  NA  
Algeria     2015   NA NA NA NA
Algeria     2016   NA NA NA NA
Algeria     2017   NA NA NA NA
...

The answer I would like to get

 cname          x1 x2 x3 x4
   Afghanistan   5  7  6  9
   Albania       4  4  8  3
   Algeria      NA NA NA NA
...

I thied to find the answers here, but what I've found is only subsetting the last observation by grouping. Thanks for your help!

Ben · Answer 1 · 2018-11-26T18:51:45.277

2

Here's a solution with zoo and dplyr:

library(zoo)
library(dplyr)
df <- read.table("clipboard", header=T)
df %>%
  group_by(cname) %>%
  arrange(year) %>%
  mutate_all(funs(na.locf(., na.rm = FALSE))) %>% 
  filter(year==max(year))

# A tibble: 3 x 6
# Groups:   cname [3]
  cname        year    x1    x2    x3    x4
  <fct>       <int> <int> <int> <int> <int>
1 Afghanistan  2017     5     7     6     9
2 Albania      2017     4     4     8     3
3 Algeria      2017    NA    NA    NA    NA

edited Nov 26 '18 at 18:51

answered Nov 26 '18 at 15:34

Ben

784
5
14

1

use `filter(year==max(year))` if there was a group without data for 2017 (generally hard-coding values is not a good idea). – M-- Nov 26 '18 at 18:45
i edited the answer according to your hepful suggestion! – Ben Nov 26 '18 at 18:52

score 1 · Answer 2 · answered Nov 26 '18 at 15:31

Here is an idea via base R, however using coalesce from dplyr to 'merge' the NAs with non-NAs, i.e.

sapply(split(df, df$cname), function(i) { d2 <- data.frame(t(i[order(i$year, decreasing = TRUE), -1])); 
                                          do.call(dplyr::coalesce, d2) })

which gives,

     Afghanistan Albania Algeria
[1,]         2017    2017    2017
[2,]           5       4      NA
[3,]           7       4      NA
[4,]           6       8      NA
[5,]           9       3      NA

Transpose the above result to get your desired output (or wrap it in data.frame, or handle however you need), i.e.

t(d3)
            [,1] [,2] [,3] [,4] [,5]
Afghanistan 2017    5    7    6    9
Albania     2017    4    4    8    3
Algeria     2017   NA   NA   NA   NA

DATA

dput(df)
structure(list(cname = c("Afghanistan", "Afghanistan", "Afghanistan", 
"Albania", "Albania", "Albania", "Algeria", "Algeria", "Algeria"
), year = c(2015L, 2016L, 2017L, 2015L, 2016L, 2017L, 2015L, 
2016L, 2017L), x1 = c(3L, 4L, 5L, 2L, 2L, 4L, NA, NA, NA), x2 = c(2L, 
7L, NA, 3L, 4L, NA, NA, NA, NA), x3 = c(6L, NA, NA, 4L, NA, 8L, 
NA, NA, NA), x4 = c(3L, 9L, NA, 3L, NA, NA, NA, NA, NA)), row.names = c(NA, 
-9L), class = "data.frame")

score 0 · Answer 3 · answered Nov 26 '18 at 15:29

This is working on your example, I haven't tested on other possibilites:

Note that d needs to be ordered by cname and year before calling the function (like in your example).

l <- split(d, d$cname) #  we split each in a list
l <- lapply(l, function(x) x[, -c(1,2)]) # remove non useful infos
#l <- split(d[, -c(1,2)], d$cname) # this avoids this second line

# this basically seeks for the last non NA value, otherwise uses NA if all are NA
ll <- lapply(l, function(x) {
  if (!all(is.na(x))) {
    sapply(x, function(y) last(y[!is.na(y)])) # inside each element in the list I search for last non-NA
  } else {
    NA
  }
})

t(as.data.frame(ll))
#             x1 x2 x3 x4
# Afghanistan  5  7  6  9
# Albania      4  4  8  3
# Algeria     NA NA NA NA

Basically I used list, and seek for the last non NA value. I encourage to print each passage to see what's happening.

This could turn into a function:

my_function <- function(data) {
  l <- split(data, data$cname)
  l <- lapply(l, function(x) x[, -c(1,2)])

  ll <- lapply(l, function(x) {
    if (!all(is.na(x))) {
      sapply(x, function(y) last(y[!is.na(y)]))
    } else {
      NA
    }
  })

  t(as.data.frame(ll)) # return
}
my_function(d)
#             x1 x2 x3 x4
# Afghanistan  5  7  6  9
# Albania      4  4  8  3
# Algeria     NA NA NA NA

Data used:

tt<-"cname       year   x1 x2 x3 x4
Afghanistan 2015   3  2  6  3
Afghanistan 2016   4  7  NA 9
Afghanistan 2017   5  NA NA NA  
Albania     2015   2  3  4  3
Albania     2016   2  4  NA NA
Albania     2017   4  NA 8  NA  
Algeria     2015   NA NA NA NA
Algeria     2016   NA NA NA NA
Algeria     2017   NA NA NA NA"

d <- read.table(text=tt, header = T)

score 0 · Answer 4 · answered Nov 26 '18 at 16:27

Here's a dplyr/tidyr solution. I'm grouping by name, making sure observations are ordered by year, then using tidyr::fill to replace NAs with the most recent non-NA value above, within groups.

I'd also recommend filtering for when the year is equal to the maximum year, i.e. the most recent year, rather than hard-coding the year you want to keep. That way it scales well in case you have new data with years more recent than 2017. The way I have it set up now filters for the latest year within each country; if, however, you need to filter for the latest year for all countries, add a call to ungroup before filtering.

library(dplyr)
library(tidyr)

df %>%
  group_by(cname) %>%
  arrange(year) %>%
  fill(x1:x4) %>%
  filter(year == max(year)) %>%
  select(-year)
#> # A tibble: 3 x 5
#> # Groups:   cname [3]
#>   cname          x1    x2    x3    x4
#>   <chr>       <int> <int> <int> <int>
#> 1 Afghanistan     5     7     6     9
#> 2 Albania         4     4     8     3
#> 3 Algeria        NA    NA    NA    NA

^{Created on 2018-11-26 by the reprex package (v0.2.1)}

How to find the latest data by variable

4 Answers4