0

I have a large dataset with 40 variables for 20 countries where some columns have NAs in the first (and last) years. I would like to extract the first and last year for which there is no NA in a column for a given country.

The code below successfully gives the first and last year for which there is no NA in the GDP column for the UK.

min(sort(table$year[table$Country=="UK"][which(!is.na(table$GDP))]))
[1] "1959"
max(sort(table$year[table$Country=="UK"][which(!is.na(table$GDP))]))
[1] "2020"

However, this manual approach is not efficient. Therefore, I was wondering if this could be done for all countries and all variables using e.g. the tidyverse? I would like to achieve an output that shows the min and max year without NA'S for every country and every variable:

 Country  Variable  min  max
1      UK       GDP 1959 2020
2      FR       GDP 1980 2020
3      IT       GDP 1980 2020
4      UK Inflation 1990 2022
5      FR Inflation 2000 2022
6      IT Inflation 2000 2022

Thank you!

  • 2
    Can you post a sample data so we can test the code? `dput(head(df, 20))` – Chamkrai Jan 18 '23 at 14:03
  • 1
    +1 for the suggestion/reminder to add a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). Adding a MRE and an example of the desired output (in code form, not tables and pictures) makes it much easier for others to find and test an answer to your question. That way you can help others to help you! You can use tidyverse and group by each country and each variable and summarise using min and max with the na.rm=TRUE attribute – dario Jan 18 '23 at 14:12

1 Answers1

1

You can pivot your data longer, remove rows where the value is NA, and take the min, max year:

library(dplyr)
library(tidyr)

pivot_longer(df, cols = -c(Country,year),names_to = "Variable") %>% 
  filter(!is.na(value)) %>% 
  group_by(Country, Variable) %>% 
  summarize(MinYear = min(year), MaxYear=max(year))

Note that this assumes that other than Country and year columns, all other columns are target variables of interest (i.e cols = -c(Country, year)). If this is not the case, there are other options, such as:

  • using a range of columns like thiscols = GDP:Inflation
  • place your target cols in a vector target_vars = c("GDP", "Inflation") and use cols = all_of(target_vars)
langtang
  • 22,248
  • 1
  • 12
  • 27
  • Thank you so much for your help, this worked like a charm and saved me so much time! Have a great rest of your day! – karlakolumna Jan 18 '23 at 14:59