I am using my own version of the gapminder data set and trying to see which country has realized the most growth from 2008 to 2018. When I use the original gapminder data it works fine, but for some reason I cannot replicate it on my own data set. The problem is that I cannot use na.locf() because all the "2008" rows populate before the "2018" rows. I am using the spread function, but it returns values in a way where I can't carry the last observation forward, and the group_by function does not seem to work.
# The code on the original data that works fine
library(gapminder)
library(dplyr)  # filter(), mutate(), %>%
library(tidyr)  # spread()
library(zoo)    # na.locf()
gapminder %>%
  filter(year %in% c(1952, 1957)) %>%
  spread(year, pop) %>%
  na.locf() %>%
  mutate(diff = `1957` - `1952`)
However, when I use my own data set (the structure is the same), spread reshapes the data in a way that makes it difficult to subtract one year from the other.
> class(gapminder_df$Year)
[1] "integer"
> class(gapminder_df$population)
[1] "numeric"
# and
> nrow(gapminder_df[gapminder_df$Year == "2018",])
[1] 134
> nrow(gapminder_df[gapminder_df$Year == "2008",])
[1] 134
top_10 <- gapminder_df %>%
  filter(Year %in% c("2008", "2018")) %>%
  spread(Year, population) %>%
  na.locf()
The first column has NAs for the first half of the rows and the second column has NAs for the second half, so I can't subtract one from the other. Adding group_by(country) doesn't give the desired result either:
  2018     2008     country
1   NA 27300000 Afghanistan
2   NA  2990000     Albania
3   NA 34900000     Algeria
4   NA 21800000      Angola
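For illustration, here is a minimal sketch that reproduces the same kind of NA pattern. It assumes the full data set carries at least one extra column whose value differs between years for the same country; the toy tibble and its `gdp` column are hypothetical stand-ins, not my real data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: `gdp` stands in for any extra column that
# differs between the two years for the same country.
toy <- tibble(
  country    = c("A", "A", "B", "B"),
  Year       = c(2008L, 2018L, 2008L, 2018L),
  population = c(100, 110, 200, 230),
  gdp        = c(1, 2, 3, 4)
)

# With `gdp` still present, every distinct (country, gdp) pair becomes
# its own row, so each row holds one year's value and NA for the other.
with_extra <- toy %>%
  spread(Year, population)

# With only country, Year and population, `country` is the sole
# identifier left, so both years land on the same row.
without_extra <- toy %>%
  select(country, Year, population) %>%
  spread(Year, population)

print(with_extra)
print(without_extra)
```

In this sketch `with_extra` has one row per country and year with NAs staggered across the two year columns, while `without_extra` has one row per country and no NAs.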
Here is a sample of the data:
library(tibble)
set.seed(1)  # make the random sample reproducible
gapminder_df <- tibble(
  country = rep(c("Afghanistan", "Albania", "Algeria", "Angola"), each = 4),
  Year = rep(c(2008L, 2009L, 2018L, 2004L), 4),  # integer, as in class(gapminder_df$Year)
  population = rnorm(16, mean = 5000000, sd = 50)
)
EDIT: I was able to fix it by selecting only the relevant columns before calling spread. Can someone explain to me why that worked? My guess is that I had the same year repeated for the same country, with different values in the other variables?
top_10 <- gapminder_df %>%
  select(country, Year, population) %>%
  filter(Year %in% c("2008", "2018")) %>%
  spread(Year, population)
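With both years on the same row, the subtraction and ranking could then look roughly like the sketch below. The toy data and the `gdp` column are made up for illustration, and `diff` is just the column name borrowed from the original gapminder code above.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for gapminder_df with made-up numbers.
toy <- tibble(
  country    = rep(c("A", "B", "C"), each = 2),
  Year       = rep(c(2008L, 2018L), times = 3),
  population = c(100, 110, 200, 260, 50, 55),
  gdp        = c(1, 2, 3, 4, 5, 6)
)

growth <- toy %>%
  select(country, Year, population) %>%   # drop columns that vary by year
  filter(Year %in% c(2008, 2018)) %>%
  spread(Year, population) %>%            # one row per country
  mutate(diff = `2018` - `2008`) %>%      # absolute growth
  arrange(desc(diff))                     # largest growth first

print(growth)
```

On the real data, adding `head(10)` at the end would keep only the ten largest increases. The tidyr documentation lists spread() as superseded by pivot_wider(), so `pivot_wider(names_from = Year, values_from = population)` should be usable in place of the spread() call here.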