
I am using my own version of the gapminder data set and trying to see which country has realized the most growth from 2008 to 2018. When I use the original gapminder data it works fine, but for some reason I cannot replicate it on my own data set. The problem is that I cannot use na.locf() because all the "2008" rows are populated before the "2018" rows.

I am using the spread function, but it returns values in a way where I can't carry the last observation forward, and the group_by function does not seem to help.

# The code on the original data that works fine
library(gapminder)
library(dplyr)  # %>%, filter(), mutate()
library(tidyr)  # spread()
library(zoo)    # na.locf()

gapminder %>% 
  filter(year %in% c("1952", "1957")) %>% 
  spread(year, pop) %>% 
  na.locf() %>% 
  mutate(diff = `1957` - `1952`)

However, when I use my own data set (the structure is the same), spread reshapes the data in a way that makes the subtraction difficult:

> class(gapminder_df$Year)
[1] "integer"

> class(gapminder_df$population)
[1] "numeric"

# and 

> nrow(gapminder_df[gapminder_df$Year == "2018",])
[1] 134
> nrow(gapminder_df[gapminder_df$Year == "2008",])
[1] 134

top_10 <- gapminder_df %>% 
  filter(Year %in% c("2008", "2018")) %>%
  spread(Year, population) %>% 
  na.locf()

The first column has NAs for the first half of the rows and the second column has NAs for the second half, so I can't subtract one from the other. Adding group_by(country) doesn't give the desired result either:

  2018     2008     country
1   NA 27300000 Afghanistan
2   NA  2990000     Albania
3   NA 34900000     Algeria
4   NA 21800000      Angola

Here is a sample of the data:

library(tibble)

gapminder_df <- tibble(
  country = c(rep("Afghanistan", 4), rep("Albania", 4), rep("Algeria", 4), rep("Angola", 4)),
  Year = rep(c("2008", "2009", "2018", "2004"), 4),
  population = rnorm(16, mean = 5000000, sd = 50)
)

EDIT: I was able to fix it by selecting only the relevant columns before spread. Can someone explain to me why that worked? I guess I had multiple rows with the same year for the same country, each with different values in the other variables?


top_10 <- gapminder_df %>%
  select(country, Year, population) %>% 
  filter(Year %in% c("2008", "2018")) %>%
  spread(Year, population) 
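
Once there is one row per country, the growth calculation is a plain column subtraction. The following is a minimal sketch on made-up numbers, not the real data: the `toy` table, its `gdp` column and the `growth` column name are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data: gdp is an extra column whose value differs by year
toy <- tibble(
  country    = rep(c("A", "B"), each = 2),
  Year       = rep(c(2008L, 2018L), times = 2),
  population = c(100, 90, 200, 350),
  gdp        = c(10, 12, 5, 6)
)

top_growth <- toy %>%
  select(country, Year, population) %>%   # drop columns that vary by year
  filter(Year %in% c(2008, 2018)) %>%
  spread(Year, population) %>%            # now one row per country
  mutate(growth = `2018` - `2008`) %>%    # no na.locf() needed
  arrange(desc(growth))                   # largest growth first

print(top_growth)
```

On the real data, adding `head(10)` at the end of the pipeline would keep the ten countries with the largest growth.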

Brennan Beal
  • Please provide a small example of `gapminder_df` so we can take a look at things. If you need ideas for how to share an example of your data, see [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – aosmith May 03 '19 at 16:26
  • ```gapminder_df <- tibble( Country = c(rep("Afganistan", 4), rep("Albania", 4), rep("Algeria",4),rep("Angola",4)), Year = rep(c("2008", "2009", "2018", "2004"), 4), population = rnorm(16, mean = 5000000, sd = 50) )``` – Brennan Beal May 03 '19 at 16:35
  • I think that should do it! – Brennan Beal May 03 '19 at 16:35
  • Actually, my code works on the sample data too... ? Just not my actual data... what in the world. The structures are the same for all relevant variables? – Brennan Beal May 03 '19 at 16:39
  • It looks like the order of the `gapminder` output is based on the order of some of the additional variables in the dataset. Since `lifeExp`/`gdpPercap` are always higher for the second year, you get `1952` listed before `1957`. If you create another column in your dataset that does not have that order, you can reproduce your issue. Removing the extra columns will help (as you saw), since then the `pop` info can be `spread()` into one row instead of two. – aosmith May 03 '19 at 16:45
  • Yea, that worked... holy cow, I have looked at that problem for, like... an hour. So frustrating. Thanks for the help! – Brennan Beal May 03 '19 at 16:47
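
The behaviour described in the comments can be reproduced on a toy table. This is a sketch with made-up numbers, where `other` is a hypothetical stand-in for any extra column (such as `lifeExp`) whose value differs between the two years. `spread()` treats every column that is not the key or the value as an identifier, so each country/`other` combination gets its own row. The last variant assumes tidyr 1.0.0 or later, where `pivot_wider()` and its `id_cols` argument are available.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data: `other` changes between 2008 and 2018
toy <- tibble(
  country    = rep(c("A", "B"), each = 2),
  Year       = rep(c(2008L, 2018L), times = 2),
  population = c(100, 90, 200, 350),
  other      = c(1, 2, 3, 4)
)

# Extra column kept: two rows per country, each half filled with NA
wide_bad <- toy %>%
  spread(Year, population)

# Extra column dropped first: one complete row per country
wide_ok <- toy %>%
  select(country, Year, population) %>%
  spread(Year, population)

# Same result with pivot_wider(); id_cols names the identifying columns,
# so `other` is dropped without a separate select()
wide_pivot <- toy %>%
  pivot_wider(id_cols = country, names_from = Year, values_from = population)

print(wide_bad)
print(wide_ok)
```

Naming the identifier columns explicitly avoids depending on which other variables happen to be in the table.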
