
I'm trying to group by the column "Subregion" to get the total population of each Subregion in the earliest and latest years found in the dataset (1950 and 2020), but the result is not grouped: I still get one row per country and age group. I tried removing and reordering some of the code lines, and nothing works.

first_year <- min(as.numeric(data.tidy$Year))
last_year <- max(as.numeric(data.tidy$Year))

Population.Growth.Subregion <- data.tidy %>%
  filter(Year %in% c(first_year, last_year), Population.Average %in% "other") %>%
  na.omit() %>%
  spread(Year, Population.Total) %>%
  group_by(Subregion) %>%
  mutate(Growth = 100*(
    (get(as.character(last_year))/get(as.character(first_year)))^
      (1/(last_year-first_year)) - 1)
  ) %>%
  print()

Returns

 Country Subregion       Code  Age.Group  Population.Average  `1950`   `2020` Growth
   <chr>   <chr>           <chr> <chr>      <chr>                <dbl>    <dbl>  <dbl>
 1 Algeria Northern Africa DZA   15_24      other              1724431  5910182   1.78
 2 Algeria Northern Africa DZA   25_64      other              3230562 21485130   2.74
 3 Algeria Northern Africa DZA   5_14       other              2199620  8457374   1.94
 4 Algeria Northern Africa DZA   65_or_over other               314503  2956839   3.25
 5 Algeria Northern Africa DZA   Under_5    other              1403134  5041518   1.84
 6 Angola  Middle Africa   AGO   15_24      other               884289  6415084   2.87
 7 Angola  Middle Africa   AGO   25_64      other              1705016 10482505   2.63
 8 Angola  Middle Africa   AGO   5_14       other              1085648  9453425   3.14
 9 Angola  Middle Africa   AGO   65_or_over other               133832   720250   2.43
10 Angola  Middle Africa   AGO   Under_5    other               739236  5795004   2.99
# … with 255 more rows

Edit

This is what the dataset looks like after running the snippet above:

Dataset After

That's how it looked beforehand

Dataset Before

That's what I wanted to get

What I want

Tal Levi
  • Welcome! Could you tell us what the desired output would look like? Ideally, you could also provide us with a small, reproducible code snippet that we can copy and paste to better understand the issue and test possible solutions. You can share datasets with `dput(YOUR_DATASET)` or smaller samples with `dput(head(YOUR_DATASET))`. (See [this answer](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example#5963610) for detailed instructions.) – ktiu Jun 04 '21 at 11:04
  • Just from looking at it, I would try to use the `summarise()` function instead of `mutate()` – Hansel Palencia Jun 04 '21 at 11:13
  • It would be easier to help if you create a small reproducible example along with expected output. Read about [how to give a reproducible example](http://stackoverflow.com/questions/5963269). – Ronak Shah Jun 04 '21 at 12:01
  • @HanselPalencia I tried that already and it just returns the Subregion and Growth columns, but not grouped by Subregion. I don't understand why `group_by` doesn't work with `summarise` or `mutate`; there aren't any errors either – Tal Levi Jun 04 '21 at 16:07
  • I edited the post and added pics – Tal Levi Jun 04 '21 at 16:19

1 Answer


I don't have your dataset, but I think it's structured similarly to the gapminder dataset in the gapminder package (i.e. some variables observed in several years for a set of countries grouped geographically). Let's have a look at that:

library(tidyverse)
gapminder::gapminder

looks like

# A tibble: 1,704 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# … with 1,694 more rows

Then if our objective is to "get the population number for each Subregion in the latest and earliest years (1950, 2020) found in the dataset", I would do the following:

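gapminder::gapminder %>% 
  # total population per continent and year; as.numeric() avoids integer overflow
  group_by(continent, year) %>% 
  summarize(total_pop = sum(as.numeric(pop))) %>% 
  # keep only the earliest and latest year within each continent
  arrange(year) %>% 
  group_by(continent) %>% 
  filter(row_number() %in% c(1, n())) %>% 
  # one column per year, then relative growth between the two
  pivot_wider(values_from = "total_pop", names_from = "year") %>% 
  mutate(growth = (`2007` - `1952`)/`1952`)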

which returns

# A tibble: 5 x 4
# Groups:   continent [5]
  continent     `1952`     `2007` growth
  <fct>          <dbl>      <dbl>  <dbl>
1 Africa     237640501  929539692  2.91 
2 Americas   345152446  898871184  1.60 
3 Asia      1395357351 3811953827  1.73 
4 Europe     418120846  586098529  0.402
5 Oceania     10686006   24549947  1.30 
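
Applied to the column names that appear in the question (Subregion, Year, Population.Total), the same idea might look roughly like the sketch below. The toy data frame and its values are invented for illustration, and the annualised growth formula is copied from the question rather than from the gapminder example. The key step is summing over countries (and, in the real data, age groups) with summarise() before reshaping.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data using the question's column names; values are invented.
data.tidy <- tibble(
  Country          = rep(c("A", "B", "C"), each = 2),
  Subregion        = rep(c("North", "North", "South"), each = 2),
  Year             = rep(c(1950, 2020), times = 3),
  Population.Total = c(100, 400, 200, 500, 300, 600)
)

first_year <- min(data.tidy$Year)
last_year  <- max(data.tidy$Year)

growth_by_subregion <- data.tidy %>%
  filter(Year %in% c(first_year, last_year)) %>%
  # collapse all rows of a subregion and year into one total
  group_by(Subregion, Year) %>%
  summarise(Population = sum(Population.Total), .groups = "drop") %>%
  # one column per year
  pivot_wider(names_from = Year, values_from = Population) %>%
  # annualised growth in percent, as in the question
  mutate(Growth = 100 * ((.data[[as.character(last_year)]] /
                            .data[[as.character(first_year)]])^
                           (1 / (last_year - first_year)) - 1))

print(growth_by_subregion)
```

With the real data, the `Population.Average` filter and `na.omit()` from the question would presumably go before the `group_by()` step.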
Andy Eggers
  • The returned result is what I want, and I tried that, but it still returns all the rows instead of rows grouped by subregion. The `group_by` just doesn't do any grouping. – Tal Levi Jun 04 '21 at 16:33
  • Considering this comment and your code above, I think you expect `group_by(Subregion)` to do what `group_by(Subregion) %>% summarize(pop = sum(Population))` does. `group_by()` just tells `dplyr` about the structure of the dataset, but it does not alter the dataset. The alteration comes with `summarize()` or `filter()` or `slice()` or other similar functions, applied to the groups you created with `group_by()`. – Andy Eggers Jun 05 '21 at 07:04
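
The distinction described in the last comment can be seen on a tiny example. This is a minimal sketch with a made-up data frame (the names `toy`, `Population` and `Share` are hypothetical): group_by() leaves the row count unchanged, mutate() computes per group but keeps every row, and only summarise() collapses each group to a single row.

```r
library(dplyr)

# Hypothetical toy data; names and values are invented for illustration.
toy <- tibble(
  Subregion  = c("North", "North", "South"),
  Population = c(10, 20, 30)
)

# group_by() only records the grouping; the rows are untouched
grouped <- toy %>% group_by(Subregion)
nrow(grouped)      # 3
n_groups(grouped)  # 2

# mutate() works within each group but returns one row per input row
with_share <- grouped %>% mutate(Share = Population / sum(Population))
nrow(with_share)   # 3

# summarise() returns one row per group
collapsed <- grouped %>% summarise(pop = sum(Population))
nrow(collapsed)    # 2
```

This is why the pipeline in the question, which ends in group_by() followed by mutate(), still prints one row per country and age group.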