-1

this might be a simple question but I didn't find a solution among the "similar questions" - surely it has been asked though. Anyways, if there's an answer to my question elsewhere, please do let me know..

Now, to my problem. I have a data that looks something like this:

enter image description here

But I'd like to structure it like this:

enter image description here

In other words, country-year structure. Without double observations per year and subject. Any suggestions would be great!

FKG
  • 285
  • 1
  • 4
  • 17

3 Answers3

1

We can use aggregate from base R (no packages needed)

aggregate(Air_pollution ~ ., df, FUN = sum)

-output

#   country_code country year Air_pollution
#1           22       A 2000             6
#2           44       B 2000             2
#3           66       C 2000            10
#4           88       D 2000             7
#5           22       A 2001             2
#6           88       D 2001            15

Or with dplyr

library(dplyr)
df %>%
    group_by(country_code, country, year) %>%
    summarise(Air_pollution = sum(Air_pollution), .groups = 'drop')

-output

# A tibble: 6 x 4
#  country_code country  year Air_pollution
#         <dbl> <chr>   <dbl>         <dbl>
#1           22 A        2000             6
#2           22 A        2001             2
#3           44 B        2000             2
#4           66 C        2000            10
#5           88 D        2000             7
#6           88 D        2001            15

data

df <- structure(list(country_code = c(22, 22, 22, 44, 44, 66, 88, 88, 
88, 88), country = c("A", "A", "A", "B", "B", "C", "D", "D", 
"D", "D"), year = c(2000, 2000, 2001, 2000, 2000, 2000, 2000, 
2001, 2001, 2001), Air_pollution = c(5, 1, 2, 1, 1, 10, 7, 5, 
5, 5)), class = "data.frame", row.names = c(NA, -10L))
akrun
  • 874,273
  • 37
  • 540
  • 662
1

A data.table option

> setDT(df)[, .(Air_pollution = sum(Air_pollution)), by = country_code:year]
   country_code country year Air_pollution
1:           22       A 2000             6
2:           22       A 2001             2
3:           44       B 2000             2
4:           66       C 2000            10
5:           88       D 2000             7
6:           88       D 2001            15

Data

> dput(df)
structure(list(country_code = c(22, 22, 22, 44, 44, 66, 88, 88,
88, 88), country = c("A", "A", "A", "B", "B", "C", "D", "D", 
"D", "D"), year = c(2000, 2000, 2001, 2000, 2000, 2000, 2000,
2001, 2001, 2001), Air_pollution = c(5, 1, 2, 1, 1, 10, 7, 5,
5, 5)), class = "data.frame", row.names = c(NA, -10L))
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
0

Try:

library(tidyverse)
new <- df %>% group_by(country_coo,country,year) %>%
  summarise(Air_pollution=sum(Air_pollution,na.rm = T))

Output:

# A tibble: 6 x 4
# Groups:   country_coo, country [4]
  country_coo country  year Air_pollution
        <dbl> <chr>   <dbl>         <dbl>
1          22 A        2000             6
2          22 A        2001             1
3          44 B        2000             2
4          66 C        2000            10
5          88 D        2000             7
6          88 D        2001            15

Some data used:

#Data
df <- structure(list(country_coo = c(22, 22, 22, 44, 44, 66, 88, 88, 
88, 88), country = c("A", "A", "A", "B", "B", "C", "D", "D", 
"D", "D"), year = c(2000, 2000, 2001, 2000, 2000, 2000, 2000, 
2001, 2001, 2001), Air_pollution = c(5, 1, 1, 1, 1, 10, 7, 5, 
5, 5)), class = "data.frame", row.names = c(NA, -10L))
Duck
  • 39,058
  • 13
  • 42
  • 84