
I am using the DemocracyIncome dataset, which is available in the R package pder. The code to load it is as follows:

library(pder)
data("DemocracyIncome", package = "pder")
df <- na.omit(DemocracyIncome)

Part of the dataset looks like this:

country                  year               democracy               income              sample
Angola                1965-1969             0.1200000              7.963571                0
Angola                1975-1979             0.1666667              7.642973                0
Angola                1980-1984             0.0000000              7.563512                1
Angola                1985-1989             0.0000000              7.528483                1
Angola                1990-1994             0.0000000              7.573770                1
Angola                1995-1999             0.1666667              7.132994                1
Albania               1995-1999             0.6666667              7.947575                1
Albania               2000-2004             0.5000000              8.115600                1
Argentina             1950-1954             0.4900000              8.768732                0
Argentina             1955-1959             0.3000000              8.833524                0
Argentina             1960-1964             0.6300000              8.905374                1
...

Now I want to create a new dataset containing only the first observation of each country. The expected result is:

country                  year               democracy               income              sample
Angola                1965-1969             0.1200000              7.963571                0
Albania               1995-1999             0.6666667              7.947575                1
Argentina             1950-1954             0.4900000              8.768732                0
...

How can I filter df to get this new dataset?

w12345678

2 Answers


We can use duplicated in base R to keep the first row for each country:

df_filter <- df[!duplicated(df$country),]
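To see why this works, here is a minimal sketch on a hypothetical toy data frame (the name toy and its values are made up for illustration, they are not taken from DemocracyIncome). duplicated() flags every repeat of a value after its first occurrence, so negating it keeps only the first row per country in the current row order.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  country = c("A", "A", "B", "B", "B", "C"),
  year    = c("1960-1964", "1965-1969", "1950-1954",
              "1955-1959", "1960-1964", "1995-1999"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose country has already appeared further up
dup <- duplicated(toy$country)
# dup is FALSE TRUE FALSE TRUE TRUE FALSE

# Keep only the rows that are the first occurrence of their country
toy_first <- toy[!dup, ]
# toy_first holds rows 1, 3 and 6 (countries A, B, C)
```

Note that "first" here means first in row order. If the rows were not already sorted by year within each country, they would need to be sorted before this step.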

Or with distinct from dplyr:

library(dplyr)
distinct(df, country, .keep_all = TRUE)
akrun

You can try this approach, which sorts by year before taking the first row of each country:

library(dplyr)
df %>% 
  group_by(country) %>% 
  arrange(year) %>% 
  slice(1) %>% 
  ungroup()

#   country   year      democracy income sample
#   <chr>     <chr>         <dbl>  <dbl>  <int>
# 1 Albania   1995-1999     0.667   7.95      1
# 2 Angola    1965-1969     0.12    7.96      0
# 3 Argentina 1950-1954     0.49    8.77      0
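For comparison, the same sort-then-take-first idea can probably be written in base R as well. This is only a sketch on hypothetical toy data (toy and its values are invented for illustration, with rows deliberately out of order), and it uses order() plus duplicated() in place of the dplyr verbs above.

```r
# Hypothetical toy data with rows deliberately out of chronological order
toy <- data.frame(
  country = c("B", "A", "B", "A", "C"),
  year    = c("1960-1964", "1965-1969", "1950-1954",
              "1960-1964", "1995-1999"),
  stringsAsFactors = FALSE
)

# Sort by country, then by year; the "YYYY-YYYY" labels sort correctly as text
toy_sorted <- toy[order(toy$country, toy$year), ]

# After sorting, the first row per country is its earliest period
toy_first <- toy_sorted[!duplicated(toy_sorted$country), ]
# toy_first: A 1960-1964, B 1950-1954, C 1995-1999
```

As with the dplyr version, the result comes back ordered by country rather than in the original row order.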

Data

df <- structure(list(country = c("Angola", "Angola", "Angola", "Angola", 
"Angola", "Angola", "Albania", "Albania", "Argentina", "Argentina", 
"Argentina"), year = c("1965-1969", "1975-1979", "1980-1984", 
"1985-1989", "1990-1994", "1995-1999", "1995-1999", "2000-2004", 
"1950-1954", "1955-1959", "1960-1964"), democracy = c(0.12, 0.1666667, 
0, 0, 0, 0.1666667, 0.6666667, 0.5, 0.49, 0.3, 0.63), income = c(7.963571, 
7.642973, 7.563512, 7.528483, 7.57377, 7.132994, 7.947575, 8.1156, 
8.768732, 8.833524, 8.905374), sample = c(0L, 0L, 1L, 1L, 1L, 
1L, 1L, 1L, 0L, 0L, 1L)), class = "data.frame", row.names = c(NA, 
-11L))
Tho Vu