I have a data frame like this:

x y z country
1 4 1 USA
3 1 1 Canada
0 1 1 Spain
0 2 3 USA
4 1 1 Canada

I need to keep only the rows whose country appears at least 1000 times in the whole data frame. Let's say, for example, that only USA and Canada meet that condition. The problem is that I have more than 40 countries and 500,000 rows, so I can't check them one by one.

I suppose I need a "for" loop to do this, but I can't figure out how to write it.

Admovin

3 Answers

First get the names of the countries you want. Then subset by those names.

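tab <- table(df$country)                   # number of rows per country
mycountries <- names(tab[tab >= 1000])     # countries with at least 1000 rows
df <- df[df$country %in% mycountries, ]    # keep only rows from those countries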
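
To see the two steps at work, here is a minimal sketch on the five-row sample from the question. The variable min_n is a hypothetical stand-in for the threshold: it is set to 2 here only so the toy data gives a visible result, and would be 1000 on the real data.

```r
# Toy data copied from the question
df <- data.frame(
  x = c(1, 3, 0, 0, 4),
  y = c(4, 1, 1, 2, 1),
  z = c(1, 1, 1, 3, 1),
  country = c("USA", "Canada", "Spain", "USA", "Canada")
)

min_n <- 2  # hypothetical threshold; use 1000 on the real data

tab <- table(df$country)                 # rows per country
keep <- names(tab[tab >= min_n])         # countries meeting the threshold
result <- df[df$country %in% keep, ]     # rows from those countries only
```

On this sample, Spain appears only once, so its row is dropped and the USA and Canada rows remain.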
cory

With data.table, and assuming your data frame is named df, we can add a column named count that holds the total number of rows for each country, and then keep only the countries with at least 1000 rows:

library(data.table)
setDT(df)  # convert df to a data.table by reference

df[ , count := .N, by = country]  # rows per country, stored in a new column
df[count >= 1000]                 # keep countries with at least 1000 rows
DanY

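One possible solution using dplyr. This returns one row per country that appears at least 1000 times, together with its count, sorted from most to least frequent: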

library(dplyr)

df %>%
  group_by(country) %>%
  summarise(count = n()) %>%
  filter(count >= 1000) %>%
  arrange(desc(count))
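
The pipeline above returns a summary table of countries rather than the original rows. If the rows themselves are needed, one possible variant is to filter on the group size directly. This is a sketch on the question's sample data; min_n is a hypothetical threshold set to 2 for the toy example, and would be 1000 on the real data.

```r
library(dplyr)

# Toy data copied from the question
df <- data.frame(
  x = c(1, 3, 0, 0, 4),
  y = c(4, 1, 1, 2, 1),
  z = c(1, 1, 1, 3, 1),
  country = c("USA", "Canada", "Spain", "USA", "Canada")
)

min_n <- 2  # hypothetical threshold; use 1000 on the real data

kept <- df %>%
  group_by(country) %>%
  filter(n() >= min_n) %>%   # n() is the number of rows in the current group
  ungroup()
```

Here kept holds the USA and Canada rows with all of their original columns.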
OzanStats