R: Selecting cases when a column condition is met

Question

I have a data frame like this:

x y z country
1 4 1 USA
3 1 1 Canada
0 1 1 Spain
0 2 3 USA
4 1 1 Canada

I need to select the rows for countries that appear at least 1000 times in the whole data frame. Let's say, for example, that only USA and Canada meet that condition. I have more than 40 countries and 500,000 cases, so I can't do it case by case.

I suppose I need a `for` loop to do this, but I can't figure out how to write it.

First count: `df$count <- ave(seq_along(df$country), df$country, FUN = length)`, then select: `df[df$count >= 1000, ]` – jogo, Sep 25 '18 at 13:45
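
For what it's worth, here is a minimal sketch of the count-then-subset idea from the comment above, run on made-up toy data. The threshold `min_n` is a hypothetical stand-in for 1000. Note that `ave()` takes its function through the named argument `FUN`, and a numeric first argument keeps the counts numeric.

```r
# Toy data (illustrative only): USA appears 3 times, Canada 2, Spain 1
df <- data.frame(
  country = c("USA", "Canada", "Spain", "USA", "Canada", "USA"),
  stringsAsFactors = FALSE
)

min_n <- 2  # hypothetical threshold standing in for 1000

# For every row, store how many rows share that row's country
df$count <- ave(seq_len(nrow(df)), df$country, FUN = length)

# Keep only the rows whose country meets the threshold
kept <- df[df$count >= min_n, ]
print(kept)
```

This keeps the helper column `count` in the result; drop it afterwards with `kept$count <- NULL` if it is not needed.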

score 0 · Accepted Answer · answered Sep 25 '18 at 13:47


First get the names of the countries you want. Then subset by those names.

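tab <- table(df$country)                   # number of rows per country
mycountries <- names(tab[tab >= 1000])     # countries with at least 1000 rows
df <- df[df$country %in% mycountries, ]    # keep only rows from those countries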
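
To make the two steps concrete, here is a small self-contained sketch of the same approach on made-up data. The values and the threshold `min_n` (2 instead of 1000) are assumptions chosen only so the example is easy to check by hand.

```r
# Toy data (illustrative only): USA appears 3 times, Canada 2, Spain 1
df <- data.frame(
  x = c(1, 3, 0, 0, 4, 2),
  country = c("USA", "Canada", "Spain", "USA", "Canada", "USA"),
  stringsAsFactors = FALSE
)

min_n <- 2  # hypothetical threshold standing in for 1000

tab <- table(df$country)                  # rows per country
mycountries <- names(tab[tab >= min_n])   # names of countries meeting the threshold
kept <- df[df$country %in% mycountries, ] # subset the original rows

print(table(kept$country))
```

No explicit loop is needed: `table()` does the counting in one pass and `%in%` is vectorised over all rows.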

cory


can be shortened to: `mycountries <- names(which(table(df$country) >= 1000)); df[df$country %in% mycountries, ]` – Andre Elrico Sep 25 '18 at 13:53

score 0 · Answer 2 · answered Sep 25 '18 at 13:48


With data.table, and assuming your data frame is named df, we can create a variable named count that holds the number of rows for each country, and then subset to only those countries with at least 1000 rows:

library(data.table)
setDT(df)

df[, count := .N, by = country]   # add the per-country row count by reference
df[count >= 1000]                 # keep rows from countries with at least 1000 rows
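
A minimal runnable sketch of the same idea on made-up data, assuming the data.table package is installed. The table name `dt` and the threshold `min_n` (2 instead of 1000) are hypothetical and only there to keep the example small.

```r
library(data.table)

# Toy data (illustrative only): USA appears 3 times, Canada 2, Spain 1
dt <- data.table(
  x = c(1, 3, 0, 0, 4, 2),
  country = c("USA", "Canada", "Spain", "USA", "Canada", "USA")
)

min_n <- 2  # hypothetical threshold standing in for 1000

dt[, count := .N, by = country]  # .N is the number of rows in each group
kept <- dt[count >= min_n]       # rows from countries meeting the threshold

print(kept)
```

Because `:=` modifies the table by reference, `dt` itself gains the `count` column; no copy of the 500,000 rows is made for the counting step.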

DanY


or `df[, if (.N >= 1000) .SD, country]` – jogo Sep 25 '18 at 13:53
or `df[df[, .(count=.N), country][count >= 1000, country], on="country"]` with a join – jogo Sep 25 '18 at 13:57

score 0 · Answer 3 · answered Sep 25 '18 at 13:51


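One possible solution using dplyr, which returns the qualifying countries with their counts (sorted by count) rather than the original rows: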

library(dplyr)

df %>%
  group_by(country) %>%
  summarise(count = n()) %>%
  filter(count >= 1000) %>%
  arrange(desc(count))
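
If the goal is to keep the original rows rather than a per-country summary, one possible variant is a grouped `filter()`. This is a sketch on made-up data, assuming dplyr is installed. The threshold `min_n` (2 instead of 1000) is hypothetical.

```r
library(dplyr)

# Toy data (illustrative only): USA appears 3 times, Canada 2, Spain 1
df <- data.frame(
  x = c(1, 3, 0, 0, 4, 2),
  country = c("USA", "Canada", "Spain", "USA", "Canada", "USA"),
  stringsAsFactors = FALSE
)

min_n <- 2  # hypothetical threshold standing in for 1000

kept <- df %>%
  group_by(country) %>%
  filter(n() >= min_n) %>%  # n() is the group size, so whole groups are kept or dropped
  ungroup()

print(kept)
```

Here `n()` is evaluated once per group, so every row of a country is kept or dropped together.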

OzanStats
