How do I filter the results in table()

Question

I have a dataframe with a few character columns and a date column and a string column.

One of the columns is a list of cities and I'd like to know which cities show up the most in my dataset. I used table(dataframe$city), but it gave me a list of every city (including cities that show up just once or twice).

How do I filter the results of my city to show just the cities in the top quartile, based on the number of times they appear in the data?

here's example input :

id  price   city
1   $0.8    los angeles
2   $0.8    new york
3   $0.5    new york
4   $0.6    new york
5   $0.9    los angeles
6   $0.1    houston
7   $0.7    chicago
8   $0.8    new york
9   $0.7    new york
10  $0.0    new york
11  $0.5    new york
12  $0.1    new york
13  $0.9    new york
14  $0.3    los angeles
15  $0.9    los angeles
16  $0.9    los angeles
17  $0.8    los angeles
18  $0.5    miami
19  $0.9    boston
20  $1.0    newton
21  $0.2    san mateo
22  $0.3    milbrae

When I do table(dataframe$city), I get a list of every city and a count of how many times it appears. What if I just want a list of the cities that appear more than average (like new york and los angeles)?

It's easier to help you if you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and the desired output. — MrFlick, Dec 12 '17 at 17:27
Simply save your table() results into an object and sort it. With base, soy can sort it with df[order(variale_to_be_sorted), ] — Nicolás Velasquez, Dec 12 '17 at 17:33

David Klotz · Accepted Answer · 2017-12-12T17:44:43.860

2

For one factor, table will return an array (similar to a vector). So in order to 'filter' it, you will need to subset it with whatever conditions you specify.

tbl <- table(dataframe$city)
quants <- quantile(tbl)

tbl[tbl >= quants['75%']]

Edit:

OP changed the request from top quartile to 'above average'. That's even simpler:

tbl <- table(dataframe$city)
tbl[tbl >= mean(tbl)]

edited Dec 12 '17 at 17:44

answered Dec 12 '17 at 17:38

David Klotz

2,401
1
7
16

score 2 · Answer 2 · answered Dec 12 '17 at 17:39

2

Here's an example you can run:

dat <- sample(1:10, size = 200, replace = TRUE)

dat_table <- table(dat)

dat_table[dat_table >= quantile(dat_table, probs = 0.75 )]

answered Dec 12 '17 at 17:39

lukembrowne

23
4

score 1 · Answer 3 · answered Dec 12 '17 at 17:35

1

Try this:

tab = table(dataframe$city)
# sort it
(sort_tab = sort(tab, decreasing = T))
# take only the top quarter
head(sort_tab, length(sort_tab) / 4)

answered Dec 12 '17 at 17:35

Gregor Thomas

136,190
20
167
294

Might have an issue with ties (in terms of frequencies)? – AntoniosK Dec 12 '17 at 17:37
1

Yes, this is definitely a "quick and dirty" solution. Really just demonstrating how far you can get with very basic commands. – Gregor Thomas Dec 12 '17 at 17:42

score 1 · Answer 4 · answered Dec 12 '17 at 17:39

A dplyr option using dummy data...

library(tidyverse)
#> ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
#> ✔ tibble  1.3.4     ✔ dplyr   0.7.4
#> ✔ tidyr   0.7.2     ✔ stringr 1.2.0
#> ✔ readr   1.1.1     ✔ forcats 0.2.0
#> Warning: package 'tidyr' was built under R version 3.4.2
#> Warning: package 'purrr' was built under R version 3.4.2
#> Warning: package 'dplyr' was built under R version 3.4.2
#> ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

city_data <- tibble(city = c("LA", "DC", "DC", "LA", "CL", "DC", "NY"), A = sample(1:10, 7), B = sample(1:10, 7))

city_data %>%
  count(city) %>%
  filter(., n> nrow(.)/4)
#> # A tibble: 2 x 2
#>    city     n
#>   <chr> <int>
#> 1    DC     3
#> 2    LA     2

score 1 · Answer 5 · answered Dec 12 '17 at 17:44

Yet another example:

# some example data... pretend letters are cities
dat <- sample(letters,250,replace=T)

# take the full table
tab <- table(dat)

# Here's my inelegant solution:
ord <- sort(tab/sum(tab),decreasing=T)
len <- length(tab)
top25.percentile <- floor(len/4)
show.nms <- names(ord[1:top25.percentile])
tab[which(names(tab) %in% show.nms)]

How do I filter the results in table()

5 Answers5