1

I have a dataframe with a few character columns and a date column and a string column.

One of the columns is a list of cities and I'd like to know which cities show up the most in my dataset. I used table(dataframe$city), but it gave me a list of every city (including cities that show up just once or twice).

How do I filter the results of my city to show just the cities in the top quartile, based on the number of times they appear in the data?

here's example input :

id  price   city
1   $0.8    los angeles
2   $0.8    new york
3   $0.5    new york
4   $0.6    new york
5   $0.9    los angeles
6   $0.1    houston
7   $0.7    chicago
8   $0.8    new york
9   $0.7    new york
10  $0.0    new york
11  $0.5    new york
12  $0.1    new york
13  $0.9    new york
14  $0.3    los angeles
15  $0.9    los angeles
16  $0.9    los angeles
17  $0.8    los angeles
18  $0.5    miami
19  $0.9    boston
20  $1.0    newton
21  $0.2    san mateo
22  $0.3    milbrae

When I do table(dataframe$city), I get a list of every city and a count of how many times it appears. What if I just want a list of the cities that appear more than average (like new york and los angeles)?

MrFlick
  • 195,160
  • 17
  • 277
  • 295
Sebastian
  • 957
  • 3
  • 15
  • 27
  • 4
    It's easier to help you if you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and the desired output. – MrFlick Dec 12 '17 at 17:27
  • Simply save your table() results into an object and sort it. With base, soy can sort it with df[order(variale_to_be_sorted), ] – Nicolás Velasquez Dec 12 '17 at 17:33

5 Answers5

2

For one factor, table will return an array (similar to a vector). So in order to 'filter' it, you will need to subset it with whatever conditions you specify.

tbl <- table(dataframe$city)
quants <- quantile(tbl)

tbl[tbl >= quants['75%']]

Edit:

OP changed the request from top quartile to 'above average'. That's even simpler:

tbl <- table(dataframe$city)
tbl[tbl >= mean(tbl)]
David Klotz
  • 2,401
  • 1
  • 7
  • 16
2

Here's an example you can run:

dat <- sample(1:10, size = 200, replace = TRUE)

dat_table <- table(dat)

dat_table[dat_table >= quantile(dat_table, probs = 0.75 )]
1

Try this:

tab = table(dataframe$city)
# sort it
(sort_tab = sort(tab, decreasing = T))
# take only the top quarter
head(sort_tab, length(sort_tab) / 4)
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
1

A dplyr option using dummy data...


library(tidyverse)
#> ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
#> ✔ tibble  1.3.4     ✔ dplyr   0.7.4
#> ✔ tidyr   0.7.2     ✔ stringr 1.2.0
#> ✔ readr   1.1.1     ✔ forcats 0.2.0
#> Warning: package 'tidyr' was built under R version 3.4.2
#> Warning: package 'purrr' was built under R version 3.4.2
#> Warning: package 'dplyr' was built under R version 3.4.2
#> ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

city_data <- tibble(city = c("LA", "DC", "DC", "LA", "CL", "DC", "NY"), A = sample(1:10, 7), B = sample(1:10, 7))

city_data %>%
  count(city) %>%
  filter(., n> nrow(.)/4)
#> # A tibble: 2 x 2
#>    city     n
#>   <chr> <int>
#> 1    DC     3
#> 2    LA     2
Ryan
  • 281
  • 3
  • 12
1

Yet another example:

# some example data... pretend letters are cities
dat <- sample(letters,250,replace=T)

# take the full table
tab <- table(dat)

# Here's my inelegant solution:
ord <- sort(tab/sum(tab),decreasing=T)
len <- length(tab)
top25.percentile <- floor(len/4)
show.nms <- names(ord[1:top25.percentile])
tab[which(names(tab) %in% show.nms)]
Nate
  • 364
  • 1
  • 5