How do you plot a histogram of the terms that occur n or more times?

Question

I have a list of words coming straight from file, one per line, that I import with read.csv which produces a data.frame. What I need to do is to compute and plot the number of occurrences of each of these words. That, I can do easily, but the problem is that I have several hundred words, most of which occur just once or twice in the list, so I'm not interested in them.

EDIT: here is a sample wordlist that you can use to try: https://gist.github.com/anonymous/404a321840936bf15dd2#file-wordlist-csv. It isn't the same one I used; I can't share that, as it's actual data from actual experiments and I'm not allowed to share it. For all intents and purposes, this list is comparable.

A "simple"

df <- data.frame(table(words$word))
df[df$Freq > 2, ]

does the trick, I now have a list of the words that occur more than twice, as well as a hard headache as to why I have to go from a data.frame to an array and back to a data.frame just to do that, let alone the fact that I have to repeat the name of the data.frame in the actual selection string. Beats me completely.

The problem is that now the filtered data.frame is useless for charting. Suppose this is what I get after filtering

        Var1 Freq
6     aspect    3
24    colour    7
41    differ   18
55    featur    7
58  function   19
81      look    4
82      make    3
85      mean    7
95   opposit   14
108 properti    3
109   purpos    6
112    relat    3
116   rhythm    4
118    shape    6
120  similar    5
123    sound    3

obviously if I just do a

plot(df[df$Freq > 2, ])

I get this

which obviously (obviously?) has all the original terms on the x axis, while the y axis only shows the filtered values. So the next logical step is to try and force R's hand

plot(x=df[df$Freq > 2, ]$Var1, y=df[df$Freq > 2, ]$Freq)

But clearly R knows best and already did that, because I get the exact same result. Using ggplot2 things get a little better

qplot(x=df[df$Freq > 2, ]$Var1, y=df[df$Freq > 2, ]$Freq)

(yay for consistency) but I'd like that to show an actual histogram, y'know, with bars, like the ones they teach in sixth grade, so if I ask that

qplot(x=df[df$Freq > 2, ]$Var1, y=df[df$Freq > 2, ]$Freq) + geom_bar()

I get

Error : Mapping a variable to y and also using stat="bin".
  With stat="bin", it will attempt to set the y value to the count of cases in each group.
  This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
  If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
  If you want y to represent values in the data, use stat="identity".
  See ?geom_bar for examples. (Defunct; last used in version 0.9.2)

so let us try the last suggestion, shall we?

qplot(df[df$Freq > 2, ]$Var1, stat='identity') + geom_bar()

fair enough, but where are my bars? So, back to basics

qplot(words$word) + geom_bar() # even if geom_bar() is probably unnecessary this time

gives me this

Am I crazy or [substitute a long list of ramblings and complaints about R]?

It would have been nice to actually include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so we can reproduce the plot, and the ranting isn't really helpful. Chances are your words have been coded as factors. When you subset factors, they remember all their levels (which is often the desired behavior), but if you want to forget about those not included in the subset, you can use `droplevels()`. So I would guess `plot(droplevels(df[df$Freq > 2, ]))` would work, but there is no way to test that on your data. – MrFlick, Nov 27 '14 at 19:24
Granted that ranting isn't helpful. I did not include a full example because I explained how I got it: "I have a list of words coming straight from file, one per line, that I import with read.csv which produces a data.frame". I'm editing the question with a sample wordlist. – Morpheu5, Nov 27 '14 at 21:45
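The factor-level behaviour described in the first comment can be checked on a tiny made-up word list. This is only a sketch: the four words and their counts are invented for illustration and are not the asker's data, and the column is forced to a factor with `stringsAsFactors = TRUE` on the assumption that this matches how the original data was read.

```r
# Hypothetical word list (invented for illustration).
words <- data.frame(
  word = c(rep("differ", 3), rep("function", 4), "look", rep("shape", 2)),
  stringsAsFactors = TRUE
)

df <- data.frame(table(words$word))   # columns Var1 (a factor) and Freq

# Subsetting the rows keeps every original factor level in Var1.
kept_raw <- df[df$Freq > 2, ]
levels(kept_raw$Var1)                 # still lists all four words

# droplevels() discards the levels that no longer occur in the subset.
kept <- droplevels(kept_raw)
levels(kept$Var1)                     # only the words that survived the filter

plot(kept)                            # x axis now shows only the kept words
```

The unused levels are what `plot()` was drawing on the x axis in the question, so dropping them should be enough to get the expected chart.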

score 3 · Accepted Answer · answered Nov 27 '14 at 19:32

I generate some random data

set.seed(1)
df <- data.frame(Var1 = letters, Freq = sample(1:8, 26, replace = TRUE))

Then I use dplyr::filter because it is very fast and easy.

library(ggplot2); library(dplyr)
qplot(data = filter(df, Freq > 2), Var1, Freq, geom= "bar", stat = "identity")
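For readers on a recent ggplot2, where passing `stat` to `qplot()` may no longer be accepted, roughly the same chart can be sketched with `ggplot()` and `geom_col()`. This is an alternative spelling rather than the answerer's code; it assumes ggplot2 2.2.0 or later is installed, and it uses base `subset()` so that dplyr is not required.

```r
library(ggplot2)

# Same random data as in the answer.
set.seed(1)
df <- data.frame(Var1 = letters, Freq = sample(1:8, 26, replace = TRUE))

# Keep only the rows whose count is above the threshold.
kept <- subset(df, Freq > 2)

# geom_col() draws bars whose heights are the values already in Freq,
# which is what stat = "identity" asked for in the qplot() call.
p <- ggplot(kept, aes(x = Var1, y = Freq)) + geom_col()
print(p)
```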

Davide Passaretti

You can just use `subset` from base here, for the same exact result (`subset(df, Freq>2)`); no need to load dplyr, though I agree it's a useful library – arvi1000 Nov 27 '14 at 20:23
Now, if I type that qplot line, all I get is `Error in filter(df, Freq > 2) : object 'Freq' not found`, but if I use `subset`, no issues at all. Actually, with subset I get exactly the chart I wanted. Talk about inconsistencies... Anyway, I'm marking this as the correct answer because it got me closer, but it'd be nice if you could shed some light on the issue, for clarity's sake. – Morpheu5 Nov 27 '14 at 21:55
Just two first guesses: 1) are you sure you're using `dplyr::filter` and not `stats::filter` ? 2) are you sure you're working on a data.frame object and not on a matrix (`subset` works on all, whereas `dplyr::filter` works only on data.frame-type objects) – Davide Passaretti Nov 27 '14 at 22:27
Just to add my idea: the difference in computational time between `dplyr` (and `data.table`) functions and "base R" functions that work on data frames is very evident. If one is working on small datasets, it's OK to use base R, but otherwise why should one avoid loading packages? – Davide Passaretti Nov 27 '14 at 22:40
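The first guess in the comments above (a call reaching `stats::filter` instead of `dplyr::filter`) can be probed with a small sketch. The three-row data frame is invented for illustration, and whether this is what actually happened in the asker's session is an assumption, not something the thread confirms.

```r
# Hypothetical data (invented for illustration).
df <- data.frame(Var1 = c("a", "b", "c"), Freq = c(1, 5, 3))

# stats::filter() is a time-series filter. It treats its second argument
# as a vector of filter coefficients, so `Freq > 2` is looked up outside
# the data frame and is expected to fail when no such object exists.
msg <- tryCatch(stats::filter(df, Freq > 2),
                error = function(e) conditionMessage(e))
print(msg)

# Base subset() evaluates the condition inside the data frame.
kept_base <- subset(df, Freq > 2)

# Naming the package explicitly avoids any ambiguity about which
# filter() is called (only run if dplyr happens to be installed).
if (requireNamespace("dplyr", quietly = TRUE)) {
  kept_dplyr <- dplyr::filter(df, Freq > 2)
}
```

Writing `dplyr::filter(...)` in full, or checking `search()` for the order of attached packages, is a cheap way to rule this cause in or out.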

score 1 · Answer 2 · answered Nov 27 '14 at 21:54

First of all, at least with plot(), there's no reason to force a data.frame. plot() understands table objects. You can do

plot(table(words$words))
# or 
plot(table(words$words), type="p")
# or 
barplot(table(words$words))

We can use Filter to filter rows, unfortunately that drops the table class. But we can add that back on with as.table. This looks like

plot(as.table(Filter(function(x) x>2, table(words$words))), type="p")

MrFlick

Ok, see, that works, thank you for that. It still doesn't help if you're learning R and are required to go back and forth to work around its inconsistent behaviour. Why would you have a whole bunch of equivalent data structures that arbitrarily lose and gain their status depending on the whims of whatever function you pass them through? And that's how you spend an afternoon and end up ranting on SO. – Morpheu5 Nov 27 '14 at 22:00
I was trying to be fancy by using Filter. That's probably not the most common way to subset a table. Others might do `tt<-table(words$words);tt<-tt[tt>2]; barplot(tt)`. But it really isn't going to help to whine about R on this forum. We're not forcing you to use R. No language is perfect; you just need to learn how they work. – MrFlick Nov 27 '14 at 22:03
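The one-liner in the last comment, spelled out as a small sketch. The word list is invented for illustration, and the column is called `word` here to match the question, where the comment writes `words$words`.

```r
# Hypothetical word list (invented for illustration).
words <- data.frame(
  word = c(rep("differ", 3), rep("function", 4), "look", rep("shape", 2))
)

n <- 3                          # keep terms that occur n or more times

tt <- table(words$word)         # counts per distinct word
tt_kept <- tt[tt >= n]          # logical indexing on the counts themselves

# barplot() takes the named counts directly: one bar per remaining word.
barplot(tt_kept, ylab = "Occurrences")
```

Because the filtering happens on the table itself, no factor levels are carried along, and the bars are labelled with only the words that passed the threshold.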
