I have a list of words coming straight from a file, one per line, which I import with read.csv, producing a data.frame. I need to compute and plot the number of occurrences of each of these words. That I can do easily, but the problem is that I have several hundred words, most of which occur just once or twice in the list, so I'm not interested in them.
EDIT: here is a sample word list that you can use to try this: https://gist.github.com/anonymous/404a321840936bf15dd2#file-wordlist-csv. It isn't the one I used; I can't share that, as it's actual data from actual experiments and I'm not allowed to. For all intents and purposes, this list is comparable.
A "simple"
df <- data.frame(table(words$word))
df[df$Freq > 2, ]
does the trick: I now have a list of the words that occur more than twice, as well as a splitting headache as to why I have to go from a data.frame to a table and back to a data.frame just to do that, let alone the fact that I have to repeat the name of the data.frame in the actual selection expression. Beats me completely.
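For what it's worth, the round trip can at least be written a bit more compactly. This is only a sketch under my own assumptions: the `words` data frame below is a made-up stand-in for the real file, and `counts` and `frequent` are names chosen for illustration. `as.data.frame()` on a table gives the same `Var1`/`Freq` columns, and `subset()` avoids repeating the data frame name inside the condition.

```r
# Hypothetical stand-in for the data frame that read.csv would produce.
words <- data.frame(
  word = c("colour", "colour", "colour", "shape", "shape", "sound"),
  stringsAsFactors = TRUE
)

# table() counts each distinct word; as.data.frame() turns the counts
# into a data frame with columns Var1 (the word) and Freq (the count).
counts <- as.data.frame(table(words$word))

# subset() evaluates the condition inside the data frame,
# so the data frame name does not have to be repeated.
frequent <- subset(counts, Freq > 2)
print(frequent)
```

It still goes through a table on the way, so it does not remove the conversion, it only hides it on one line.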
The problem is that the filtered data.frame is now useless for charting. Suppose this is what I get after filtering:
        Var1 Freq
6     aspect    3
24    colour    7
41    differ   18
55    featur    7
58  function   19
81      look    4
82      make    3
85      mean    7
95   opposit   14
108 properti    3
109   purpos    6
112    relat    3
116   rhythm    4
118    shape    6
120  similar    5
123    sound    3
obviously if I just do a
plot(df[df$Freq > 2, ])
I get this
which obviously (obviously?) has all the original terms on the x axis, while the y axis only shows the filtered values. So the next logical step is to try and force R's hand
plot(x=df[df$Freq > 2, ]$Var1, y=df[df$Freq > 2, ]$Freq)
But clearly R knows best and already did that, because I get the exact same result. Using ggplot2 things get a little better
qplot(x=df[df$Freq > 2, ]$Var1, y=df[df$Freq > 2, ]$Freq)
(yay for consistency) but I'd like that to show an actual histogram, y'know, with bars, like the ones they teach in sixth grade, so if I ask that
qplot(x=df[df$Freq > 2, ]$Var1, y=df[df$Freq > 2, ]$Freq) + geom_bar()
I get
Error : Mapping a variable to y and also using stat="bin".
With stat="bin", it will attempt to set the y value to the count of cases in each group.
This can result in unexpected behavior and will not be allowed in a future version of ggplot2.
If you want y to represent counts of cases, use stat="bin" and don't map a variable to y.
If you want y to represent values in the data, use stat="identity".
See ?geom_bar for examples. (Defunct; last used in version 0.9.2)
so let us try the last suggestion, shall we?
qplot(df[df$Freq > 2, ]$Var1, stat='identity') + geom_bar()
Fair enough, but where are my bars? So, back to basics
qplot(words$word) + geom_bar() # even if geom_bar() is probably unnecessary this time
gives me this
Am I crazy or [substitute a long list of ramblings and complaints about R]?
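For anyone trying to reproduce this, here is a minimal sketch of what I suspect is going on, using a tiny made-up `df` rather than my real counts (the word "rare" and the name `filtered` are hypothetical). `Var1` is a factor, and subsetting rows keeps all of its original levels, which would explain why every original term still shows up on the x axis. `droplevels()` removes the unused ones. The ggplot2 part assumes the package is installed and follows the last suggestion in the error message by passing `stat = "identity"` to `geom_bar()` itself.

```r
# Hypothetical stand-in for the counts: three frequent words and one rare one.
df <- data.frame(
  Var1 = factor(c("aspect", "colour", "differ", "rare")),
  Freq = c(3, 7, 18, 1)
)

filtered <- df[df$Freq > 2, ]
nlevels(filtered$Var1)   # still 4: "rare" survives as an unused level

# droplevels() removes factor levels that no longer occur in the data.
filtered <- droplevels(filtered)
nlevels(filtered$Var1)   # now 3

# Base R bar chart: one bar per remaining word, bar height taken from Freq.
barplot(filtered$Freq, names.arg = as.character(filtered$Var1), las = 2)

# ggplot2 variant (only runs if the package is installed): stat = "identity"
# tells geom_bar() to use Freq as the bar height instead of counting rows.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(filtered, ggplot2::aes(x = Var1, y = Freq)) +
    ggplot2::geom_bar(stat = "identity")
  print(p)
}
```

Whether this fully accounts for the plots above I can't say for sure, but on the toy data the unused level disappears and the bars have the expected heights.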