
I have a dataset similar to this and have created a bar graph with ggplot to show how many times a person says a certain word.

name <- c('Luca', 'Marco','Alberto', 'Luca', 'Marco', 'Luca', 'Alberto', 'Marco')
word <- c('pizza', 'cola', 'pizza','cola','pizza', 'good', 'good', 'chips')
count <- c(3,5,6,4,1,3,6,2)
  
ggplot(df, aes(y=word, x=count, fill=name)) + 
  geom_col()

image

This is the result. However, I want to display only part of the plot, i.e. the two most frequent words. This is a simplification of my real dataset, which has about 30k words, and there I would like to show only the top 20. Thank you all.

GIORIGO
  • Can [this post](https://stackoverflow.com/questions/17374651/find-the-n-most-common-values-in-a-vector) or [this one](https://stackoverflow.com/questions/14800161/select-the-top-n-values-by-group) help? – Rui Barradas Jan 01 '21 at 16:01
  • I'm looking for a command to enter directly when creating the plot, because I have to consider the whole dataset and then zoom in on the first 20 words – GIORIGO Jan 01 '21 at 16:15
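
One way to do the selection directly in the plot call, as the comment above asks, is to pass the whole data frame to `ggplot()` and restrict the discrete axis with `scale_y_discrete(limits = ...)`. The following is only a sketch under two assumptions: that "most frequent" means the largest summed `count` per word, and that the vectors from the question are collected into a data frame called `df`. The names `n`, `totals` and `top_words` are illustrative and do not come from the question.

```r
# Example data from the question, collected into a data frame.
name  <- c('Luca', 'Marco', 'Alberto', 'Luca', 'Marco', 'Luca', 'Alberto', 'Marco')
word  <- c('pizza', 'cola', 'pizza', 'cola', 'pizza', 'good', 'good', 'chips')
count <- c(3, 5, 6, 4, 1, 3, 6, 2)
df <- data.frame(name, word, count)

n <- 2  # number of words to show (20 for the real data)

# Assumption: rank words by their total count, summed over all names.
totals <- tapply(df$count, df$word, sum)
ranked <- names(totals)[order(-totals)]
top_words <- ranked[seq_len(min(n, length(ranked)))]

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # The full data frame goes into ggplot(); the scale limits decide which
  # words are drawn. rev() puts the word with the largest total at the top.
  p <- ggplot(df, aes(y = word, x = count, fill = name)) +
    geom_col() +
    scale_y_discrete(limits = rev(top_words))
  print(p)
}
```

Rows whose word falls outside the limits are dropped at draw time, and ggplot2 will usually emit a warning saying so, which is expected here. The ranking itself still has to be computed beforehand; the scale only controls what is displayed.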

1 Answer


Here is a solution that uses `table` to find the n most frequent words and then plots them.
But first the test data set, since the question does not create the data.frame.

name <- c('Luca', 'Marco','Alberto', 'Luca', 'Marco', 'Luca', 'Alberto', 'Marco')
word <- c('pizza', 'cola', 'pizza','cola','pizza', 'good', 'good', 'chips')
count <- c(3,5,6,4,1,3,6,2)
df <- data.frame(name, word, count)

Now the plotting function `fun`. The default `n = 2` plots the two most frequent words, so `n` only needs to be passed when a different number of words is wanted.

library(ggplot2)

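# X: data frame; col, count, fill: column names given as strings;
# n: number of most frequent words to keep.
fun <- function(X, col, count, fill, n = 2){
  # Tabulate how often each word occurs in column `col`, keep the n most frequent.
  freq <- sort(table(X[[col]]), decreasing = TRUE)[seq_len(n)]
  # Keep only the rows whose word is among those n.
  i <- which(X[[col]] %in% names(freq))
  df_plot <- X[i, , drop = FALSE]
  # .data[[...]] looks up a column whose name is stored in a string.
  g <- ggplot(df_plot, aes(.data[[col]], .data[[count]], fill = .data[[fill]])) + 
    geom_col() +
    labs(x = col, y = count, fill = fill) +
    coord_flip()
  g
}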

fun(df, "word", "count", "name")
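
Note that `table` counts how many rows each word appears in; it does not add up the `count` column. If the words should instead be ranked by their total count, a variant along the following lines might be closer to what is wanted. This is a sketch, not part of the original answer: `top_by_total` is a hypothetical helper name, and ranking by the summed `count` is an assumption about the intent.

```r
# Example data, same values as the test data set above.
df <- data.frame(
  name  = c('Luca', 'Marco', 'Alberto', 'Luca', 'Marco', 'Luca', 'Alberto', 'Marco'),
  word  = c('pizza', 'cola', 'pizza', 'cola', 'pizza', 'good', 'good', 'chips'),
  count = c(3, 5, 6, 4, 1, 3, 6, 2)
)

# Hypothetical helper: keep only the rows whose word is among the n
# largest by summed count (not by number of rows).
top_by_total <- function(X, col, count, n = 2) {
  totals <- tapply(X[[count]], X[[col]], sum)  # total count per word
  ranked <- names(totals)[order(-totals)]      # largest total first
  keep <- head(ranked, n)
  out <- X[X[[col]] %in% keep, , drop = FALSE]
  # Reverse the levels so the largest total ends up at the top
  # of a horizontal bar chart.
  out[[col]] <- factor(out[[col]], levels = rev(keep))
  out
}

df_top <- top_by_total(df, "word", "count", n = 2)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(df_top, aes(y = word, x = count, fill = name)) + geom_col())
}
```

On a data set where some words appear in few rows but with large counts, the two rankings can differ, so it may be worth comparing `sort(table(df$word), decreasing = TRUE)` with the summed totals before plotting.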


Rui Barradas
  • OK, thanks friend. It's better, but I don't understand why it doesn't work on my dataset. I modified your command to fit the characteristics of my dataset, but it doesn't show some words that have a very high frequency. What do you think could be the problem? Thanks for your help – GIORIGO Jan 01 '21 at 18:03
  • @GIORIGO Does the `sort(table(etc))[seq_len]` return the top `n` words you want? – Rui Barradas Jan 01 '21 at 20:18
  • Not all, only some of them. I don't know, I think I'll opt for another representation. Thanks – GIORIGO Jan 02 '21 at 09:11