
I have a variable "Body" whose observations are sentences, and another variable "Postcategory" whose observations are either "Low-quality post" or "High-quality post". I have counted the words in each observation of "Body", and now I want to make a boxplot that shows the median word count of "Body" for both low-quality and high-quality posts.


To count the number of words in each sentence of "Body", I used the following code:

lengths(strsplit(data$Body, '\\S+'))
    
word <- lengths(strsplit(data$Body, '\\S+'))

I assigned the result to the variable "word", then tried to create the boxplot with ggplot2 using the following code:

geom_boxplot(outlier.colour="black", outlier.shape=16,
                 outlier.size=2, notch=FALSE)
    
ggplot(data, aes(x=Postcategory, y=word)) + geom_boxplot()

I know it's wrong, but I can't seem to find a solution that gives me the result I want.

I also made a quick sketch of how I would like the final boxplot to look (the values are not correct):

[image: sketch of the desired boxplot]

Z.Lin
  • Can you post some reproducible data? https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – william3031 Dec 16 '20 at 05:15

1 Answer

Add the word count of each entry as a new column in the data frame, then map that column (here called lengths) to y in aes().

library(ggplot2)

# Generate some data
body = c("Here's a sentence", "Another sentence would go here", "And here yet again another sentence", "But wait, another sentence?", "Why yes", "It's another sentence", "I hate sentences", "They're the worst")

group = rep(c("Low Quality Post", "High Quality Post"), 4)

# Combine into a data frame; stringsAsFactors = FALSE keeps body as character,
# which strsplit() requires (factors are the default in R versions before 4.0)
df = data.frame(body, group, stringsAsFactors = FALSE)

# Add a new column holding the number of words in each entry of body
# (split on runs of whitespace, '\\s+', so each piece is one word)
df$lengths = lengths(strsplit(df$body, '\\s+'))

ggplot(data = df, aes(x = group, y = lengths)) +
  geom_boxplot(outlier.shape = 16, outlier.size = 2, notch = FALSE) + # notch = FALSE is the default, so it is not needed here
  labs(y = "Number of words", x = "")
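
Since the goal is to read the median word count off each box, it can help to compute the medians directly and compare them with the plot. The following is only a minimal sketch using the same toy sentences as above; the name medians is made up for illustration, and trimws() is added on the assumption that real data may contain leading or trailing spaces.

```r
# Toy data, same sentences as in the example above (not the asker's real data)
body <- c("Here's a sentence", "Another sentence would go here",
          "And here yet again another sentence", "But wait, another sentence?",
          "Why yes", "It's another sentence", "I hate sentences",
          "They're the worst")
group <- rep(c("Low Quality Post", "High Quality Post"), 4)
df <- data.frame(body, group, stringsAsFactors = FALSE)

# Word count per entry: strip outer whitespace, then split on whitespace runs
df$lengths <- lengths(strsplit(trimws(df$body), "\\s+"))

# Median word count per group (hypothetical helper variable for checking)
medians <- tapply(df$lengths, df$group, median)
print(medians)
```

The thick horizontal line inside each box drawn by geom_boxplot() should sit at these values, so this is a quick way to confirm the plot shows what was intended.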
Jonni