I am using the tidytext package in R to do n-gram analysis.
Since I analyze tweets, I would like to preserve @ and # so that I can capture mentions, retweets, and hashtags. However, the unnest_tokens function automatically removes all punctuation and converts the text to lowercase.
I found that unnest_tokens has an option to tokenize with a regular expression via token='regex', so I can customize the way it cleans the text. However, that only works for unigram analysis. It does not work for n-grams, because there I need to set token='ngrams' instead.
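To make the problem concrete, here is a minimal sketch of the two calls I am comparing. The sample tweet, the column names, and the whitespace pattern "\\s+" are placeholders made up for illustration, not my real data or exact code.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row example; my real data has one tweet per row
tweets <- tibble(text = "RT @user: Loving #rstats today")

# Unigrams: splitting on whitespace with token = "regex" keeps @ and #
# attached to the tokens
unigrams <- tweets %>%
  unnest_tokens(word, text, token = "regex", pattern = "\\s+")

# Bigrams: token = "ngrams" takes the place of token = "regex",
# so the custom pattern above cannot be supplied here
bigrams <- tweets %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

print(unigrams)
print(bigrams)
```

Because the token argument accepts only one value, I cannot combine the regex tokenizer with the n-gram tokenizer in a single call.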
Is there any way to prevent unnest_tokens from converting text into lowercase in n-gram analysis?