
I am using the tidytext package in R to do n-gram analysis.

Since I analyze tweets, I would like to preserve @ and # to capture mentions, retweets, and hashtags. However, the unnest_tokens function automatically removes all punctuation and converts the text to lowercase.

I found that unnest_tokens has an option to use regular expressions via token='regex', so I can customize the way it cleans the text. But that only works for unigram analysis; it doesn't help with n-grams, because I need to set token='ngrams' to do n-gram analysis.

Is there any way to prevent unnest_tokens from converting the text to lowercase in n-gram analysis?

JungHwan Yang
  • N.B. `unnest_tokens` makes use of [tokenizers](https://github.com/ropensci/tokenizers) to do its heavy lifting. And in said project there is [tokenize_tweets.R](https://github.com/ropensci/tokenizers/blob/7f6e06071f143b3962cc5d207f07472c5d97fd9a/R/tokenize_tweets.R) – Shawn Mehan Jun 12 '17 at 23:29
  • Looking at the source, `tokenize_ngrams <- function(x, lowercase = TRUE, n = 3L, n_min = n, stopwords = character(), ngram_delim = " ", simplify = FALSE)`. There is certainly an option to not lowercase in `tokenize_ngrams`. Worst case is to patch. – Shawn Mehan Jun 12 '17 at 23:48
  • Thanks for the comments. I think `unnest_tokens` uses `tokenize_words` to clean text: `tokenize_words <- function(x, lowercase = TRUE, stopwords = NULL, strip_punct = TRUE, strip_numeric = FALSE, simplify = FALSE) {...`. I changed `strip_punct = FALSE` and ran it again, but it still doesn't work. – JungHwan Yang Jun 13 '17 at 04:55
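Following up on the comments: since `tokenize_ngrams` exposes a `lowercase` argument, one workaround sketch is to skip `unnest_tokens` entirely, call the tokenizers function directly, and unnest the resulting list column yourself. The data frame below is hypothetical and purely for illustration; note that this preserves case but not punctuation, since `tokenize_ngrams` has no `strip_punct` argument.

library(dplyr)
library(tidyr)
library(tokenizers)

# Hypothetical tweets, purely for illustration
tweets <- data.frame(
  id = 1:2,
  text = c("RT @JungHwanYang: Try #rstats for NLP",
           "N-gram Analysis KEEPS its Case here"),
  stringsAsFactors = FALSE
)

# tokenize_ngrams() returns a list column of bigrams;
# lowercase = FALSE keeps the original casing
tweets %>%
  mutate(bigram = tokenize_ngrams(text, n = 2, lowercase = FALSE)) %>%
  unnest(bigram)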

2 Answers


Arguments for tokenize_words are available within the unnest_tokens function call, so you can pass strip_punct = FALSE directly as an argument to unnest_tokens.

Example:

library(tidytext)

txt <- data.frame(text = "Arguments for `tokenize_words` are available within the `unnest_tokens` function call. So you can use `strip_punct = FALSE` directly as an argument for `unnest_tokens`. ", stringsAsFactors = FALSE)
unnest_tokens(txt, palabras, "text", strip_punct = FALSE)

 palabras
 1         arguments
 1.1             for
 1.2               `
 1.3  tokenize_words
 1.4               `
 1.5             are
 1.6       available
 1.7          within
 1.8             the
 1.9               `
 1.10  unnest_tokens
 1.11              `
 1.12       function
 1.13           call
 1.14              .
 1.15             so
 #And some more, but you get the point. 

Also available: lowercase = FALSE and strip_numeric = TRUE to change the corresponding default behavior.
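For instance, a sketch combining the two, reusing the txt data frame from above (to_lower = FALSE is unnest_tokens's own documented switch for case, per the answer below):

# Keep both punctuation and the original casing
unnest_tokens(txt, palabras, "text",
              strip_punct = FALSE, to_lower = FALSE)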

mpaladino

In tidytext version 0.1.9 you now have the option to tokenize tweets, and if you don't want lowercasing, use the option to_lower = FALSE:

unnest_tokens(tweet_df, word, tweet_column, token = "tweets", to_lower = FALSE)
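
A quick illustration with a hypothetical one-row data frame (assumes tidytext >= 0.1.9 so that token = "tweets" is available):

library(tidytext)

# Hypothetical single-tweet data frame
tweet_df <- data.frame(
  tweet_column = "RT @JungHwanYang: I like #rstats",
  stringsAsFactors = FALSE
)

# token = "tweets" keeps @mentions and #hashtags intact,
# and to_lower = FALSE preserves the original casing
unnest_tokens(tweet_df, word, tweet_column, token = "tweets", to_lower = FALSE)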
phiver