
I am using the tidytext package in R to do n-gram analysis.

Since I analyze tweets, I would like to preserve @ and # to capture mentions, retweets, and hashtags. However, the unnest_tokens function automatically removes all punctuation and converts the text to lowercase.

I found that unnest_tokens has an option to use regular expressions via token='regex', so I can customize the way it cleans the text. But that only works for unigram analysis; it doesn't help with n-grams, because I need to set token='ngrams' to do n-gram analysis.

Is there any way to prevent unnest_tokens from converting the text to lowercase in n-gram analysis?

JungHwan Yang
  • N.B. `unnest_tokens` makes use of [tokenizers](https://github.com/ropensci/tokenizers) to do its heavy lifting. And in said project there is [tokenize_tweets.R](https://github.com/ropensci/tokenizers/blob/7f6e06071f143b3962cc5d207f07472c5d97fd9a/R/tokenize_tweets.R) – Shawn Mehan Jun 12 '17 at 23:29
  • Looking at the source, `tokenize_ngrams <- function(x, lowercase = TRUE, n = 3L, n_min = n, stopwords = character(), ngram_delim = " ", simplify = FALSE)`. There is certainly an option to not lowercase in `tokenize_ngrams`. Worst case is to patch. – Shawn Mehan Jun 12 '17 at 23:48
  • Thanks for the comments. I think `unnest_tokens` uses `tokenize_words` to clean text: `tokenize_words <- function(x, lowercase = TRUE, stopwords = NULL, strip_punct = TRUE, strip_numeric = FALSE, simplify = FALSE) {...`. I changed `strip_punct = FALSE` and ran it again, but it still doesn't work. – JungHwan Yang Jun 13 '17 at 04:55
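Following up on the comments: since `tokenize_ngrams` exposes a `lowercase` argument, one workaround sketch is to skip `unnest_tokens` entirely, call the tokenizers function directly, and unnest the resulting list column yourself. The data frame below is hypothetical and purely for illustration; note that this preserves case but not punctuation, since `tokenize_ngrams` has no `strip_punct` argument.

library(dplyr)
library(tidyr)
library(tokenizers)

# Hypothetical tweets, purely for illustration
tweets <- data.frame(
  id = 1:2,
  text = c("RT @JungHwanYang: Try #rstats for NLP",
           "N-gram Analysis KEEPS its Case here"),
  stringsAsFactors = FALSE
)

# tokenize_ngrams() returns a list column of bigrams;
# lowercase = FALSE keeps the original casing
tweets %>%
  mutate(bigram = tokenize_ngrams(text, n = 2, lowercase = FALSE)) %>%
  unnest(bigram)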

2 Answers


Arguments for tokenize_words are available within the unnest_tokens function call, so you can pass strip_punct = FALSE directly as an argument to unnest_tokens.

Example:

library(tidytext)

txt <- data.frame(text = "Arguments for `tokenize_words` are available within the `unnest_tokens` function call. So you can use `strip_punct = FALSE` directly as an argument for `unnest_tokens`. ", stringsAsFactors = FALSE)
unnest_tokens(txt, palabras, "text", strip_punct = FALSE)

 palabras
 1         arguments
 1.1             for
 1.2               `
 1.3  tokenize_words
 1.4               `
 1.5             are
 1.6       available
 1.7          within
 1.8             the
 1.9               `
 1.10  unnest_tokens
 1.11              `
 1.12       function
 1.13           call
 1.14              .
 1.15             so
 #And some more, but you get the point. 

Also available: lowercase = FALSE and strip_numeric = TRUE to change the corresponding default behavior.
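For instance, a sketch combining the two, reusing the txt data frame from above (to_lower = FALSE is unnest_tokens's own documented switch for case, per the answer below):

# Keep both punctuation and the original casing
unnest_tokens(txt, palabras, "text",
              strip_punct = FALSE, to_lower = FALSE)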

mpaladino

In tidytext version 0.1.9 you now have the option to tokenize tweets, and if you don't want lowercasing, use the option to_lower = FALSE:

unnest_tokens(tweet_df, word, tweet_column, token = "tweets", to_lower = FALSE)
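
A quick illustration with a hypothetical one-row data frame (assumes tidytext >= 0.1.9 so that token = "tweets" is available):

library(tidytext)

# Hypothetical single-tweet data frame
tweet_df <- data.frame(
  tweet_column = "RT @JungHwanYang: I like #rstats",
  stringsAsFactors = FALSE
)

# token = "tweets" keeps @mentions and #hashtags intact,
# and to_lower = FALSE preserves the original casing
unnest_tokens(tweet_df, word, tweet_column, token = "tweets", to_lower = FALSE)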
phiver