Using something like strsplit might not do exactly what you want because of case and punctuation. The tokenizers package is what tidytext uses under the hood.
library(tokenizers)
text <- "this is some random TEXT is string 45 things and numbers and text!"
table(tokenize_words(text))
45 and is numbers random some string text things this
1 2 2 1 1 1 1 2 1 1
Notice the difference if you just split on spaces.
table(strsplit(text, " "))
45 and is numbers random some string TEXT text! things this
1 2 2 1 1 1 1 1 1 1 1
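If you want to stay in base R rather than add a dependency, you can approximate what tokenize_words does by lower-casing and stripping punctuation before splitting. This is only a rough sketch; tokenize_words handles many more edge cases (abbreviations, Unicode, etc.) than this simple regex does.

```r
# Normalize first: lower-case and drop punctuation, then split on
# runs of whitespace. A crude approximation of tokenize_words().
clean <- tolower(gsub("[[:punct:]]", "", text))
table(strsplit(clean, "\\s+"))
```

For this particular string the counts now match the tokenizers output, but for messier text the regex approach will diverge quickly.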
If you go this route, you might want to just jump completely to tidytext.
library(dplyr)
library(tidytext)
library(tibble)
df <- tibble(string = text)
df %>%
unnest_tokens(word, string) %>%
count(word)
# A tibble: 10 x 2
word n
<chr> <int>
1 45 1
2 and 2
3 is 2
4 numbers 1
5 random 1
6 some 1
7 string 1
8 text 2
9 things 1
10 this 1
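Once you are in the tidytext workflow, small refinements come almost for free. For example, count takes a sort argument, and tidytext ships a stop_words data frame you can anti-join away; the snippet below assumes you have already built df as above.

```r
# Most frequent words first, with common English stop words removed.
# stop_words is a data frame exported by tidytext.
df %>%
  unnest_tokens(word, string) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```

That removal step is usually the first thing you want before any frequency analysis, which is a big part of why jumping straight to tidytext pays off.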