
I have found many implementations of bag of words, but I still cannot find an easy one for a single, long string. I would like my result to look like this:

word1:     56
word2:     31
wordX:     7

I have a problem with the qdap library because it does not work on my R installation...

heisenberg7584
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Oct 21 '19 at 18:38
  • Just use `table` on the split words – akrun Oct 21 '19 at 18:39

1 Answer

Using something like strsplit might not do exactly what you want, because it leaves case and punctuation untouched. The tokenizers package, which is what tidytext uses, lowercases the words and strips punctuation by default.

library(tokenizers)

text <- "this is some random TEXT is string 45 things and numbers and text!"

table(tokenize_words(text))

     45     and      is numbers  random    some  string    text  things    this 
      1       2       2       1       1       1       1       2       1       1 

Notice the difference if you just split on spaces: TEXT and text! are counted as two separate words.

table(strsplit(text, " "))

     45     and      is numbers  random    some  string    TEXT   text!  things    this 
      1       2       2       1       1       1       1       1       1       1       1
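If installing extra packages is a problem (as with qdap in the question), roughly the same normalisation can be sketched in base R. This is only an approximation of what tokenize_words does, not its actual implementation: it assumes that lowercasing and splitting on runs of non-alphanumeric characters is good enough for your text. The variable names words and counts are just illustrative.

```r
text <- "this is some random TEXT is string 45 things and numbers and text!"

# Lowercase first so that "TEXT" and "text" are counted together,
# then split on any run of characters that are not letters or digits.
words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]

# Drop empty strings, which appear when the text starts with punctuation.
words <- words[nzchar(words)]

# Count how often each word occurs.
counts <- table(words)
counts
```

Note that this simple split also breaks contractions such as "don't" into two pieces, which is one reason a proper tokenizer is usually the better choice.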

If you go this route, you might want to just jump completely to tidytext.

library(dplyr)
library(tidytext)
library(tibble)

df <- tibble(string = text)

df %>%
  unnest_tokens(word, string) %>%
  count(word)

# A tibble: 10 x 2
   word        n
   <chr>   <int>
 1 45          1
 2 and         2
 3 is          2
 4 numbers     1
 5 random      1
 6 some        1
 7 string      1
 8 text        2
 9 things      1
10 this        1
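The question asks for the most frequent words first. A minimal sketch of that, assuming the same text string as above, is to pass sort = TRUE to count so the rows come back ordered by descending n. The name word_counts is only illustrative.

```r
library(dplyr)
library(tidytext)
library(tibble)

text <- "this is some random TEXT is string 45 things and numbers and text!"

# Tokenize into one lowercase word per row, then count each word,
# ordering the result from most to least frequent.
word_counts <- tibble(string = text) %>%
  unnest_tokens(word, string) %>%
  count(word, sort = TRUE)

word_counts
```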