I need help tidying data for topic modelling

Question

I am fairly new to R and have run into a problem with text pre-processing and cleaning before topic modelling. I am trying to tokenise the text so that each document becomes a list of words (punctuation is removed as part of this process). The column is called text.

tokens <- text_input %>% unnest_tokens(words, text)

but I keep getting the error message

Error in UseMethod("unnest_tokens_") : 
  no applicable method for 'unnest_tokens_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"

My text data is currently

text     <chr> "mr smiths tenant called for support "
...

I need each document turned into a list of words so that spell checking etc. can be completed, followed by topic modelling.

Code already tried
The basic data frame is called input and, after cleaning, text_input.
Database: spark_connection

$ lines    <chr> "  mr smiths tenant called for support    "

# set the name of the column with your source text

text_col <- "lines"

## Basic cleaning

text_input <- input %>%
  filter(!is.na(!!as.name(text_col))) %>%
  mutate(text = trimws(!!as.name(text_col))) %>%
  mutate(text = tolower(text))

## Tokenise Text
## Turns each document into a list of words; punctuation is removed as part of this process

tokens <- text_input %>% unnest_tokens(words, text)

Error in UseMethod("unnest_tokens_") : 
  no applicable method for 'unnest_tokens_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"
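
The class in the error message (tbl_spark) suggests that unnest_tokens() is being handed a remote Spark table, for which it has no method. One possible workaround, sketched here under the assumption that the cleaned text fits in local memory, is to call collect() before tokenising. In this sketch a small local tibble stands in for the Spark table so that it runs without a Spark connection; the second (NA) row is made up purely for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the Spark table `input`: a local tibble with the
# same `lines` column. The NA row is invented to exercise the filter step.
input <- tibble(lines = c("  mr smiths tenant called for support    ", NA))

# set the name of the column with your source text
text_col <- "lines"

# Basic cleaning, same steps as in the question
text_input <- input %>%
  filter(!is.na(!!as.name(text_col))) %>%
  mutate(text = trimws(!!as.name(text_col))) %>%
  mutate(text = tolower(text))

# collect() returns a local tibble (assumption: the data fits in memory),
# so unnest_tokens() has a data frame to dispatch on.
tokens <- text_input %>%
  collect() %>%
  unnest_tokens(words, text)

print(tokens)
```

On a local tibble collect() simply returns its input, so the pipeline is the same whether input is local or a Spark table. If the data is too large to collect, the tokenising would have to happen on the Spark side instead, which this sketch does not cover.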

Hello, Can you please share the whole code, please look at: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Chelmy88, Aug 26 '19 at 14:26
try `tokens <- text_input %>% unnest_tokens(word, text, token = "text_input")` — Manuel F, Aug 26 '19 at 14:28
this is what I get Basic Df called input and then text_input Database: spark_connection $ lines " mr smiths tenant called for support $ text "mr smiths tenant called for support # set the name of the column with your source text text_col <- "lines" ## Basic cleaning ```{r} text_input <- input %>% filter(!is.na(!!as.name(text_col))) %>% mutate(text = trimws(!!as.name(text_col)))%>% mutate(text = tolower(text)) — dazedandconfused, Aug 26 '19 at 14:38
ctd ## Tokenise Text Turns each document into a list of words- punctuation is removed as part of this process ```{r} tokens <- text_input %>% unnest_tokens(words, text) ``` — dazedandconfused, Aug 26 '19 at 14:38
It's much easier to follow your code if you [edit] the question instead of putting it unformatted in comments — camille, Aug 26 '19 at 14:39

0 Answers