I am using ft_tokenizer for spark dataframe in R. and it tokenizes each word and changes it to all lower, I want the words to be in the format they originally are.
text_data <- data_frame(
x = c("This IS a sentence", "So is this")
)
tokenized <- text_data_tbl %>%
ft_tokenizer("x", "word")
tokenized$word
## [[1]]
## [[1]][[1]]
## [1] "this"
##
## [[1]][[2]]
## [1] "is"
##
## [[1]][[3]]
## [1] "a"
I want:
tokenized$word
## [[1]]
## [[1]][[1]]
## [1] "This"
##
## [[1]][[2]]
## [1] "IS"
##
## [[1]][[3]]
## [1] "a"