ft_tokenizer tokenizes words to lower, I want it to be as they are

Question

I am using ft_tokenizer on a Spark data frame in R. It tokenizes each word and converts it to lowercase, but I want the words to keep their original case.

text_data <- data_frame(
  x = c("This IS a sentence", "So is this")
)

tokenized <- text_data_tbl %>%
  ft_tokenizer("x", "word")


tokenized$word
## [[1]]
## [[1]][[1]]
## [1] "this"
## 
## [[1]][[2]]
## [1] "is"
##
## [[1]][[3]]
## [1] "a"

I want:

tokenized$word
## [[1]]
## [[1]][[1]]
## [1] "This"
## 
## [[1]][[2]]
## [1] "IS"
##
## [[1]][[3]]
## [1] "a"

Why do you think capitalization is important? I am just curious what your idea is for NLP. — Sang won kim, Aug 26 '19 at 06:53
I want to match the tokenized words against a list of keywords, for which I need exact matches. I.e., if the keyword list contains "This", I want an exact match to it, which is not possible right now because tokenizing changes "This" to "this". — Vasudha Jain, Aug 26 '19 at 07:05
Possible duplicate of [First letter to upper case](https://stackoverflow.com/questions/18509527/first-letter-to-upper-case) — heck1, Aug 26 '19 at 07:06
take a look at this: https://stackoverflow.com/questions/18509527/first-letter-to-upper-case/18509816 — heck1, Aug 26 '19 at 07:06
No, I don't want to change each first letter to upper case. I just want it to tokenize the word as it is. I will edit my query to be more precise. — Vasudha Jain, Aug 26 '19 at 07:08
Simple is best. If you want to split the text into a list and then match some keywords, you don't need to use `ft_tokenizer`. — Sang won kim, Aug 26 '19 at 07:54
Are `text_data` and `text_data_tbl` supposed to be the same thing? — camille, Aug 26 '19 at 13:02

score 1 · Answer 1 · answered Aug 26 '19 at 07:15

I guess it is not possible with ft_tokenizer. From ?ft_tokenizer:

A tokenizer that converts the input string to lowercase and then splits it by white spaces.

So its basic behavior is to convert the string to lowercase and split on whitespace, which I guess cannot be changed. Consider doing

text_data$new_x <- lapply(strsplit(text_data$x, "\\s+"), as.list)

which gives the expected output, and you can continue your process from here.

text_data$new_x
#[[1]]
#[[1]][[1]]
#[1] "This"

#[[1]][[2]]
#[1] "IS"

#[[1]][[3]]
#[1] "a"

#[[1]][[4]]
#[1] "sentence"


#[[2]]
#[[2]][[1]]
#[1] "So"

#[[2]][[2]]
#[1] "is"

#[[2]][[3]]
#[1] "this"
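
Since the goal stated in the comments is exact, case-sensitive keyword matching, the case-preserving split above can be followed by a lookup with `%in%`. This is a minimal sketch on a regular R data frame, not a Spark one; the `keywords` vector and the `has_keyword` column are hypothetical names introduced only for illustration.

```r
# Local (non-Spark) data frame with the example sentences from the question
text_data <- data.frame(
  x = c("This IS a sentence", "So is this"),
  stringsAsFactors = FALSE
)

# Hypothetical keyword list; matching is case-sensitive
keywords <- c("This", "IS")

# Split on whitespace without changing case
tokens <- strsplit(text_data$x, "\\s+")

# Keep only the tokens that match a keyword exactly
matches <- lapply(tokens, function(tok) tok[tok %in% keywords])

# Flag rows that contain at least one exact keyword match
text_data$has_keyword <- vapply(matches, function(m) length(m) > 0, logical(1))
```

Here the second sentence is not flagged, because "is" and "this" differ in case from the keywords "IS" and "This".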

Ronak Shah

Is it possible to apply the same function to a Spark data frame? text_data is a Spark data frame. – Vasudha Jain Aug 26 '19 at 07:53
@VasudhaJain Should be possible but I am not exactly sure how as I do not have a spark dataframe to test this. Is it possible for you to apply this on a dataframe instead using `df <- collect(text_data)` or `df <- as.data.frame(text_data)` ? – Ronak Shah Aug 26 '19 at 08:02
My main aim is to make this work on a Spark data frame. Even if it works on a regular R data frame, it won't be of any help to me. – Vasudha Jain Aug 26 '19 at 08:58
@VasudhaJain I see. I was thinking if you could do this part in regular dataframe, then convert it back to spark dataframe and continue from there. Maybe somebody else might know. – Ronak Shah Aug 26 '19 at 09:04
If I do that in a regular data frame, a list will be created, and it's again a challenge to convert the list into a Spark data frame. – Vasudha Jain Aug 26 '19 at 09:23
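
For staying inside Spark, one option that may be worth trying is sparklyr's `ft_regex_tokenizer`, which takes a `to_lower_case` argument. The sketch below is not from the answer above and is untested against the asker's setup; it assumes a local Spark installation reachable via `spark_connect(master = "local")`, and the table name `"text_data"` is chosen only for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available
sc <- spark_connect(master = "local")

# Copy the example sentences from the question into Spark
text_data <- tibble(x = c("This IS a sentence", "So is this"))
text_data_tbl <- copy_to(sc, text_data, "text_data", overwrite = TRUE)

# Split on whitespace but keep the original case
tokenized <- text_data_tbl %>%
  ft_regex_tokenizer(
    input_col = "x",
    output_col = "word",
    pattern = "\\s+",
    to_lower_case = FALSE
  )

# Bring the result back to R only to inspect the tokens
result <- tokenized %>% collect()
result$word

spark_disconnect(sc)
```

With the default `gaps = TRUE`, the pattern is treated as the separator between tokens, so `"\\s+"` splits on runs of whitespace in the same way as the `strsplit` call in the answer.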
