I am using `ft_tokenizer` on a Spark data frame in R. It tokenizes each word and converts it to lowercase, but I want the words to keep their original case.

text_data <- data_frame(
  x = c("This IS a sentence", "So is this")
)

tokenized <- text_data_tbl %>%
  ft_tokenizer("x", "word")


tokenized$word
## [[1]]
## [[1]][[1]]
## [1] "this"
## 
## [[1]][[2]]
## [1] "is"
##
## [[1]][[3]]
## [1] "a"

I want:

tokenized$word
## [[1]]
## [[1]][[1]]
## [1] "This"
## 
## [[1]][[2]]
## [1] "IS"
##
## [[1]][[3]]
## [1] "a"
Vasudha Jain
  • Why do you think capitalization is important? I am just curious what your NLP use case is. – Sang won kim Aug 26 '19 at 06:53
  • I want to match the tokenized words against a list of keywords, and I need exact matches. That is, if the keyword list contains "This", I want an exact match on it, which is not possible right now because tokenizing changes "This" to "this". – Vasudha Jain Aug 26 '19 at 07:05
  • Possible duplicate of [First letter to upper case](https://stackoverflow.com/questions/18509527/first-letter-to-upper-case) – heck1 Aug 26 '19 at 07:06
  • take a look at this: https://stackoverflow.com/questions/18509527/first-letter-to-upper-case/18509816 – heck1 Aug 26 '19 at 07:06
  • No, I don't want to change each first letter to upper case. I just want it to tokenize the word as it is. I will edit my query to be more precise. – Vasudha Jain Aug 26 '19 at 07:08
  • Simple is best. If you just want to split the text into a list and then match some keywords, you don't need `ft_tokenizer`. – Sang won kim Aug 26 '19 at 07:54
  • Are `text_data` and `text_data_tbl` supposed to be the same thing? – camille Aug 26 '19 at 13:02
  • Yes, but I found a solution. – Vasudha Jain Aug 27 '19 at 04:53

1 Answer

I don't think this is possible with `ft_tokenizer`. From `?ft_tokenizer`:

A tokenizer that converts the input string to lowercase and then splits it by white spaces.

So its basic behaviour is to convert the string to lowercase and then split on whitespace, and I don't think that can be changed. Consider doing

text_data$new_x <- lapply(strsplit(text_data$x, "\\s+"), as.list)

which gives the expected output, and you can continue the rest of your process from here.

text_data$new_x
#[[1]]
#[[1]][[1]]
#[1] "This"

#[[1]][[2]]
#[1] "IS"

#[[1]][[3]]
#[1] "a"

#[[1]][[4]]
#[1] "sentence"


#[[2]]
#[[2]][[1]]
#[1] "So"

#[[2]][[2]]
#[1] "is"

#[[2]][[3]]
#[1] "this"
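
To make the difference concrete, here is a minimal base R sketch that imitates the documented behaviour (lowercase, then split on whitespace) next to a split-only version, and then does the exact keyword match described in the comments. It runs on a plain character vector, not a Spark data frame, and the `keywords` vector is a made-up example.

```r
# Example sentences from the question
x <- c("This IS a sentence", "So is this")

# Imitation of the documented ft_tokenizer behaviour:
# convert to lowercase first, then split on whitespace
lowered <- strsplit(tolower(x), "\\s+")

# Case-preserving alternative: split on whitespace only
preserved <- strsplit(x, "\\s+")

# Hypothetical keyword list (an assumption, not from the question)
keywords <- c("This", "IS")

# Keep only the tokens that match a keyword exactly (case-sensitive)
match_keywords <- function(tokens) tokens[tokens %in% keywords]

matched_preserved <- lapply(preserved, match_keywords)
matched_lowered   <- lapply(lowered, match_keywords)

print(matched_preserved)
print(matched_lowered)
```

With the lowercased tokens nothing matches `"This"` or `"IS"`, which is the problem described in the question. For a Spark data frame, sparklyr also provides `ft_regex_tokenizer()`, which has a `to_lower_case` argument; calling it with `pattern = "\\s+"` and `to_lower_case = FALSE` may be worth trying, though I have not tested that here.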
Ronak Shah
  • Is it possible to apply the same function to a Spark data frame? `text_data` is a Spark data frame. – Vasudha Jain Aug 26 '19 at 07:53
  • @VasudhaJain Should be possible but I am not exactly sure how as I do not have a spark dataframe to test this. Is it possible for you to apply this on a dataframe instead using `df <- collect(text_data)` or `df <- as.data.frame(text_data)` ? – Ronak Shah Aug 26 '19 at 08:02
  • My main aim is to make this work on a Spark data frame. Even if it works on a regular R data frame, it won't be of any help to me. – Vasudha Jain Aug 26 '19 at 08:58
  • @VasudhaJain I see. I was thinking if you could do this part in regular dataframe, then convert it back to spark dataframe and continue from there. Maybe somebody else might know. – Ronak Shah Aug 26 '19 at 09:04
  • If I do that on a regular data frame, a list column is created, and converting that list back into a Spark data frame is a challenge in itself. – Vasudha Jain Aug 26 '19 at 09:23