I need to build a tf-idf matrix/dataframe with sparklyr that uses the terms/words as column names instead of indices. I went with ft_count_vectorizer because it can store the vocabulary. After computing the tf-idf, however, I am stuck: I am unable to map the terms to their tf-idf values. Any help would be highly appreciated. Here is what I tried:
library(sparklyr)
# sc is the existing Spark connection; cleantext is a Spark table with a Summary column
tf_idf <- cleantext %>%
  ft_tokenizer(input_col = "Summary", output_col = "tokenized") %>%
  ft_stop_words_remover(input_col = "tokenized", output_col = "clean_words",
                        stop_words = ml_default_stop_words(sc, language = "english")) %>%
  ft_count_vectorizer(input_col = "clean_words", output_col = "tffeatures") %>%
  ft_idf(input_col = "tffeatures", output_col = "tfidffeatures")
tf_idf is a tbl_spark that includes both the clean_words column (the terms) and the tfidffeatures column. Both columns are stored as lists. I need to pass the tf-idf features on as input, with the terms as the column headings. What is the best way to do this?
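To make the target shape concrete, here is a minimal base-R sketch of the layout I am after: one row per document and one column per term. The terms and the numbers are made up purely for illustration, and the sketch assumes the tf-idf values are already available in R as plain numeric vectors in vocabulary order, which is exactly the step I have not worked out.

```r
# Hypothetical vocabulary and tf-idf values, invented to show the layout only.
vocabulary <- c("spark", "text", "matrix")

# One numeric vector per document, assumed to be in vocabulary order.
tfidf_list <- list(
  c(0.0, 1.2, 0.4),
  c(0.7, 0.0, 0.4)
)

# Bind the per-document vectors into rows and name the columns by term.
tfidf_matrix <- do.call(rbind, tfidf_list)
colnames(tfidf_matrix) <- vocabulary
tfidf_df <- as.data.frame(tfidf_matrix)

print(tfidf_df)
```

The real table would have one column per vocabulary entry produced by ft_count_vectorizer rather than three hand-picked terms.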