I am able to load data from a MongoDB collection into Spark DataFrames, using the MongoDB Spark Connector.
I now want to extract the words from the text fields of the DataFrame in order to build a word dictionary that maps words to documents.
So, basically, each extracted word should be the key, and the value should be the docId of the document it came from.
I am not sure how to parse and extract the words from a text column in the DataFrame so that I can map them to the corresponding documents.
After mapping, I also want to reduce the pairs so that each word is a key whose value is the list of documents that contain that word.
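To make the desired result concrete, here is a minimal plain-Python sketch of the mapping I am after. It is not Spark code: the sample rows, the field name "text", and the helper extract_words are made up for illustration, and the tokenization rule (lowercase, alphanumeric runs only) is just an assumption.

```python
import re
from collections import defaultdict

# Hypothetical sample rows standing in for the DataFrame;
# "docId" and "text" are assumed field names.
docs = [
    {"docId": 1, "text": "Spark loads data from MongoDB"},
    {"docId": 2, "text": "MongoDB stores data as documents"},
]

def extract_words(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Map step: emit one (word, docId) pair per word occurrence.
pairs = [
    (word, doc["docId"])
    for doc in docs
    for word in extract_words(doc["text"])
]

# Reduce step: group the pairs by word, collecting the distinct docIds.
index = defaultdict(set)
for word, doc_id in pairs:
    index[word].add(doc_id)

# Final dictionary: word -> sorted list of docIds containing that word.
word_to_docs = {word: sorted(ids) for word, ids in index.items()}

print(word_to_docs["data"])   # [1, 2]
print(word_to_docs["spark"])  # [1]
```

I assume the DataFrame version would look something like split and explode on the text column, followed by a groupBy on the word with collect_set on docId, but I have not been able to confirm that this is the right approach.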
Can someone help me with the approach/code to extract words from text columns in a DataFrame?