Extract words from text columns in a DataFrame to create a dictionary of words to documents

Question

I am able to load data from a MongoDB collection into Spark DataFrames, using the MongoDB Spark connector.

I now want to extract the words from the textual fields in the DataFrame in order to create a word dictionary that maps words to documents.

So, basically, the extracted word should be the key and the value would be the docId of the document.

I am not sure how to parse and extract words from a textual column in a DataFrame so that I can map them to the corresponding documents.

After mapping, I also want to reduce them so that I have the word as key and value as the list of documents which contain the word.
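
To make the goal concrete, here is a minimal plain-Python sketch of the map and reduce steps I have in mind. It is not Spark code; the sample rows, the field names docId and text, and the helper names extract_words and build_word_index are placeholders for illustration only.

```python
import re
from collections import defaultdict

# Hypothetical sample rows; "docId" and "text" are placeholder field names.
documents = [
    {"docId": "d1", "text": "Spark loads data from MongoDB"},
    {"docId": "d2", "text": "MongoDB stores the data"},
]

def extract_words(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_word_index(docs):
    """Map each word to the sorted list of docIds whose text contains it."""
    index = defaultdict(set)
    for doc in docs:
        # "Map" step: every word in the text is paired with the docId.
        for word in extract_words(doc["text"]):
            # "Reduce" step: collect all docIds seen for the same word.
            index[word].add(doc["docId"])
    # Sort the ids so the output is deterministic.
    return {word: sorted(ids) for word, ids in index.items()}

word_index = build_word_index(documents)
print(word_index["data"])   # ['d1', 'd2']
print(word_index["spark"])  # ['d1']
```

What I am looking for is the DataFrame equivalent of this.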

Can someone help me with the approach/code to extract words from textual columns in a DataFrame?

score 0 · Answer 1 · edited May 23 '17 at 12:26

Basically, what you are describing is:

1) A document collection
2) A words collection with a mapping to the documents that contain each word

This approach is not efficient: even with only 1,000 documents, the total number of words may be 10,000 or more. The limitations are:

1) You need a record for every word with its document mapping, which is a lot of data.
2) If a document is updated, you need to remove the mappings for words that were deleted and add mappings for the newly added words.
3) If only 30-40 words are searched frequently, you are unnecessarily storing all the other words.

Instead, keep your documents simple. Add a text index to the document's content field on which you need full-text search, and query it with the $text operator.

If your application deals specifically with text search only, go for Elasticsearch instead of MongoDB. Check this answer, which I wrote some time back: MongoDB: Text search (exact match) using variable

Elasticsearch is built on the Lucene engine, which is extremely efficient for text searches.

answered May 08 '17 at 11:38

Mihir Bhende


Thanks for your response. My requirement is to use the populated collection to provide auto-suggestions to the user, similar to Google Suggest, whenever the user types. With the MongoDB $text operator, I am able to search for a specific text in a document, but I am not able to provide the list of words to the user. Due to an enterprise decision taken in my organization, we cannot use Elasticsearch and have to rely on MongoDB's capabilities to accomplish this requirement. – Jbaur May 08 '17 at 12:14
So if I type pley, it will suggest play? Or do you also want to return the titles of documents containing play, player, playing? Or do you also want MongoDB to look into the content of the document, not just the title, for the word play? – Mihir Bhende May 08 '17 at 12:19
When the user types 'A', words starting with A will be fetched from the indexed collection, which was populated by extracting words from the source collection. The indexed collection would have documents with the extracted word as the value of the "_id" field. The other fields would hold the document IDs of the documents containing this word. The fetch can be done with a regex-like query. The returned words would be sent to the user. When the user selects a word, we will look it up in the indexed collection, get the document locations from there, and then query the source collection to fetch the actual docs. – Jbaur May 08 '17 at 13:12
Basically, we want to be able to search on any text-type field in the document so that we can provide suggestions to the user accordingly. We can even provide suggestions in the way you have suggested, if feasible. – Jbaur May 08 '17 at 13:14
Okay, I got you. There are a few things to take care of: 1. You need some protection in the form of a minimum input length. For example, if I just type a, there can be a huge number of possibilities. If you set a minimum limit of 3 for autosuggest, it will be more efficient. – Mihir Bhende May 08 '17 at 13:22
Secondly, if you have fields covered by a text index, you can always search for documents containing your search query using regex and a $text find query in MongoDB. This will give you all the documents that contain the search query. Then you can apply a limit if needed. – Mihir Bhende May 08 '17 at 13:24
As I mentioned, having 2 different collections, one for words and one for documents, will be extra overhead for you to manage. Also, there is an additional $lookup (join) query you need to run to get the documents from the _ids returned by the find query. – Mihir Bhende May 08 '17 at 13:25
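
The lookup flow discussed in these comments (prefix suggestions from a word index, a minimum input length, then fetching the documents for the selected word) can be sketched in plain Python as below. This is only an illustration under stated assumptions: the in-memory word_index stands in for the indexed collection, and the names MIN_PREFIX_LENGTH, suggest, and documents_for are hypothetical, not part of MongoDB or Spark.

```python
from bisect import bisect_left

# Assumed threshold, following the suggested minimum limit of 3.
MIN_PREFIX_LENGTH = 3

# Hypothetical word index: word -> ids of the documents that contain it.
word_index = {
    "play": ["d1", "d3"],
    "player": ["d2"],
    "playing": ["d3"],
    "spark": ["d4"],
}
sorted_words = sorted(word_index)

def suggest(prefix, limit=10):
    """Return up to `limit` indexed words that start with `prefix`."""
    prefix = prefix.lower()
    if len(prefix) < MIN_PREFIX_LENGTH:
        return []  # too short: the candidate set would be too large
    # Binary search for the first word that is >= the prefix.
    start = bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix) or len(matches) == limit:
            break
        matches.append(word)
    return matches

def documents_for(word):
    """Look up the document ids for a word the user selected."""
    return word_index.get(word, [])

print(suggest("pla"))         # ['play', 'player', 'playing']
print(suggest("a"))           # []
print(documents_for("play"))  # ['d1', 'd3']
```

In MongoDB, the analogous lookup would presumably be an anchored prefix regex (for example ^pla) on the word field, followed by a second query on the source collection for the returned document ids.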
