I would like to know the best approaches for finding the similarity between two documents that contain the same information but elaborate and illustrate it in different ways.
Example: multiple news sources report the same news in different ways, and I need to keep only one article and remove all the others that are similar. In other words, deduplication of articles.
A history of articles is also maintained: if an article with similar content has already been received, the new article needs to be discarded.
In a scenario like the one above, how can I identify article similarity?
I have been reading about scoring algorithms, and it seems to me that cosine similarity does a better job, but performance is a consideration. Each comparison costs about O(m+n) for one document of length m and another of length n, so the cost goes up as the texts being compared become larger.
Retrieving and comparing against the documents in the history adds to this, which makes it an impractical solution.
Lucene seems like a good option, but I do not have the privileges to incorporate it into my solution.
I need a pure Java-based solution.
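
To make the question concrete, here is a minimal sketch of the cosine similarity approach I have in mind, using only the JDK. It is an illustration under assumptions, not a finished implementation: the class name `CosineSimilaritySketch`, the method names, the simple tokenizer (lowercase, split on non-letter/non-digit characters) and the idea of a fixed similarity threshold are all my own placeholders. The `isDuplicate` method shows the linear scan over the history that worries me performance-wise.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: term-frequency cosine similarity with no external libraries.
final class CosineSimilaritySketch {

    // Builds a term-frequency map. Tokenization is deliberately naive (an assumption):
    // lowercase the text and split on anything that is not a letter or a digit.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine similarity of two term-frequency vectors; returns 0.0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        // Iterate over the smaller map so the dot product touches fewer entries.
        Map<String, Integer> small = a.size() <= b.size() ? a : b;
        Map<String, Integer> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : small.entrySet()) {
            Integer other = large.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        return dot / (norm(a) * norm(b));
    }

    // Compares a new article against every stored vector in the history.
    // This is one comparison per stored article, which is the cost I am concerned about.
    static boolean isDuplicate(String article, List<Map<String, Integer>> history, double threshold) {
        Map<String, Integer> candidate = termFrequencies(article);
        for (Map<String, Integer> stored : history) {
            if (cosine(candidate, stored) >= threshold) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, a new article would be checked with something like `CosineSimilaritySketch.isDuplicate(text, history, 0.8)`, where `history` holds the precomputed term-frequency maps of earlier articles and `0.8` is an arbitrary threshold I picked for illustration. Storing the maps avoids re-tokenizing old articles, but the scan over the whole history remains.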