Best approach to remove documents that contain similar content

Question

I would like to know the best approaches for finding the similarity between two documents that contain the same information, elaborated and illustrated in different ways.

Example: multiple news sources report the same news in different ways, and I need to remove all the similar articles and keep only one. In other words, deduplication of articles.

A history of articles is also maintained: if an article with similar content has already been received, we need to discard the new one.

In a scenario like the above, how can article similarity be identified?

I have been reading about scoring algorithms, and cosine similarity seems to do a better job. Performance is a concern, though: as the texts being compared get larger, the cost grows, roughly O(m + n) for one document of length m and another of length n.
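To make the comparison concrete, here is a minimal sketch of cosine similarity over term-frequency vectors in plain Java. The class name `CosineSimilarity`, the method names, and the tokenization rule (lowercase, split on anything that is not a letter or digit) are assumptions made for illustration, not part of any library. Stop-word removal, stemming, and TF-IDF weighting are left out.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class CosineSimilarity {

    // Builds a term-frequency map. The tokenization rule is a simplifying assumption.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between two term-frequency vectors, in [0, 1].
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        // Iterate over the smaller map so the dot product costs O(min(|a|, |b|)).
        Map<String, Integer> small = a.size() <= b.size() ? a : b;
        Map<String, Integer> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : small.entrySet()) {
            Integer other = large.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        return dot / (norm(a) * norm(b));
    }
}
```

Calling `CosineSimilarity.cosine(termFrequencies(x), termFrequencies(y))` gives a score near 1 for articles that share most of their vocabulary and 0 for articles with no words in common. The threshold above which two articles count as duplicates would have to be tuned on real data.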

Retrieving every document in the history for comparison adds to that cost, which makes this an impractical solution.
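One way to avoid reloading full articles from the history is to keep only a small fingerprint per article and compare fingerprints instead. The sketch below uses SimHash, a technique swapped in here for illustration and not mentioned above: each article is reduced to 64 bits, and two articles are treated as near-duplicates when their fingerprints differ in at most a few bits. The class name `FingerprintHistory`, the FNV-1a token hash, and the `maxDistance` threshold are all assumptions. This kind of fingerprint tends to catch articles with mostly shared wording, and may well miss heavily reworded ones.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class FingerprintHistory {

    private final List<Long> fingerprints = new ArrayList<>();
    private final int maxDistance; // assumed threshold, to be tuned on real data

    public FingerprintHistory(int maxDistance) {
        this.maxDistance = maxDistance;
    }

    // 64-bit FNV-1a hash over the token's UTF-8 bytes.
    static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    // SimHash: every token votes +1 or -1 on each of the 64 bit positions.
    static long simHash(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= 1L << bit;
            }
        }
        return fingerprint;
    }

    // Number of bit positions in which two fingerprints differ.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Returns true and records the article if no stored fingerprint is close to it.
    public boolean addIfNew(String text) {
        long fingerprint = simHash(text);
        for (long seen : fingerprints) {
            if (hammingDistance(fingerprint, seen) <= maxDistance) {
                return false;
            }
        }
        fingerprints.add(fingerprint);
        return true;
    }
}
```

With this layout the history costs 8 bytes per article, and each comparison is a single XOR and bit count. The scan is still linear in the number of stored articles, so a very large history would need an index over the fingerprints, which is not shown here.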

Lucene seems like a good option, but I do not have the privileges to incorporate it into my solution.

I need a solution implemented in pure Java.

If the duplicate link doesn't help then drop a comment stating what you have tried and someone can reopen your question. — Tim Biegeleisen, Aug 20 '17 at 02:27
Note that you need to be **clear and detailed** about what approaches you have already tried, **and** why they didn't work for you. Otherwise, we will waste our time repeating various things that you tried ... but didn't tell us. — Stephen C, Aug 20 '17 at 03:15
@StephenC Thank you for the advice; I will make sure to follow it. I was looking for a solution, and what I actually wanted was to narrow down the best possible set of solutions to proceed with. I need to design a solution and have limited time to try things out and figure out what works best, which is why I wanted expert judgment. — pubudut, Aug 21 '17 at 02:20
So ... basically ... you want some experts to do some research for you, and present you with a summary or a short list. For free. You know that isn't going to happen. — Stephen C, Aug 21 '17 at 02:35
And even assuming that someone was to do your research for you, how are they going to identify "best" solutions when they don't know what the actual problem is, or what your criteria for "best" are? 'Cos you provided almost zero detail to answer those questions. — Stephen C, Aug 21 '17 at 02:38
Of course I didn't mean that. I just wanted solutions that have worked for this kind of problem; please do not take it the wrong way. I did not want to sit idle while other people do the hard work. I wanted some advice and guidance, nothing more, and I mean no disrespect to the people who do a good job of helping others with their time. — pubudut, Aug 21 '17 at 02:38
Well the best advice I can offer is that you would be better off doing your own research. Start out by reading the relevant Wikipedia pages, and Googling. Then look at potentially relevant tools ... or books / papers on how it is done. Then ask questions. Looking for textual differences / similarities is well understood: search about "plagiarism detection". Looking for semantic differences and similarities is HARD. (Probably too hard for a practical solution.) — Stephen C, Aug 21 '17 at 02:42
I updated my question. I will add my solution here in case it helps someone looking for something similar. — pubudut, Aug 21 '17 at 02:59


0 Answers