For example:

doc1 = "I got the new Apple iPhone 8";
doc2 = "have you seen the new Apple iPhone 8?";
doc3 = "the Apple iPhone 8 is out";
doc4 = "another doc without the common words";

find_commons(["doc1", "doc2", "doc3", "doc4"]);

results: {{"doc1", "doc2", "doc3"}, {"Apple", "iPhone"}} or something similar

A related question: is there a better library or system for achieving this with Lucene's data?

Cheyenne Forbes
  • Do you mean that you wouldn't supply a query string, and that the documents should identify the common words on their own? – Sabir Khan Apr 17 '17 at 12:40

1 Answer

Yes, you can use term vectors to retrieve this information.

First, make sure that term vectors are stored in the index when you create the documents, e.g.:

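import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexOptions;

private static Document createDocument(String title, String content) {
    Document doc = new Document();

    // The title is indexed as a single token and stored so it can be read back.
    doc.add(new StringField("title", title, Field.Store.YES));

    // The content is analyzed and its term vector is kept; the raw text is not stored.
    FieldType type = new FieldType();
    type.setTokenized(true);
    type.setStoreTermVectors(true);
    type.setStored(false);
    type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    doc.add(new Field("content", content, type));

    return doc;
}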

Then, you can retrieve the term vector for a given document id:

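// Needs IndexReader, Terms and TermsEnum from org.apache.lucene.index,
// and BytesRef from org.apache.lucene.util.
private static List<String> getTermsForDoc(int docId, String field, IndexReader reader) throws IOException {
    List<String> result = new ArrayList<>();

    Terms terms = reader.getTermVector(docId, field);
    if (terms == null) {
        // No term vector was stored for this document and field.
        return result;
    }

    // Walk all terms of the vector and decode each one to a String.
    TermsEnum it = terms.iterator();
    for (BytesRef br = it.next(); br != null; br = it.next()) {
        result.add(br.utf8ToString());
    }

    return result;
}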

Finally, you can retrieve the common terms for two documents:

private static List<String> getCommonTerms(int docId1, int docId2, IndexSearcher searcher) throws IOException {
    // getTermsForDoc expects an IndexReader, so take it from the searcher.
    IndexReader reader = searcher.getIndexReader();

    // Using the field "content" is just an example here.
    List<String> termList1 = getTermsForDoc(docId1, "content", reader);
    List<String> termList2 = getTermsForDoc(docId2, "content", reader);

    // Keep only the terms that also occur in the second document.
    termList1.retainAll(termList2);
    return termList1;
}

Of course this can easily be expanded to allow an arbitrary number of documents.
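As a rough sketch of that expansion, the intersection can be taken over any number of term lists. The class name CommonTermsSketch and the method commonTerms are made up for this illustration and are not part of Lucene; the sketch assumes each inner collection holds the terms of one document, for example the result of getTermsForDoc.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: works on plain term lists.
final class CommonTermsSketch {

    // Returns the terms that occur in every given term list.
    // The result keeps the order in which the terms appear in the first list.
    static Set<String> commonTerms(List<? extends Collection<String>> termLists) {
        if (termLists.isEmpty()) {
            return new LinkedHashSet<>();
        }

        // Start with the terms of the first document ...
        Set<String> common = new LinkedHashSet<>(termLists.get(0));

        // ... and drop every term that is missing from any later document.
        for (Collection<String> terms : termLists.subList(1, termLists.size())) {
            common.retainAll(new HashSet<>(terms));
        }
        return common;
    }
}
```

To use it, collect one term list per document id with getTermsForDoc and pass the lists to commonTerms. Note that the terms in the index are whatever the analyzer produced, so with a lowercasing analyzer you would get "apple" rather than "Apple".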

Philipp Ludwig
  • Can I also use term vectors to find the most common word? For example, if I give 5 doc ids and run the code on all 5, and only 3 or 4 of the docs have a word in common, I'd still want to get that word (even if it's not in 2 of them). – Cheyenne Forbes Apr 17 '17 at 14:46
  • You could put all terms into one big list and then use something like this: http://stackoverflow.com/questions/19031213/java-get-most-common-element-in-a-list – Philipp Ludwig Apr 17 '17 at 15:24
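One possible way to do that counting, as a minimal sketch: tally in how many documents each term occurs and keep the terms that reach a threshold. The names TermDocumentCounts, countDocuments and termsInAtLeast are invented for this example, and the input is assumed to be one term list per document, such as the lists returned by getTermsForDoc.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene: works on plain term lists.
final class TermDocumentCounts {

    // Maps each term to the number of documents it occurs in.
    static Map<String, Integer> countDocuments(List<? extends Collection<String>> termLists) {
        Map<String, Integer> counts = new HashMap<>();
        for (Collection<String> terms : termLists) {
            // De-duplicate first so a term counts once per document.
            for (String term : new HashSet<>(terms)) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns, in sorted order, the terms that occur in at least minDocs documents.
    static Set<String> termsInAtLeast(List<? extends Collection<String>> termLists, int minDocs) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : countDocuments(termLists).entrySet()) {
            if (entry.getValue() >= minDocs) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

With 5 documents, calling termsInAtLeast(termLists, 3) would return the words shared by 3 or more of them, even when the remaining documents do not contain those words.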