I want to implement a "Find in Files" feature, similar to the one in IDEs, using Lucene. Basically, I want to search source code files such as .c, .cpp, .h, .cs and .xml. I tried the demo shown on the Apache website. It returns the list of matching files, but without line numbers or the number of occurrences in each file. I am sure there must be some way to get them.

Is there any way to get those details?

gramcha

2 Answers

1

Can you please share the link to the demo shown on the Apache website?

Here I show you how to get the frequency of a term in each document of a result set:

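// Imports for Lucene 3.6 (shared by all three snippets in this answer)
import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public static void main(final String[] args) throws CorruptIndexException,
            LockObtainFailedException, IOException {

        // Create the index
        final Directory directory = new RAMDirectory();
        final Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        final IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_36, analyzer);
        final IndexWriter writer = new IndexWriter(directory, config);

        // addDoc(writer, field, text);
        addDoc(writer, "title", "foo");
        addDoc(writer, "title", "buz qux");
        addDoc(writer, "title", "foo foo bar");

        // Search
        final IndexReader reader = IndexReader.open(writer, false);
        final IndexSearcher searcher = new IndexSearcher(reader);

        final Term term = new Term("title", "foo");
        final Query query = new TermQuery(term);
        System.out.println("Query: " + query.toString() + "\n");

        final int limitShow = 3;
        final TopDocs td = searcher.search(query, limitShow);
        final ScoreDoc[] hits = td.scoreDocs;

        // Take IDs and frequencies. Use hits.length, not td.totalHits:
        // totalHits can exceed the number of hits actually returned.
        final int[] docIDs = new int[hits.length];
        for (int i = 0; i < hits.length; i++) {
            docIDs[i] = hits[i].doc;
        }
        final Map<Integer, Integer> id2freq = getFrequencies(reader, term,
                docIDs);

        // Show results
        for (int i = 0; i < hits.length; i++) {
            final int docNum = hits[i].doc;
            final Document doc = searcher.doc(docNum);
            System.out.println("\tposition " + i);
            System.out.println("Title: " + doc.get("title"));
            final int freq = id2freq.get(docNum);
            System.out.println("Occurrences of \"" + term.text() + "\" in \""
                    + term.field() + "\" = " + freq);
            System.out.println("--------------------------------\n");
        }
        searcher.close();
        reader.close();
        writer.close();
    }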

Here we add the documents to the index:

private static void addDoc(final IndexWriter w, final String field,
            final String text) throws CorruptIndexException, IOException {
        final Document doc = new Document();
        // Add the field once; adding it twice would double every term frequency
        doc.add(new Field(field, text, Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
}

This is an example of how to get the number of occurrences of a term in each document:

public static Map<Integer, Integer> getFrequencies(
        final IndexReader reader, final Term term, final int[] docIDs)
        throws CorruptIndexException, IOException {
    final Map<Integer, Integer> id2freq = new HashMap<Integer, Integer>();
    // TermDocs only moves forward, while hits arrive sorted by score,
    // so visit the doc IDs in ascending order
    final int[] sortedIDs = docIDs.clone();
    Arrays.sort(sortedIDs);
    final TermDocs tds = reader.termDocs(term);
    if (tds != null) {
        try {
            for (final int docID : sortedIDs) {
                // Skip to the first entry whose doc ID is >= docID
                if (!tds.skipTo(docID)) {
                    break;
                }
                // Record the term frequency only if we landed on that doc
                if (tds.doc() == docID) {
                    id2freq.put(docID, tds.freq());
                }
            }
        } finally {
            tds.close();
        }
    }
    return id2freq;
}

If you put it all together and run it, you will obtain this output:

Query: title:foo

    position 0
Title: foo
Occurrences of "foo" in "title" = 1
--------------------------------

    position 1
Title: foo foo bar
Occurrences of "foo" in "title" = 2
--------------------------------
Luca Mastrostefano
  • http://lucene.apache.org/core/4_3_1/demo/overview-summary.html#overview_description – gramcha Jun 24 '13 at 07:10
  • I have not written any code yet. I just created a Lucene index for a directory using the provided IndexFiles binary. Searching for a word in that index returns the names of the files that contain it, but I need a bit more information, such as the number of occurrences in each file and the line number of each match. – gramcha Jun 24 '13 at 07:17
  • The simplest solution is to index every line separately (with a common file_ID and a unique line_number), execute the query, and check the results to extract the number of occurrences and the lines on which they appear. Otherwise, at http://stackoverflow.com/questions/1311199/finding-the-position-of-search-hits-from-lucene you can find something similar to what you want. – Luca Mastrostefano Sep 06 '13 at 09:28
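
The per-line idea from the last comment can be sketched without any Lucene calls. The plain-Java stand-in below is only an illustration under stated assumptions: `LineIndex`, `Hit`, `addLine`, `search` and `occurrencesPerFile` are hypothetical names, not Lucene API, and the tokenizer is a crude lowercase split. It treats each line as its own small document keyed by file ID and line number, so one lookup gives both the matching lines and the per-file occurrence count. In a real Lucene index, each line would presumably become a Document carrying file_ID and line_number fields.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class LineIndex {
    /** One posting: the term occurs `count` times on this line of this file. */
    public static final class Hit {
        public final String fileId;
        public final int lineNumber;
        public final int count;

        Hit(String fileId, int lineNumber, int count) {
            this.fileId = fileId;
            this.lineNumber = lineNumber;
            this.count = count;
        }
    }

    // term -> postings, in the order the lines were added
    private final Map<String, List<Hit>> postings = new HashMap<>();

    /** Index one line as its own "document". Line numbers are 1-based. */
    public void addLine(String fileId, int lineNumber, String line) {
        // Crude analyzer (an assumption): lowercase, split on non-identifier characters
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : line.toLowerCase(Locale.ROOT).split("[^a-z0-9_]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            postings.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                    .add(new Hit(fileId, lineNumber, e.getValue()));
        }
    }

    /** All lines that contain the term, with the per-line count. */
    public List<Hit> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.<Hit>emptyList());
    }

    /** Total occurrences of the term per file, summed over its lines. */
    public Map<String, Integer> occurrencesPerFile(String term) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Hit h : search(term)) {
            totals.merge(h.fileId, h.count, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        LineIndex index = new LineIndex();
        index.addLine("a.c", 1, "int foo = 0;");
        index.addLine("a.c", 2, "foo = foo + bar;");
        index.addLine("b.h", 1, "extern int bar;");
        for (Hit h : index.search("foo")) {
            System.out.println(h.fileId + ":" + h.lineNumber + " (" + h.count + ")");
        }
        System.out.println(index.occurrencesPerFile("foo"));
    }
}
```

The trade-off is index size: one entry per line instead of one per file, in exchange for getting line numbers straight from the search results.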
-1

I asked on many forums and got no response. In the end, @Luca Mastrostefano's answer gave me an idea for getting the line number details.

Taginfo from the Lucene searcher returns the file names, and I think that is sufficient to get the line numbers. The demo index does not store the actual file content; it holds only the analyzed terms (an inverted index, not hash values) plus the file path, so the line numbers cannot be read from it directly. Hence, I assume the only way is to take that path, read the file, and find the line numbers myself.

public static void PrintLines(string filepath, string key)
    {
        int lineNumber = 1;
        string line;

        // Read the file line by line and print every line that contains the key.
        // The using block closes the reader even if reading throws.
        using (var file = new System.IO.StreamReader(filepath))
        {
            while ((line = file.ReadLine()) != null)
            {
                // Note: Contains is a case-sensitive substring test
                if (line.Contains(key))
                {
                    Console.WriteLine("\t" + lineNumber + ": " + line);
                }
                lineNumber++;
            }
        }
    }

Call this function with each file path returned by the Lucene searcher.
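
The question also asks for the number of occurrences, and the same scan can count them as it goes. Here is a minimal Java sketch of that idea, assuming the file path comes from the search hit. `MatchScanner`, `countOccurrences` and `findMatches` are hypothetical names, and the match is a plain case-sensitive substring test, so it may disagree with what an analyzed Lucene query matched.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MatchScanner {
    /** Count non-overlapping occurrences of key in line. */
    public static int countOccurrences(String line, String key) {
        if (key.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = line.indexOf(key);
        while (from >= 0) {
            count++;
            from = line.indexOf(key, from + key.length());
        }
        return count;
    }

    /** Map of 1-based line number to occurrence count, matching lines only. */
    public static Map<Integer, Integer> findMatches(List<String> lines, String key) {
        Map<Integer, Integer> matches = new LinkedHashMap<>();
        int lineNumber = 1;
        for (String line : lines) {
            int n = countOccurrences(line, key);
            if (n > 0) {
                matches.put(lineNumber, n);
            }
            lineNumber++;
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: MatchScanner <file> <key>");
            return;
        }
        // args[0] is assumed to be a path returned by the searcher
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        int total = 0;
        for (Map.Entry<Integer, Integer> e : findMatches(lines, args[1]).entrySet()) {
            System.out.println("\t" + e.getKey() + ": " + e.getValue() + " occurrence(s)");
            total += e.getValue();
        }
        System.out.println("Total: " + total);
    }
}
```

Reading the whole file with `Files.readAllLines` keeps the sketch short; for very large files a streaming reader would be the safer choice.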

gramcha