
I have developed my own indexer in Lucene 5.2.1. I am trying to index a 1.5 GB file, and I need to do some non-trivial computation on every single document of the collection at indexing time.

The problem is that the whole indexing takes almost 20 minutes! I have followed this very helpful wiki, but it is still way too slow. I have tried increasing the Eclipse heap space and the Java VM memory, but the bottleneck seems to be the hard disk rather than memory (I am using a laptop with 6 GB of RAM and an ordinary hard disk).

I have read this discussion, which suggests using a RAMDirectory or mounting a RAM disk. The problem with a RAM disk would be persisting the index to my filesystem (I don't want to lose the index after a reboot). The problem with RAMDirectory, instead, is that according to the API docs I should not use it because my index is more than "several hundred megabytes":

Warning: This class is not intended to work with huge indexes. Everything beyond several hundred megabytes will waste resources (GC cycles), because it uses an internal buffer size of 1024 bytes, producing millions of byte[1024] arrays. This class is optimized for small memory-resident indexes. It also has bad concurrency on multithreaded environments.

Here you can find my code:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Date;

import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;

import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

//FileUtils, LanguageUtils and ReviewWrapper are my own helper classes

public class ReviewIndexer {

private JSONParser parser;
private PerFieldAnalyzerWrapper reviewAnalyzer;
private IndexWriterConfig iwConfig;
private IndexWriter indexWriter;

public ReviewIndexer() throws IOException{
    parser = new JSONParser();
    reviewAnalyzer = new ReviewWrapper().getPFAWrapper();
    iwConfig = new IndexWriterConfig(reviewAnalyzer);
    //change ram buffer size to speed things up
    //@url https://wiki.apache.org/lucene-java/ImproveIndexingSpeed
    iwConfig.setRAMBufferSizeMB(2048);
    //little speed increase
    iwConfig.setUseCompoundFile(false);
    //iwConfig.setMaxThreadStates(24);
    // Set to overwrite the existing index
    indexWriter = new IndexWriter(FileUtils.openDirectory("review_index"), iwConfig);
}

/**
 * Indexes every review. 
 * @param file_path : the path of the yelp_academic_dataset_review.json file
 * @throws IOException
 * @return Returns true if everything goes fine.
 */
public boolean indexReviews(String file_path) throws IOException{
    BufferedReader br;
    try {
        //open the file
        br = new BufferedReader(new FileReader(file_path));
        String line;
        //define fields
        StringField type = new StringField("type", "", Store.YES);
        String reviewtext = "";
        TextField text = new TextField("text", "", Store.YES);
        StringField business_id = new StringField("business_id", "", Store.YES);
        StringField user_id = new StringField("user_id", "", Store.YES);
        LongField stars = new LongField("stars", 0, LanguageUtils.LONG_FIELD_TYPE_STORED_SORTED);
        LongField date = new LongField("date", 0, LanguageUtils.LONG_FIELD_TYPE_STORED_SORTED);
        StringField votes = new StringField("votes", "", Store.YES);
        Date reviewDate;
        JSONObject jsonVotes;
        try {
            indexWriter.deleteAll();
            //scan the file line by line
            //TO-DO: split in chunks and use parallel computation
            while ((line = br.readLine()) != null) {
                try {
                    JSONObject jsonline = (JSONObject) parser.parse(line);
                    Document review = new Document();
                    //add values to fields
                    type.setStringValue((String) jsonline.get("type"));
                    business_id.setStringValue((String) jsonline.get("business_id"));
                    user_id.setStringValue((String) jsonline.get("user_id"));
                    stars.setLongValue((long) jsonline.get("stars"));
                    reviewtext = (String) jsonline.get("text");
                    //non-trivial function being calculated here
                    text.setStringValue(reviewtext);
                    reviewDate = DateTools.stringToDate((String) jsonline.get("date"));
                    date.setLongValue(reviewDate.getTime());
                    jsonVotes = (JSONObject) jsonline.get("votes");
                    votes.setStringValue(jsonVotes.toJSONString());
                    //add fields to document
                    review.add(type);
                    review.add(business_id);
                    review.add(user_id);
                    review.add(stars);
                    review.add(text);
                    review.add(date);
                    review.add(votes);
                    //write the document to index
                    indexWriter.addDocument(review);
                } catch (ParseException | java.text.ParseException e) {
                    e.printStackTrace();
                    br.close();
                    return false;
                }
            }//end of while
        } catch (IOException e) {
            e.printStackTrace();
            br.close();
            return false;
        }
        //close buffer reader and commit changes
        br.close();
        indexWriter.commit();
    } catch (FileNotFoundException e1) {
            e1.printStackTrace();
            return false;
    }
    System.out.println("Done.");
    return true;
}

public void close() throws IOException {
    indexWriter.close();
}

}

What is the best thing to do, then? Should I build a RAM disk and copy the index to the filesystem once it is done, use RAMDirectory anyway, or something else? Many thanks.
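
For reference, this is roughly what I imagine the "copy to filesystem once done" option would look like. This is a minimal sketch (untested), based on IndexWriter.addIndexes() and using FSDirectory.open() directly instead of my FileUtils helper; reviewAnalyzer is the analyzer from my code above. Given the javadoc warning quoted earlier, the MMapDirectory that FSDirectory.open() already returns on 64-bit platforms may be a safer middle ground than RAMDirectory for an index this size.

//Minimal sketch (untested): build the whole index in a RAMDirectory,
//then persist it to disk in one shot with IndexWriter.addIndexes().
RAMDirectory ramDir = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(ramDir, new IndexWriterConfig(reviewAnalyzer));
//... all the addDocument() calls from indexReviews() go here ...
ramWriter.close();
try (FSDirectory diskDir = FSDirectory.open(Paths.get("review_index"));
     IndexWriter diskWriter = new IndexWriter(diskDir, new IndexWriterConfig(reviewAnalyzer))) {
    diskWriter.addIndexes(ramDir); //copies the finished in-memory index to disk
    diskWriter.commit();
}
ramDir.close();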

    Have you profiled your code to see which portion is taking the most time? I suspect the actual indexing is not slow but your file reading process. – user1071777 Aug 26 '15 at 14:49
  • Do you need to read through the file in a sequential manner? If not, you could speed things up with parallelization: split the file into chunks and compute each chunk in a process of its own. – cheffe Aug 27 '15 at 04:59
  • Also, if you have mostly unique terms, indexing will get slower and slower. Can you provide more info on the fields / terms you want to index? – Rob Audenaerde Aug 27 '15 at 07:00
  • @user1071777 no, I haven't profiled my code. Are there some simple tools for this? Anyway, it seems that the best idea would be to use parallelization. Are there libraries for this kind of thing, so that I can avoid implementing my own parallel computation pipeline? – mazerone Aug 27 '15 at 08:06
  • @RobAu I have updated my code. The fields very often contain unique terms. The "text" field, though, contains free text (it is the actual body of the document). – mazerone Aug 27 '15 at 08:28

2 Answers


Lucene claims 150 GB/hour on modern hardware; that is with 20 indexing threads on a 24-core machine.

You have one thread, so expect about 150/20 = 7.5 GB/hour. You will probably see that one core is working at 100% and the rest only work when merging segments.

You should use multiple indexing threads to speed things up. See, for example, the luceneutil Indexer.java for inspiration; a sketch follows below.

As you have a laptop, I suspect you have either 4 or 8 cores, so multi-threading should give your indexing a nice boost.
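
A minimal sketch (untested) of what the feeding could look like. buildDocument() is a hypothetical helper you would extract from the loop in your question; each call must create its own Document and Field objects, because IndexWriter is thread-safe but the Field-reuse pattern in your code is not. For a 1.5 GB file you would also want a bounded work queue so the reader cannot outrun the workers.

//One reader thread feeds JSON lines to a pool of indexing workers.
ExecutorService pool = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());
try (BufferedReader br = new BufferedReader(new FileReader(file_path))) {
    String line;
    while ((line = br.readLine()) != null) {
        final String json = line;
        pool.submit(() -> {
            try {
                //buildDocument(): hypothetical helper doing the JSON parsing
                //and the non-trivial computation for one review.
                //Concurrent addDocument() calls on one IndexWriter are safe.
                indexWriter.addDocument(buildDocument(json));
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
    }
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.HOURS);
indexWriter.commit();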

Rob Audenaerde

You can try setMaxThreadStates in IndexWriterConfig:

iwConfig.setMaxThreadStates(50);
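
Note that, as far as I understand it, setMaxThreadStates only caps the number of internal thread states the writer keeps; it does not spawn any threads itself, so raising it far beyond the number of threads actually calling addDocument should not make a difference.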

Paweł
  • Thank you for your advice. I have tried increasing the thread states and it is actually speeding things up a little bit. Maybe I can try an even higher value, e.g. 100 thread states? – mazerone Aug 27 '15 at 09:14