
I have a program in C# whose pseudocode looks like this. The program runs fine, except it takes 5 days to run, and I need to run it every day with new criteria. We are trying to tag every flower name found in biology books; each day it is a different textbook or journal in .txt format. We are using Lucene search to make this faster. We have a list of FlowerFamilyIDs in the database, and a CSV file of FlowerID|FlowerCommonName pairs with about 10,000 entries.

Step 1// Get FlowerFamilyID values from SQL Server and dump them into a text file. This has about 100 entries:

FlowerFamilyID  FamilyName
1               Acanthaceae
2               Agavaceae

Step 2// Read the CSV file of flowerid|flowercommonname pairs (about 10,000 entries) and store them in a list, e.g.:

1|Rose   
2|American water willow 
3|false aloe

Step 3// A Lucene index is created on the book/journal for that day

Step 4// For every FlowerFamilyID from the text file, call SearchFlower(flowerFamilyID, flowerList); it returns all flowers found in that book/journal

Step 5// In the search function I call the Lucene query parser and search for each of the 10,000 flower entries; if found, store the first hit with its score in a list:

    public static List<flowerResult> searchText(String flowerfamilyid, List<flower> flowers)
    {
        DateTime startdate = DateTime.Now;
        List<flowerResult> results = new List<flowerResult>();
        Document doc = new Document();

        foreach (var flower in flowers)
        {
            string[] separators = { ",", ".", "!", "?", ";", ":", " " };
            string value = flower.getFlower().Trim().ToLower();
            string[] words = value.Split(separators, StringSplitOptions.RemoveEmptyEntries);
            string criteria = string.Empty;
            if (words.Length > 1)
                criteria = "\"" + value + "\"";
            else
                criteria = value;

            if (string.IsNullOrEmpty(criteria))
                continue;
            criteria = criteria.Replace("\r", " ");
            criteria = criteria.Replace("\n", " ");

            QueryParser queryParser = new QueryParser(VERSION, "body", analyzer);
            string special = " +body:" + criteria;
            Query query = queryParser.Parse(special);
            try
            {
                IndexReader reader = IndexReader.Open(luceneIndexDirectory, true);
                Searcher indexSearch = new IndexSearcher(reader);
                TopDocs hits = indexSearch.Search(query, 1);
                if (hits.TotalHits > 0)
                {
                    float score = hits.ScoreDocs[0].Score;
                    if (score > MINSCORE)
                    {
                        flowerResult result = new flowerResult(flower.getId(), flower.getFlower(), score);
                        results.Add(result);
                    }
                }
                indexSearch.Dispose();
                reader.Dispose();
                indexWriter.Dispose();
            }
            catch (ParseException e)
            {
                // "Could not parse article. Details: " + e.Message
            }
        }

        return results;
    }

public class flower
{
    public long flowerID {get;set;}
    public string familyname {get;set;}
    public string flowerName {get;set;} // common name (a member can't share the enclosing class's name)
}

I tried running this and it completed in 5 days, but I need to finish within a day because the results are used for further analysis. So I split the CSV file into 10 different files and the job completed in 2 days. My team leader told me to use multiple threads to improve the speed, but I have no clue how to do that. Can somebody help me?

Thanks R

newbieCSharp
  • You would have to show a lot more code to have a hope of getting some insight. The way to resolve this would be to profile the code to see where the time is taken - it's likely that an expensive operation is repeated more times than necessary. The timings for this volume of data sound far too high - I'd think minutes was too high! – Charles Mager Jun 29 '14 at 11:12
  • For starters, throw in a generous amount of `Console.WriteLine(DateTime.Now.ToString())` calls, just about everywhere, and certainly between each step you mentioned. Then come back! – TaW Jun 29 '14 at 11:18
  • I did throw in Console statements. In Step 5, I have a for loop for 10,000 entries. Every time I need to read the csv file to get the next flower. This is taking a long time. Lucene search is very fast – newbieCSharp Jun 29 '14 at 11:24
  • Show us some code - you should be able to read all 10,000 items from a CSV in under a second. – Charles Mager Jun 29 '14 at 11:26
  • Are you reading the same text file 10,000 times? This should still be faster than hours. But why not keep it in memory? If step 5 is the culprit, can you show us the code? – TaW Jun 29 '14 at 11:27
  • It doesn't appear that this code contains anything to do with reading from a CSV. Is this the slow part? From my brief search on Lucene.NET, creating an `IndexReader` is 'extremely expensive'. I would suggest creating the `IndexReader` & `IndexSearcher` in the outer scope and re-using them in the loop. – Charles Mager Jun 29 '14 at 11:35
  • Yes @CharlesMager - I agree: create the IndexWriter, IndexReader and IndexSearcher outside the `foreach` statement. That should help enormously. Also see [How to make searching faster](http://wiki.apache.org/lucene-java/ImproveSearchingSpeed) and [Optimizing Lucene performance](http://stackoverflow.com/questions/668441/optimizing-lucene-performance). – groverboy Jun 30 '14 at 01:41

1 Answer


I've had a similar problem to this before and have found the foreach to usually be the issue. Here you're iterating the list of 10,000 entries. While this isn't a problem in itself, you've not said what the flower class looks like. Is it just a couple of strings?

Now, when a foreach iterates over a list, each element is copied into the loop variable. For a reference type that copy is just the reference, but if the element were a value type (a struct) the whole value would be copied on every iteration. So for example your class may look like this

public class flower
{
    public string genus {get;set;}
    public string colour {get;set;}
    public int occurrences {get;set;}
    public int page {get;set;}
    public int chapter {get;set;}
    public string commonName {get;set;}
}

You then have a list of 10,000 of these.

Each time you have the foreach, you have

foreach (flower myFlower in flowers)

this assigns each element of flowers to the loop variable myFlower in turn (for a class, that is just a reference copy).

You then perform some sort of string operation on the flower (I don't know what getFlower does, but I'll assume it returns a string).

Next you check for the string being null or empty and continue if it is; otherwise you do the search. The search, though, is going to be slow. (Note that the Dispose() calls shouldn't simply be dropped: Lucene readers and searchers hold unmanaged resources that the GC won't release promptly on its own.)

How I would do it is to have the foreach operate only on whatever it needs from the flowers list and nothing else. It may also be possible to refactor the code so the query parser, reader and searcher are instantiated once instead of every time within the loop. Another trick I've found is to use LINQ if possible. You are also creating the separator array 10,000 times - take it outside the loop.
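A minimal sketch of that refactoring, assuming Lucene.Net 3.x and the same fields the question uses (luceneIndexDirectory, analyzer, VERSION, MINSCORE): the reader, searcher, query parser and separator array are created once, and only the query text changes per flower.

```csharp
// Sketch only: hoists the per-flower allocations out of the loop.
// Assumes the question's fields (luceneIndexDirectory, analyzer,
// VERSION, MINSCORE) and Lucene.Net 3.x APIs.
public static List<flowerResult> SearchTextFast(List<flower> flowers)
{
    var results = new List<flowerResult>();
    string[] separators = { ",", ".", "!", "?", ";", ":", " " };

    // Opening a reader/searcher is expensive - do it once, not 10,000 times.
    using (var reader = IndexReader.Open(luceneIndexDirectory, true))
    using (var searcher = new IndexSearcher(reader))
    {
        var parser = new QueryParser(VERSION, "body", analyzer);

        foreach (var f in flowers)
        {
            string value = f.getFlower().Trim().ToLower()
                            .Replace("\r", " ").Replace("\n", " ");
            if (string.IsNullOrEmpty(value)) continue;

            // quote multi-word names so they are searched as a phrase
            bool isPhrase = value.Split(separators,
                StringSplitOptions.RemoveEmptyEntries).Length > 1;
            string criteria = isPhrase ? "\"" + value + "\"" : value;

            try
            {
                Query query = parser.Parse("+body:" + criteria);
                TopDocs hits = searcher.Search(query, 1);
                if (hits.TotalHits > 0 && hits.ScoreDocs[0].Score > MINSCORE)
                    results.Add(new flowerResult(f.getId(), f.getFlower(),
                                                 hits.ScoreDocs[0].Score));
            }
            catch (ParseException) { /* skip unparsable names */ }
        }
    }
    return results;
}
```

The using blocks also make the Dispose() calls automatic, so nothing leaks even if a search throws.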

if (words.Length > 1)
    criteria = "\"" + value + "\"";
else
    criteria = value;

can be rewritten as

criteria = words.Length > 1 ? "\"" + value + "\"" : value;

and

flowerResult result = new flowerResult(flower.getId(), flower.getFlower(), score);
results.Add(result);

can be combined into one line (avoids having to create a temporary variable)

results.Add(new flowerResult(flower.getId(), flower.getFlower(), score));

Here you're also calling getFlower() a second time on the same processed data - you already stored it in the variable value. Remember, every extra method call inside a hot loop adds up.

These are all assumptions based on what you've put here.

One other thing to try is not using an explicit type name. For example

IndexReader reader = IndexReader.Open(luceneIndexDirectory, true);
Searcher indexSearch = new IndexSearcher(reader);
TopDocs hits = indexSearch.Search(query, 1);

are all fine, but when I've run tests with .NET 4, the var form is faster depending on what you do. Therefore try

var reader = IndexReader.Open(luceneIndexDirectory, true);
var indexSearch = new IndexSearcher(reader);
var hits = indexSearch.Search(query, 1);

The only time I've found them slower is when you use primitives (such as int, double and bool)

Little things, but they may speed things up.
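As for the multiple threads the question actually asks about: a hedged sketch using Parallel.ForEach, on the assumption that one IndexSearcher is shared across threads (Lucene documents concurrent Search() calls as thread-safe) while each worker gets its own QueryParser (which is not thread-safe). It again assumes the question's fields and Lucene.Net 3.x.

```csharp
// Sketch only: parallel per-flower search over one shared searcher.
// Requires System.Linq, System.Collections.Concurrent, System.Threading.Tasks.
public static List<flowerResult> SearchTextParallel(List<flower> flowers)
{
    var results = new ConcurrentBag<flowerResult>();

    using (var reader = IndexReader.Open(luceneIndexDirectory, true))
    using (var searcher = new IndexSearcher(reader)) // safe to share for reads
    {
        Parallel.ForEach(
            flowers,
            // QueryParser is not thread-safe: give each worker its own
            () => new QueryParser(VERSION, "body", analyzer),
            (f, loopState, parser) =>
            {
                string value = f.getFlower().Trim().ToLower();
                if (!string.IsNullOrEmpty(value))
                {
                    var hits = searcher.Search(
                        parser.Parse("+body:\"" + value + "\""), 1);
                    if (hits.TotalHits > 0 && hits.ScoreDocs[0].Score > MINSCORE)
                        results.Add(new flowerResult(
                            f.getId(), f.getFlower(), hits.ScoreDocs[0].Score));
                }
                return parser; // keep the thread-local parser for the next item
            },
            parser => { }); // nothing to merge per thread
    }
    return results.ToList();
}
```

Even so, as the comments note, the single-threaded version should already take minutes rather than days once the reader is reused, so parallelism is the second optimisation here, not the first.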

Nodoid
  • `var` compiles to **identical** IL as not using `var`. http://stackoverflow.com/questions/356846/will-using-var-affect-performance I think anything you suggest here would add up to microseconds - the issue is in days, so there are bigger problems than this. The compiler would optimise away most of these changes. `Dispose()` may be needed depending on the type - the GC will not clean up unmanaged resources. – Charles Mager Jun 29 '14 at 19:39
  • Thanks for the clarification on the compilation of var. I'm aware the GC won't clean everything up. I have a feeling the slowness here comes down to passing massive amounts of data around within the loop instead of just what is needed, or to something going wrong in the methods not documented here. – Nodoid Jun 29 '14 at 22:13
  • Hi everybody, I ran my program with the changes but it didn't make much difference. I decided to make it multithreaded and ended up dividing the csv file into 10 sub-files. I ran 10 instances of the program; this reduced the time from 5 days to 18 hours. – newbieCSharp Jul 01 '14 at 17:23
  • @newbieCSharp seriously, single threaded this should take a matter of minutes at most - possibly even seconds. Talk of days or hours is well off. Is there anywhere you can put all the code & a sample file? I'd be happy to have a look! – Charles Mager Jul 02 '14 at 07:50