First I build a list (by reading existing files) of approximately 12,000 objects that look like this:

public class Operator
{
    public string identifier; // e.g. "7/1/2017 MN01 Day"
    public string name1;
    public string name2;
    public string id1;
    public string id2;
}

The identifier will be unique within the list.

Next I run a large query (currently about 4 million rows, though it could be as large as 10 million, with about 20 columns). Then I write all of this to a CSV line by line using a write stream. For each line I loop over the Operator list to find a match and append those columns.
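Roughly, the write loop works like this (I don't have the project in front of me, so the names below are illustrative rather than my actual code):

    using (var writer = new StreamWriter("report.csv"))
    {
        while (reader.Read()) // reader iterates the large query's results
        {
            string key = reader.GetString(0); // the column holding the identifier

            // Linear scan over ~12,000 operators for every one of the
            // ~4,000,000 rows -- this is what I suspect is the bottleneck.
            Operator match = null;
            foreach (var op in operators)
            {
                if (op.identifier == key)
                {
                    match = op;
                    break;
                }
            }

            writer.WriteLine(BuildCsvLine(reader, match)); // BuildCsvLine is a stand-in
        }
    }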

The problem I am having is with performance. I expect this report to take a long time to run but I've determined that the file writing step is taking especially long (about 4 hours). I suspect that it has to do with looping over the Operator list 4 million times.

Is there any way I can improve the speed of this? Perhaps by doing something when I build the list initially (indexing or sorting, maybe) that will allow searching to be done much faster.

    "I write all of this to a CSV" show your code to us – Aleks Andreev Jul 31 '17 at 17:58
  • 2
    If the file writing step is the one that's taking so long, that's probably code worth showing... – Rufus L Jul 31 '17 at 18:06
  • Creating a Dictionary of Operator instances where the identifier is the key may also speed up finding the match. – Dan Def Jul 31 '17 at 18:09
  • Create a dictionary. Finding a match by iterating through a list of N items takes an average of N/2 comparisons; binary search over a sorted list takes log2(N), and a hash-based dictionary is close to constant time per lookup. With 12,000 operators, that's a drop from about 6,000 comparisons per row to roughly 14 (binary search) or a single hash lookup. – jdweng Jul 31 '17 at 18:18
  • You can definitely speed up the looping by putting your 12,000 objects into a `Dictionary` or `Lookup` based on whatever you need to compare from your db results, but I guarantee that the IO (both retrieving from the DB and writing to the file) will dwarf the amount of time it takes to loop over 12,000 items. – juharr Jul 31 '17 at 18:31
  • Thanks everybody for your help, I didn't have my project when I made this post so I did the best I could to describe what I thought was the problem. Implementing a dictionary was exactly what I was looking for - I'll update with details about how much this improved performance when I have it. – Dan Aug 01 '17 at 16:51

3 Answers


You should be able to greatly speed up your code by building a `Dictionary<string, Operator>` (a hash table):

var items = list.ToDictionary(i => i.identifier, i => i);

You can then index into this dictionary by identifier:

var item = items["7/1/2017 MN01 Day"];

Building the dictionary is an O(n) operation, and a lookup into the dictionary is an O(1) operation on average. This means the matching step becomes linear in the number of rows, rather than proportional to rows times operators.
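In the write loop, `TryGetValue` avoids a `KeyNotFoundException` for rows with no matching operator. A minimal sketch (the column index and the `BuildCsvLine` helper are illustrative, since the original write code wasn't shown):

    var items = operators.ToDictionary(o => o.identifier);

    while (reader.Read())
    {
        string key = reader.GetString(0); // the column holding the identifier

        // O(1) average-case lookup instead of scanning ~12,000 operators per row.
        items.TryGetValue(key, out Operator match); // match is null if nothing is found

        writer.WriteLine(BuildCsvLine(reader, match));
    }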

willaien

... but also, "couldn't you somehow put those operators into a database table, so that you could use some kind of JOIN operation in your SQL?"

Another possibility that comes to mind is ... "twenty different queries, one for each symbol." Or, a UNION query with twenty branches. If there is any way for the SQL engine to use indexes, on its side, to speed up that process, you would still come out ahead.

Right now, vast amounts of time may be wasted packaging up every one of those millions of rows and squirting them through the network wires to your machine, only to discard most of them, say, because they don't match any symbol.

If you control the database and can afford the space, and if, say, most of the rows don't match any symbol, consider a symbols table and a symbols_matched table, the second being a many-to-many join table that pre-identifies which rows match which symbol(s). It might well be worth the space, to save the time. (The process of populating this table could be handled by a stored procedure fired by TRIGGERs on the appropriate insert, update, and delete events ...)
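Here is a sketch of the JOIN idea, assuming SQL Server ... every table and column name below is made up, since we haven't seen your schema:

    // Push the ~12,000 operators into a temp table, then let the server
    // do the matching. Uses System.Data.SqlClient.
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();

        new SqlCommand(
            "CREATE TABLE #Operators (identifier NVARCHAR(100) PRIMARY KEY, " +
            "name1 NVARCHAR(100), name2 NVARCHAR(100), id1 NVARCHAR(50), id2 NVARCHAR(50))",
            conn).ExecuteNonQuery();

        using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#Operators" })
        {
            bulk.WriteToServer(operatorsAsDataTable); // a DataTable built from the list
        }

        // The join happens server-side and can use the primary-key index.
        var cmd = new SqlCommand(
            "SELECT r.*, o.name1, o.name2, o.id1, o.id2 " +
            "FROM ReportRows r LEFT JOIN #Operators o ON o.identifier = r.identifier",
            conn);

        using (var reader = cmd.ExecuteReader())
        {
            // ... stream each row straight to the CSV, as before ...
        }
    }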

Mike Robinson

It is difficult to tell you how to speed up your file write without seeing any code.

But in general, it could be worth considering writing the file using multiple threads. This SO post has some helpful info, and you can of course Google for more.
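For example, one shape worth considering is a producer/consumer setup, where the query thread formats lines and queues them while a single dedicated thread does the writing, so file I/O overlaps with the database reads. A sketch (names are illustrative, since we haven't seen your code):

    // Uses System.Collections.Concurrent, System.Threading.Tasks, System.IO.
    var lines = new BlockingCollection<string>(boundedCapacity: 10000);

    var writerTask = Task.Run(() =>
    {
        using (var writer = new StreamWriter("report.csv"))
        {
            foreach (var line in lines.GetConsumingEnumerable())
                writer.WriteLine(line);
        }
    });

    while (reader.Read())
    {
        lines.Add(BuildCsvLine(reader)); // format on the reading thread
    }

    lines.CompleteAdding(); // tell the writer no more lines are coming
    writerTask.Wait();

Keeping a single writer thread and a single database connection this way avoids having multiple threads fight over the file or the connection.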

Sach
  • Careful, Sach ... threads can be a real headache, because *one* database server is now being tasked with listening to multiple threads at the same time, and what each of those threads is doing naturally conflicts with what the others are doing. This can actually make performance considerably worse – entirely depending on the situation, of course. Threads can be a blessing, or a curse. :-) If the user does use a hash table, that's going to be a dramatic improvement (maybe), even without threads, because "it's a better algorithm than linear search." JM2CW. – Mike Robinson Jul 31 '17 at 18:29
  • @MikeRobinson Yes, I agree with what you said, and that's why I didn't say it is THE solution, just that it's worth considering. Especially without knowing exactly how the OP is writing the file, it's impossible to tell them the best solution, so this seemed like a suggestion worth making. Was that why the -1, btw? – Sach Jul 31 '17 at 22:09