
So the input for this program is a file of 447,000 lines; each line is data that can be split into a list of six space-separated fields (see the example input below).

It takes about 30 minutes to process in its current state. I previously had it as a non-parallel foreach loop, which also took a long time (I never got a figure for exactly how long), so I'm not actually sure whether making it run in parallel has saved much time. Basically, I can't tell which operation is making it take so long to process, or how to find out. I tried using Diagnostic Tools, but that wasn't very accurate; it said everything was 1 ms.

As far as I understand, every operation I'm doing is O(1), so if that's true, is it already running as fast as it can?

double bytesTransferred = 0;
double firstTimeValue = 0;
double lastTimeValue = 0;
double totalTimeValue = 0;
var dataAsList = data.ToList();
var justTheTimeDifferences = new ConcurrentBag<double>();
var senderHostsHash = new ConcurrentBag<double>();
var receiverHostsHash = new ConcurrentBag<double>();
var sourcePortsHash = new ConcurrentBag<double>();
var destinationPortsHash = new ConcurrentBag<double>();
int lastPosition = dataAsList.Count;

Parallel.ForEach(dataAsList, item =>
{
    var currentIndex = dataAsList.IndexOf(item);
    Console.WriteLine($"{currentIndex}/{lastPosition}");

    var itemAsList = item.Split(' ');
    if (dataAsList.IndexOf(item) == 0)
    {
        bytesTransferred += Convert.ToDouble(itemAsList[5]);
        return;
    }

    if (currentIndex == lastPosition - 1)
    {
        lastTimeValue = Convert.ToDouble(itemAsList[0]);
        totalTimeValue = lastTimeValue - firstTimeValue;
    }

    bytesTransferred += Convert.ToDouble(itemAsList[5]);
    var currentTime = Convert.ToDouble(itemAsList[0]);
    var lastEntry = dataAsList[currentIndex - 1];

    justTheTimeDifferences.Add(currentTime - Convert.ToDouble(lastEntry.Split(' ')[0]));
    senderHostsHash.Add(Convert.ToDouble(itemAsList[1]));
    receiverHostsHash.Add(Convert.ToDouble(itemAsList[2]));
    sourcePortsHash.Add(Convert.ToDouble(itemAsList[3]));
    destinationPortsHash.Add(Convert.ToDouble(itemAsList[4]));
});

An example input would be:

0.000000 1 2 23 2436 1  
0.010445 2 1 2436 23 2  
0.023775 1 2 23 2436 2  
0.026558 2 1 2436 23 1  
0.029002 3 4 3930 119 42  
0.032439 4 3 119 3930 15  
0.049618 1 2 23 2436 1 

To add some more information: I am running this on my desktop with a 4-core CPU at 4 GHz, and the data is being read off disk, which is an SSD; Task Manager shows 0% disk usage while it's running. I have also dropped the Console.WriteLine and am running it again now, then will do some Stopwatch benchmarks.
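For reference, a minimal Stopwatch pattern for timing a section of code, as suggested in the comments (the summing loop here is just a stand-in workload, not code from the question):

```csharp
using System;
using System.Diagnostics;

var sw = Stopwatch.StartNew();

// ... the section being measured goes here; a stand-in workload for now ...
long sum = 0;
for (int i = 0; i < 1_000_000; i++) sum += i;

sw.Stop();
Console.WriteLine($"Section took {sw.ElapsedMilliseconds} ms");
```

Wrapping each suspect section this way (the Split, the Convert calls, the IndexOf, the Console.WriteLine) quickly shows where the time actually goes.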

Solution:
It was the IndexOf lookup that was causing the huge run time; after changing it all to Parallel.For, it takes only about 1.25 seconds to process.
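For anyone finding this later, here is a sketch of what that rewrite can look like, using Parallel.For so the index comes for free (no O(n) IndexOf per item) and Interlocked.Add for the shared counter. The sample lines and field positions are taken from the question; this is my reconstruction, not the exact code used:

```csharp
using System;
using System.Collections.Concurrent;
using System.Globalization;
using System.Threading;
using System.Threading.Tasks;

// A few lines in the same format as the question's input
string[] lines =
{
    "0.000000 1 2 23 2436 1",
    "0.010445 2 1 2436 23 2",
    "0.023775 1 2 23 2436 2",
};

long bytesTransferred = 0; // long instead of double, so Interlocked.Add can be used
var timeDiffs = new ConcurrentBag<double>();

// Handle the first line separately, as the original loop did
bytesTransferred += long.Parse(lines[0].Split(' ')[5]);

// Parallel.For hands us the index directly, so no IndexOf lookup is needed
Parallel.For(1, lines.Length, i =>
{
    var fields = lines[i].Split(' ');
    var prev = lines[i - 1].Split(' ');

    Interlocked.Add(ref bytesTransferred, long.Parse(fields[5]));
    timeDiffs.Add(double.Parse(fields[0], CultureInfo.InvariantCulture)
                - double.Parse(prev[0], CultureInfo.InvariantCulture));
});

Console.WriteLine($"bytes={bytesTransferred}, time diffs={timeDiffs.Count}");
```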

Toxicable
    I don't know if this would make a huge difference, but you do an IndexOf; why not do a parallel for loop rather than foreach? Then you already have the index, and it would save a lookup – BugFinder Apr 27 '16 at 07:29
    I'm not sure but I believe that this should fit better on codereview.stackexchange.com – Matías Fidemraizer Apr 27 '16 at 07:29
  • @MatíasFidemraizer Oh right, I think you're correct. I haven't used it before; will have a look – Toxicable Apr 27 '16 at 07:30
    I will point out that your `bytesTransferred` variable is being updated by multiple threads in an uncontrolled manner, so it's unlikely it will contain a *correct* result at the end, but that's not what you're asking for here. – Damien_The_Unbeliever Apr 27 '16 at 07:31
    Try using a StopWatch class to do timing on sections of your code and try and see where it is slow. http://stackoverflow.com/a/969303/847962. Also, in my experience writing to the console can be slow so it might be an idea to remove this from in side your loop, or at least only show details every xx iterations. – jason.kaisersmith Apr 27 '16 at 07:32
  • No, this is not going to be a good fit for codereview.se, and to some extent for the same reason it's not a good question here: it's incomplete. The `Convert` class isn't the best choice for parsing strings, but even with all those calls to `Convert` and the call to `Console.WriteLine()` (also slow), there's nothing in the code you've posted that would explain taking 30 minutes to process a mere ~44K lines of text. Without more information, the most likely cause is slow I/O (e.g. reading over a really slow network). Please improve the question by providing a good [mcve], if you want help. – Peter Duniho Apr 27 '16 at 07:33
    @jason.kaisersmith - oh, I'd missed the `Console` line. That'll definitely be a killer (because it takes an exclusive lock to perform the write). – Damien_The_Unbeliever Apr 27 '16 at 07:33
  • When running the program use windows diagnostic tools to see what is your bottleneck. Is it I/O (disk) or CPU Core. If it is I/O then you need faster disks. If it is a single CPU core then you need to multi-thread, if it is all CPU Cores then you need a faster/more CPU! – jason.kaisersmith Apr 27 '16 at 07:34
  • @PeterDuniho Sorry, dropped off a 0 there.. it's 440k lines – Toxicable Apr 27 '16 at 07:34
  • Okay, well that's marginally more plausible, but there's still not enough information here. – Peter Duniho Apr 27 '16 at 07:35
  • @PeterDuniho Alright, I'll add a bit more to the post – Toxicable Apr 27 '16 at 07:36
  • What version of .NET are you using? ConcurrentBag has had some serious performance issues in the past. – Jakob Olsen Apr 27 '16 at 07:41
  • @JakobOlsen I'm using 4.5.2 for this – Toxicable Apr 27 '16 at 07:41
  • Even so. Can you please try to run your code without actually putting anything in the ConcurrentBags? Just comment out the last 5 lines of the foreach loop. Just to see what difference it makes. – Jakob Olsen Apr 27 '16 at 07:45
  • Also; You should do an `Interlocked.Add` on bytesTransferred if you want to get the correct number in that. – Jakob Olsen Apr 27 '16 at 07:47
  • @JakobOlsen Can I not add onto a double from multiple threads? I.e., is it not thread safe? – Toxicable Apr 27 '16 at 07:49
    `+=` isn't *atomic*. So your thread reads the variable, adds its own contribution and then overwrites the variable with the new value. If another thread wrote to the variable in the meantime, that update gets lost. – Damien_The_Unbeliever Apr 27 '16 at 07:57
  • @Damien_The_Unbeliever Ahh, that makes sense, thanks. I hadn't heard of atomic; I'll look it up now – Toxicable Apr 27 '16 at 07:58
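To illustrate the lost-update problem discussed in the comments above, here is a minimal demo contrasting a plain `+=` from multiple threads with Interlocked.Add (the workload and counter names are made up for the demo):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

const int N = 1_000_000;
long unsafeTotal = 0;
long safeTotal = 0;

Parallel.For(0, N, _ =>
{
    unsafeTotal += 1;                  // read-modify-write: concurrent updates can be lost
    Interlocked.Add(ref safeTotal, 1); // atomic: never loses an update
});

// unsafeTotal will very likely come out below N; safeTotal is always exactly N
Console.WriteLine($"unsafe={unsafeTotal}, safe={safeTotal} (expected {N})");
```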

1 Answer


As pointed out above, the IndexOf in the loop makes your algorithm O(n²), which is bad...

You should consider a simple Parallel.For over an array.

Regis Portalez