
I have a testFile.txt file (around 400 MB). It contains OHLC stock prices with a timeframe of 1 minute.

The structure of each line is "stock name, date, time, price open, price high, price low, price close, volume", for example: "OTHE,20010102,230100,1.9007,1.9007,1.9007,1.9007,4".

My major problem: this code is very slow. I measured the speed and found that the critical part is the `double.Parse` calls. Is it possible to change the code to increase performance? My C# parsing code:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using System.Globalization;

namespace ConsoleApplication3
{
    class Program
    {
        static void Main(string[] args)
        {
            string sourceDir = "D:\\testFile.txt",
                   outDir = "D:\\result.txt";
            Thread.CurrentThread.CurrentCulture = System.Globalization.CultureInfo.InvariantCulture;

            using (StreamReader sr = new StreamReader(sourceDir))
            {
                int divider = 5;
                string line = sr.ReadLine();
                StreamWriter sw = new StreamWriter(outDir);

                List<string> listLine = new List<string>();
                List<double> listOpen = new List<double>();
                List<double> listHigh = new List<double>();
                List<double> listLow = new List<double>();
                List<double> listClose = new List<double>();
                List<double> listVolume = new List<double>();
                DateTime dateTimeOut = new DateTime();
                string formatDate = "yyyyMMddHHmmss";
                string newLine = "";
                double priceOpen, priceHigh, priceLow, priceClose, volume;

                //read first line, but don't write it
                line = sr.ReadLine();

                while (line != null)
                {
                    listLine = line.Split(',').ToList();
                    dateTimeOut = DateTime.ParseExact(listLine[1] + listLine[2], formatDate, null);

                    double.TryParse(listLine[3], out priceOpen);
                    double.TryParse(listLine[4], out priceHigh);
                    double.TryParse(listLine[5], out priceLow);
                    double.TryParse(listLine[6], out priceClose);
                    double.TryParse(listLine[7], out volume);

                    listOpen.Add(priceOpen);
                    listHigh.Add(priceHigh);
                    listLow.Add(priceLow);
                    listClose.Add(priceClose);
                    listVolume.Add(volume);

                    if (dateTimeOut.Minute % divider == 0)
                    {
                        newLine = dateTimeOut + "," + listOpen[0] + "," + listHigh.Max() + "," + listLow.Min() + "," + listClose[4] + "," + listVolume.Max();
                        sw.WriteLine(newLine);
                    }
                    line = sr.ReadLine();
                }
                sr.Close();
            }
        }
    }
}

Update: the problem is here:

                if (dateTimeOut.Minute % divider == 0)
                {
                    newLine = "";
                    sw.WriteLine(newLine);
                }
  • Are you displaying or calculating based on the data in real time? – Keith Payne Feb 05 '14 at 14:33
  • It's pretty easy - use a CSV parser. – Pierre-Luc Pineault Feb 05 '14 at 14:35
  • How many lines in that file? – Kris Krause Feb 05 '14 at 14:35
  • Can you define "very slow" please? How did you measure the speed, and under what conditions? I note that you're ignoring the return value of `double.TryParse`, by the way - *and* you're using `double` for prices rather than `decimal`, which is more worrying to me than performance... – Jon Skeet Feb 05 '14 at 14:36
  • No I need just to convert it. I don't need it in real time. CSV parser? – user3245303 Feb 05 '14 at 14:36
  • Oh, and wouldn't it be more sensible to have *one* list, with a type which contains the open/high/low/close/volume values for a single line? – Jon Skeet Feb 05 '14 at 14:37
  • ((Unrelated, but you are silently replacing all badly formatted numbers with zero. So "FRED" will get turned into "0.0")) – Matthew Watson Feb 05 '14 at 14:37
  • `line.Split(',').ToList()` doesn't make sense to me in performance-critical code. `string.Split` returns an array, which you can already use to access the fields by index. –  Feb 05 '14 at 14:38
  • @JonSkeet I used a simple Stopwatch. Very slow means that for the 400 MB source file the program converts all the data in 2 hours. – user3245303 Feb 05 '14 at 14:38
  • You can speed up double.Parse() by about 10% according to this post: http://stackoverflow.com/questions/8457934/faster-alternative-to-convert-todouble – Matthew Watson Feb 05 '14 at 14:39
  • With line.Split - all ok) – user3245303 Feb 05 '14 at 14:40
  • @MatthewWatson: 10% isn't nothing) In any case - thanks – user3245303 Feb 05 '14 at 14:41
  • 400MB source file is how many lines to parse? – Roy Dictus Feb 05 '14 at 14:41
  • See [this answer](http://stackoverflow.com/a/2081425/2316200) for a library suggestion. Also if all you need is a review, you can post it on [Code Review](http://codereview.stackexchange.com/) instead. – Pierre-Luc Pineault Feb 05 '14 at 14:41
  • Given your code, I can't see why you need to convert to double, because you are not using the resulting doubles as doubles, but merely converting them back to strings to write to your output file. Removing this double conversion (keeping everything as strings) would avoid the parsing entirely. – Polyfun Feb 05 '14 at 14:43
  • @RoyDictus 11291300 lines – user3245303 Feb 05 '14 at 14:44
  • @ShellShock you aren't right – user3245303 Feb 05 '14 at 14:44
  • I get it now. List.Min and Max. – Polyfun Feb 05 '14 at 14:51
  • @user3245303: Running under the debugger, or not? That sounds much slower than I'd expect, and I *suspect* your methodology is flawed given that estimate. Is that sample line a realistic sample line? (So we can perform similar benchmarks.) If you could provide a short but complete program you're using for benchmarking, that would be very helpful. – Jon Skeet Feb 05 '14 at 14:52
  • I see you updated the code, but where is the `Stopwatch`? – Matthew Watson Feb 05 '14 at 14:58
  • Thanks for all the advice! – user3245303 Feb 05 '14 at 15:07
  • 11.3 million lines over 2 hours is approximately 1570 lines per second. That's not fantastic but also not too shabby, taking into account that you have to read the file and that you're also parsing dates. And you can indeed use a CSV parser library such as http://www.filehelpers.com. – Roy Dictus Feb 05 '14 at 15:30

4 Answers


I do not believe that `double.Parse()` is the bottleneck.

I wrote a test program (shown below). The release build parses one hundred million doubles in less than twenty seconds:

using System;
using System.Diagnostics;

namespace Demo
{
    internal class Program
    {   
        private void run()
        {
            string s = "12345.6789";
            double result;
            Stopwatch sw = Stopwatch.StartNew();

            // Parse the same string one hundred million times.
            for (int i = 0; i < 100000000; ++i)
                double.TryParse(s, out result);

            Console.WriteLine("Took " + sw.Elapsed);
        }

        private static void Main()
        {
            new Program().run();
        }
    }
}
Matthew Watson

You are using the LINQ `Max()` and `Min()` functions, which iterate through the whole collection. Since they are called thousands of times in a loop, and the collections contain millions of elements, this is very inefficient. Instead, store the min and max values outside the loop and update them on every iteration:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using System.Globalization;

namespace ConsoleApplication3
{
    class Program
    {
        static void Main(string[] args)
        {
            string sourceDir = "D:\\testFile.txt",
                   outDir = "D:\\result.txt";
            Thread.CurrentThread.CurrentCulture = System.Globalization.CultureInfo.InvariantCulture;

            using (StreamReader sr = new StreamReader(sourceDir))
            {
                int divider = 5;
                string line = sr.ReadLine();
                StreamWriter sw = new StreamWriter(outDir);

                List<string> listLine = new List<string>();
                List<double> listOpen = new List<double>();
                List<double> listHigh = new List<double>();
                List<double> listLow = new List<double>();
                List<double> listClose = new List<double>();
                List<double> listVolume = new List<double>();
                DateTime dateTimeOut = new DateTime();
                string formatDate = "yyyyMMddHHmmss";
                string newLine = "";
                double priceOpen, priceHigh, priceLow, priceClose, volume;

                //read first line, but don't write it
                line = sr.ReadLine();

                double highMax = double.MinValue;
                double lowMin = double.MaxValue;
                double volumeMax = double.MinValue;

                while (line != null)
                {
                    listLine = line.Split(',').ToList();
                    dateTimeOut = DateTime.ParseExact(listLine[1] + listLine[2], formatDate, null);

                    double.TryParse(listLine[3], out priceOpen);
                    double.TryParse(listLine[4], out priceHigh);
                    double.TryParse(listLine[5], out priceLow);
                    double.TryParse(listLine[6], out priceClose);
                    double.TryParse(listLine[7], out volume);

                    listOpen.Add(priceOpen);
                    listHigh.Add(priceHigh);
                    listLow.Add(priceLow);
                    listClose.Add(priceClose);
                    listVolume.Add(volume);

                    /* Accumulative max/min calculation, replacing LINQ Max()/Min() */
                    if (highMax < priceHigh)
                    {
                        highMax = priceHigh;
                    }

                    if (lowMin > priceLow)
                    {
                        lowMin = priceLow;
                    }

                    if (volumeMax < volume)
                    {
                        volumeMax = volume;
                    }

                    if (dateTimeOut.Minute % divider == 0)
                    {
                        newLine = dateTimeOut + "," + listOpen[0] + "," + highMax + "," + lowMin + "," + listClose[4] + "," + volumeMax;
                        sw.WriteLine(newLine);
                        // Reset highMax/lowMin/volumeMax here (and clear the lists)
                        // if each output line should cover only the current window.
                    }
                    line = sr.ReadLine();
                }
                sr.Close();
            }
        }
    }
}

In this case you don't even need to add the parsed values to the lists (if you have no other use for them), so you can remove the lists completely, further saving memory and time.
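For illustration, here is a minimal list-free sketch of that idea. The per-window reset and the plain open/close variables are an assumption about the intended 5-minute bar semantics (the original code relies on list indexing instead):

using System;
using System.Globalization;
using System.IO;

namespace ConsoleApplication3
{
    class Program
    {
        static void Main(string[] args)
        {
            int divider = 5;
            string formatDate = "yyyyMMddHHmmss";

            using (StreamReader sr = new StreamReader("D:\\testFile.txt"))
            using (StreamWriter sw = new StreamWriter("D:\\result.txt"))
            {
                sr.ReadLine(); // skip the first line, as in the original

                // Per-window accumulators instead of lists.
                double open = 0, close = 0;
                double highMax = double.MinValue;
                double lowMin = double.MaxValue;
                double volumeMax = double.MinValue;
                bool windowStarted = false;

                string line;
                while ((line = sr.ReadLine()) != null)
                {
                    string[] parts = line.Split(',');
                    DateTime dt = DateTime.ParseExact(parts[1] + parts[2],
                        formatDate, CultureInfo.InvariantCulture);

                    double priceOpen, priceHigh, priceLow, priceClose, volume;
                    double.TryParse(parts[3], NumberStyles.Float, CultureInfo.InvariantCulture, out priceOpen);
                    double.TryParse(parts[4], NumberStyles.Float, CultureInfo.InvariantCulture, out priceHigh);
                    double.TryParse(parts[5], NumberStyles.Float, CultureInfo.InvariantCulture, out priceLow);
                    double.TryParse(parts[6], NumberStyles.Float, CultureInfo.InvariantCulture, out priceClose);
                    double.TryParse(parts[7], NumberStyles.Float, CultureInfo.InvariantCulture, out volume);

                    if (!windowStarted) { open = priceOpen; windowStarted = true; }
                    if (priceHigh > highMax) highMax = priceHigh;
                    if (priceLow < lowMin) lowMin = priceLow;
                    if (volume > volumeMax) volumeMax = volume;
                    close = priceClose; // last close seen in this window

                    if (dt.Minute % divider == 0)
                    {
                        sw.WriteLine(dt + "," + open + "," + highMax + "," +
                                     lowMin + "," + close + "," + volumeMax);

                        // Start a fresh window.
                        windowStarted = false;
                        highMax = double.MinValue;
                        lowMin = double.MaxValue;
                        volumeMax = double.MinValue;
                    }
                }
            }
        }
    }
}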

Sasha

double.Parse is very slow because there are many ways to represent double values: 1000; 1000.1; 1e3; 1.353e+34; -23.24e-123; etc. If you have only one predefined format (and it is likely you do), say 10394.324 with no exponential-form support, then you can implement a much more efficient custom parser: read character by character from the stream, check whether each one is a space, digit, or dot, and accumulate or emit the result accordingly. It is relatively simple to implement and will give much better performance. I suppose a 400 MB file can be parsed in less than 10 seconds, if your hard drive allows reading that fast =).
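To make this concrete, here is a minimal sketch of such a parser for plain strings of the form digits[.digits] (ParseSimpleDouble is just an illustrative name; sign, exponent, and separator handling are deliberately omitted, which is exactly what makes it fast):

// Fixed-format parser sketch: accepts only "digits[.digits]", e.g. "10394.324".
// No validation, sign, exponent, or grouping support is assumed.
static double ParseSimpleDouble(string s)
{
    long intPart = 0;
    int i = 0;

    // Integer part: accumulate base-10 digits.
    while (i < s.Length && s[i] != '.')
        intPart = intPart * 10 + (s[i++] - '0');

    // Fractional part: accumulate digits as an integer, divide once at the end.
    long frac = 0, div = 1;
    if (i < s.Length && s[i] == '.')
    {
        i++;
        while (i < s.Length)
        {
            frac = frac * 10 + (s[i++] - '0');
            div *= 10;
        }
    }
    return intPart + (double)frac / div;
}

It would then replace the TryParse calls, e.g. priceOpen = ParseSimpleDouble(listLine[3]); - but since it performs no validation at all, it only makes sense if the input format is guaranteed.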

Also, I wouldn't recommend using string.Split with such a big number of strings - it will consume all your memory and cause frequent garbage collections, which will probably slow down your code even more than double.Parse. Instead, read the stream byte by byte.

One more point: ToList() creates a new list and copies (references to) all elements of the source collection into it. That is also a significant, unneeded time- and memory-consuming operation.
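For example, the array returned by string.Split can be indexed directly (variable names taken from the question's code):

string[] listLine = line.Split(','); // an array already supports indexing; no ToList() copy
double.TryParse(listLine[3], out priceOpen);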

And finally, string concatenation shouldn't be done using the '+' operator.

So I think your problem may be in these lines:

line.Split(',').ToList();
newLine = dateTimeOut + "," + listOpen[0] + "," + listHigh.Max() + "," + listLow.Min() + "," + listClose[4] + "," + listVolume.Max();

If running your program consumes all the machine's memory, then it's 99% certain that the problem is here.

Try replacing the second line with a few consecutive calls to sw.Write() to avoid the '+' operators, and implement a streaming double parser which won't require string splitting.
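A sketch of those consecutive writes, using the variable names from the answer above (whether this actually helps is debatable - see the comments below):

// Several consecutive Write calls instead of building one concatenated string.
sw.Write(dateTimeOut);
sw.Write(',');
sw.Write(listOpen[0]);
sw.Write(',');
sw.Write(highMax);
sw.Write(',');
sw.Write(lowMin);
sw.Write(',');
sw.Write(listClose[4]);
sw.Write(',');
sw.Write(volumeMax);
sw.WriteLine();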

Sasha
  • But it's not "very slow" - you can parse 100,000,000 strings of the form `"12345.6789"` in less than 20 seconds, which isn't slow. – Matthew Watson Feb 05 '14 at 14:51
  • It sounds interesting. I will try. But it will give only 10%: http://stackoverflow.com/questions/8457934/faster-alternative-to-convert-todouble – user3245303 Feb 05 '14 at 14:52
  • I don't agree that you can't get more than 10%. The link you provided uses the same double.Parse, just with specific format info. A good implementation can speed up parsing several times over. But I agree with @MatthewWatson that your problem isn't in double.Parse. His code looks convincing. – Sasha Feb 05 '14 at 14:59
  • Using string concatenation when the number of strings is fixed at compile time is not a problem at all. No intermediate strings are created, as `string.Concat` can easily compute the size of the final string before needing to concatenate any of the values. – Servy Feb 05 '14 at 15:37

The problem was with List<>. I made a stupid mistake - I forgot about List.Clear(). ))) So, thanks to all, especially to Oleksandr and Matthew.
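In other words, the fix was to empty the lists after writing each bar, so that Max() and Min() stop scanning the entire file's history. A sketch of the fix, assuming each 5-minute window contains exactly five 1-minute bars:

if (dateTimeOut.Minute % divider == 0)
{
    newLine = dateTimeOut + "," + listOpen[0] + "," + listHigh.Max() + ","
            + listLow.Min() + "," + listClose[4] + "," + listVolume.Max();
    sw.WriteLine(newLine);

    // The missing piece: start the next window with empty lists.
    listOpen.Clear();
    listHigh.Clear();
    listLow.Clear();
    listClose.Clear();
    listVolume.Clear();
}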