5

Overview of My Situation:

My task is to read strings from a file and reformat them into a more useful format. After reformatting the input, I have to write it to an output file.

Here is an example of what has to be done. Example of a file line:

ANO=2010;CPF=17834368168;YEARS=2010;2009;2008;2007;2006 <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2010</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2009</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2008</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2007</ANO><SITUACAODECLARACAO>Sua declaração consta como Pedido de Regularização(PR), na base de dados da Secretaria da Receita Federal do Brasil</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><RESTITUICAO><CPF>17834368168</CPF><ANO>2006</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>

Each line of this input file has two important pieces of information: the CPF, which is the document number I will use, and the XML (which represents the result of a query for that document against a database).

What I Must Achieve:

In this old format, each document has an XML containing the query results for all the years (2006 to 2010). After reformatting, each input line is converted to 5 output lines:

CPF=17834368168;YEARS=2010; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2010</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
CPF=17834368168;YEARS=2009; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2009</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
CPF=17834368168;YEARS=2008; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2008</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
CPF=17834368168;YEARS=2007; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2007</ANO><SITUACAODECLARACAO>Sua declaração consta como Pedido de Regularização(PR), na base de dados da Secretaria da Receita Federal do Brasil</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>
CPF=17834368168;YEARS=2006; <?xml version='1.0' encoding='ISO-8859-1'?><QUERY><RESTITUICAO><CPF>17834368168</CPF><ANO>2006</ANO><SITUACAODECLARACAO>Sua declaração não consta na base de dados da Receita Federal</SITUACAODECLARACAO><DATACONSULTA>05/01/2012</DATACONSULTA></RESTITUICAO><STATUS><RESULT>TRUE</RESULT><MESSAGE></MESSAGE></STATUS></QUERY>

One line per year, containing that year's information about the document. So basically, the output files are 5 times longer than the input files.

Performance Issue:

Each file has 400,000 lines, and I have 133 files to process.

At the moment, here is the flow of my app (a rough sketch in code follows the list):

  1. Open a file
  2. Read a line
  3. Parse it to the new format
  4. Write the line to the output file
  5. Go to 2 until there are no lines left
  6. Go to 1 until there are no files left
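
In code, the flow looks roughly like this (a simplified sketch, not my real code; ConvertLine stands in for the reformatting of step 3):

using System.Collections.Generic;
using System.IO;

class Converter
{
    static void Main()
    {
        foreach (string inputPath in Directory.EnumerateFiles("input"))   // 1. open a file (6. repeat per file)
        {
            using (var reader = new StreamReader(inputPath))
            using (var writer = new StreamWriter(inputPath + ".out"))
            {
                string line;
                while ((line = reader.ReadLine()) != null)                // 2./5. read lines until none are left
                {
                    foreach (string outputLine in ConvertLine(line))      // 3. parse to the new format
                        writer.WriteLine(outputLine);                     // 4. write to the output file
                }
            }
        }
    }

    // Placeholder for the real reformatting: one input line becomes five output lines.
    static IEnumerable<string> ConvertLine(string line)
    {
        yield return line;
    }
}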

Each input file is about 700 MB, and it is taking forever to read the files and write the converted versions. A 400 KB file takes ~30 seconds to process; at that rate, a single 700 MB file would take roughly 15 hours, and all 133 files close to three months.

Extra Information:

My machine runs on an Intel i5 processor with 8 GB of RAM.

I am not instantiating tons of objects, to avoid memory leaks, and I'm using a using statement when opening the input file.

What can I do to make it run faster?

Marcello Grechi Lins
    What happens if you skip "3. Parse it to the new format" and just write the line as-is to the new file? If the performance improves, you've found the issue. If it doesn't, post the code that reads and writes the data. – Andrei Feb 24 '12 at 20:08
  • What is the size of the buffer you're writing to the file? – Matthew Feb 24 '12 at 20:08
  • Please show the code that processes a file in 30 sec. It would also make sense to share a sample file so we can test against the same data set, but sharing the code you are using should be enough to spot the weak places. – sll Feb 24 '12 at 20:08

4 Answers

12

I don't know what your code looks like, but here's an example which on my box (admittedly with an SSD and an i7, but...) processes a 400K file in about 50ms.

I haven't even thought about optimizing it - I've written it in the cleanest way I could. (Note that it's all lazily evaluated; File.ReadLines and File.WriteAllLines take care of opening and closing the files.)

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

class Test
{
    public static void Main()
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        var lines = from line in File.ReadLines("input.txt")
                    let cpf = ParseCpf(line)
                    let xml = ParseXml(line)
                    from year in ParseYears(line)
                    select cpf + year + xml;

        File.WriteAllLines("output.txt", lines);
        stopwatch.Stop();
        Console.WriteLine("Completed in {0}ms", stopwatch.ElapsedMilliseconds);
    }

    // Returns the CPF, in the form "CPF=xxxxxx;"
    static string ParseCpf(string line)
    {
        int start = line.IndexOf("CPF=");
        int end = line.IndexOf(";", start);
        // TODO: Validation
        return line.Substring(start, end + 1 - start);
    }

    // Returns a sequence of year values, in the form "YEARS=2010;"
    static IEnumerable<string> ParseYears(string line)
    {
        // First year.
        int start = line.IndexOf("YEARS=") + 6;
        int end = line.IndexOf(" ", start);
        // TODO: Validation
        string years = line.Substring(start, end - start);
        foreach (string year in years.Split(';'))
        {
            yield return "YEARS=" + year + ";";
        }
    }

    // Returns all the XML from the leading space onwards
    static string ParseXml(string line)
    {
        int start = line.IndexOf(" <?xml");
        // TODO: Validation
        return line.Substring(start);
    }
}
Jon Skeet
  • This is exactly what I've done. I learnt how to use IEnumerable to stream information between the input file and the output one. This saved my life, for real. Processing was incredibly fast. – Marcello Grechi Lins Feb 24 '12 at 20:59
5

This looks like a fine candidate for pipelining.

The basic idea is to have 3 concurrent Tasks, one for each "stage" in the pipeline, communicating with each other through queues (BlockingCollection):

  1. The first task reads the input file line-by-line and puts read lines into the queue.
  2. The second task gets the lines from the queue, formats them and puts the result in another queue.
  3. The third task gets the formatted results from the second queue and writes them to the resulting file.

Ideally, task 1 should not wait for task 2 to finish before moving on to the next file.

You can even go berserk and put each individual file's pipeline into a separate parallel task, but that would trash your HDD's head so badly it'll probably hurt more than help. For an SSD, on the other hand, this might actually be justified - in any case measure before making a decision.
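
For illustration, the per-file variant might look like this minimal sketch (ProcessFile is a hypothetical wrapper around one file's pipeline, such as the one shown after the EDIT below):

using System.IO;
using System.Threading.Tasks;

class PerFilePipelines
{
    static void Main()
    {
        Parallel.ForEach(
            Directory.EnumerateFiles("input"),
            new ParallelOptions { MaxDegreeOfParallelism = 2 }, // keep this low on a spinning disk
            inputPath => ProcessFile(inputPath, inputPath + ".out"));
    }

    // Hypothetical helper: runs the read/format/write pipeline for a single file.
    static void ProcessFile(string inputPath, string outputPath)
    {
        // ... the three-task pipeline from the example below ...
    }
}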

--- EDIT ---

Using Jon Skeet's single-threaded implementation as a base, here is how a pipelined version would look (working example):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class Test {

    struct Queue2Element {
        public string CPF;
        public List<string> Years;
        public string XML;
    }

    public static void Main() {

        Stopwatch stopwatch = Stopwatch.StartNew();

        var queue1 = new BlockingCollection<string>();
        var task1 = new Task(
            () => {
                foreach (var line in File.ReadLines("input.txt"))
                    queue1.Add(line);
                queue1.CompleteAdding();
            }
        );

        var queue2 = new BlockingCollection<Queue2Element>();
        var task2 = new Task(
            () => {
                foreach (var line in queue1.GetConsumingEnumerable())
                    queue2.Add(
                        new Queue2Element {
                            CPF = ParseCpf(line),
                            XML = ParseXml(line),
                            Years = ParseYears(line).ToList()
                        }
                    );
                queue2.CompleteAdding();
            }
        );

        var task3 = new Task(
            () => {
                var lines = 
                    from element in queue2.GetConsumingEnumerable()
                    from year in element.Years
                    select element.CPF + year + element.XML;
                File.WriteAllLines("output.txt", lines);
            }
        );

        task1.Start();
        task2.Start();
        task3.Start();
        Task.WaitAll(task1, task2, task3);

        stopwatch.Stop();
        Console.WriteLine("Completed in {0}ms", stopwatch.ElapsedMilliseconds);

    }

    // Returns the CPF, in the form "CPF=xxxxxx;"
    static string ParseCpf(string line) {
        int start = line.IndexOf("CPF=");
        int end = line.IndexOf(";", start);
        // TODO: Validation
        return line.Substring(start, end + 1 - start);
    }

    // Returns a sequence of year values, in the form "YEARS=2010;"
    static IEnumerable<string> ParseYears(string line) {
        // First year.
        int start = line.IndexOf("YEARS=") + 6;
        int end = line.IndexOf(" ", start);
        // TODO: Validation
        string years = line.Substring(start, end - start);
        foreach (string year in years.Split(';')) {
            yield return "YEARS=" + year + ";";
        }
    }

    // Returns all the XML from the leading space onwards
    static string ParseXml(string line) {
        int start = line.IndexOf(" <?xml");
        // TODO: Validation
        return line.Substring(start);
    }

}

As it turns out, the parallel version above is only slightly faster than the serial version. Apparently, the task is more I/O-bound than anything else, so pipelining does not help much. If you increase the amount of processing (e.g. add robust validation) this might change the situation in favor of parallelism, but for now you are probably best off concentrating on serial improvements (as Jon Skeet himself noted, the code is not as fast as it could be).

(Also, I tested with cached files - I wonder if there is a way to clear the Windows file cache and see whether a hardware I/O queue depth of 2 would let the hard disk optimize head movements compared to the queue depth of 1 of the serial version.)

Branko Dimitrijevic
  • Interesting idea; it would be interesting to see a working example and whether it makes a performance difference – sll Feb 24 '12 at 20:53
  • Great, thanks! +1. Also thanks for the pipeline pattern MSDN reference – sll Feb 25 '12 at 13:21
  • Hey, I just noticed you are from Novi Sad; just wondering, did you perhaps participate in the Hard&&Soft competition in Suceava? :) I remember a good team from Novi Sad (Microchip PIC controllers + desktop software) – sll Feb 25 '12 at 13:30
  • @sll Alas, no. It's been a long time since I was a student ;) – Branko Dimitrijevic Feb 25 '12 at 14:15
  • :) The competition I mentioned was back in 2003-2005, anyway :) – sll Feb 25 '12 at 16:57
2

It is definitely not an I/O problem - check your processing, and use a profiler to find out what is holding all the timeslices, and where.

Show your processing code; maybe you are using some inefficient string operations...

Oleg Dok
1

There are a few basic things you can do right away...

  1. Run multiple threads so you process multiple files at the same time.
  2. Use StringBuilder instead of string concatenation.
  3. If you use XmlDocument to parse the XML, replace it with XmlTextReader and XmlTextWriter.
  4. Don't convert strings to numbers and back to strings if you don't really need to.
  5. Remove every unnecessary string operation. For example, don't call str.Contains just to call str.IndexOf on the next line. Instead, call str.IndexOf once, store the result in a local variable, and check whether it is >= 0 (see the sketch after this list).
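
A small sketch of points 2 and 5 (illustrative names, not the asker's code):

using System.Collections.Generic;
using System.Text;

class StringTips
{
    // Point 2: build a multi-part string with one StringBuilder instead of += in a loop.
    static string JoinYears(IEnumerable<string> years)
    {
        var sb = new StringBuilder();
        foreach (string year in years)
            sb.Append("YEARS=").Append(year).Append(';');
        return sb.ToString();
    }

    // Point 5: call IndexOf once, keep the result, and test it - no separate Contains call.
    static string ExtractCpf(string line)
    {
        int start = line.IndexOf("CPF=");
        if (start < 0)
            return null;
        int end = line.IndexOf(';', start);
        return end < 0 ? null : line.Substring(start, end + 1 - start);
    }
}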

Run different parts of your algorithm by themselves and measure the time. Start by reading the whole file line by line and measure that. Write the same lines back to a new file and measure that. Split the prefix information from the XML and measure that. Then parse the XML... This way you will know where the bottleneck is and can focus on that part.
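
For example, a minimal sketch of such staged measurement (file names here are placeholders):

using System;
using System.Diagnostics;
using System.IO;

class StageTimer
{
    static void Main()
    {
        // Stage 1: read only.
        var sw = Stopwatch.StartNew();
        int count = 0;
        foreach (string line in File.ReadLines("input.txt"))
            count++;
        Console.WriteLine("Read only: {0} lines in {1} ms", count, sw.ElapsedMilliseconds);

        // Stage 2: read and write the lines back unchanged.
        sw.Restart();
        File.WriteAllLines("copy.txt", File.ReadLines("input.txt"));
        Console.WriteLine("Read + write: {0} ms", sw.ElapsedMilliseconds);

        // Further stages: add the prefix/XML split, then the year expansion, timing each step.
    }
}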

Aleks
  • I don't think you really need to be terribly smart here - see my answer for an example which runs very quickly without worrying too much about performance - just concentrating on clarity instead. – Jon Skeet Feb 24 '12 at 20:25
  • The first part of this answer (bullet points 1-5) focuses on the wrong areas. Jon Skeet knew immediately that such simple processing cannot take so long if programmed properly. But not everyone has Jon Skeet's skills. So the second part of this answer, though generic advice, is actually good for a programmer who is clueless about where they killed their processing speed. It is a more detailed version of the "do your homework" of measuring performance: do it step by step. – Stéphane Gourichon Jan 13 '16 at 20:59