Problem: I have a very large file that I need to parse line by line to extract 3 values from each line. Everything works, but parsing the whole file takes a long time. Is it possible to do this within seconds? It typically takes between 1 and 2 minutes.

Example file size is 148,208 KB.

I am using regex to parse through every line:

Here is my C# code:

private static void ReadTheLines(int max, Responder rp, string inputFile)
{
    List<int> rate = new List<int>();
    double counter = 1;
    try
    {
        using (var sr = new StreamReader(inputFile, Encoding.UTF8, true, 1024))
        {
            string line;
            Console.WriteLine("Reading....");
            while ((line = sr.ReadLine()) != null)
            {
                if (counter <= max)
                {
                    counter++;
                    rate = rp.GetRateLine(line);
                }
                else if (max == 0)
                {
                    counter++;
                    rate = rp.GetRateLine(line);
                }
            }
            rp.GetRate(rate);
            Console.ReadLine();
        }
    }
    catch (Exception e)
    {
        Console.WriteLine("The file could not be read:");
        Console.WriteLine(e.Message);
    }
}

Here is my regex:

public List<int> GetRateLine(string justALine)
{
    const string reg = @"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$";
    Match match = Regex.Match(justALine, reg,
                                RegexOptions.IgnoreCase);

    // Here we check the Match instance.
    if (match.Success)
    {
        // Finally, we get the Group value and display it.

        string theRate = match.Groups[3].Value;
        Ratestorage.Add(Convert.ToInt32(theRate));
    }
    else
    {
        Ratestorage.Add(0);
    }
    return Ratestorage;
}

Here is an example line to parse; a file usually has around 200,000 lines:

10.10.10.10 - - [27/Nov/2002:16:46:20 -0500] "GET /solr/ HTTP/1.1" 200 4926 789

Ria
Rayshawn

4 Answers

Use Memory Mapped Files (MMF) and the Task Parallel Library (TPL).

  1. Create a persisted MMF with multiple random-access views, each corresponding to a particular part of the file.
  2. Define a parsing method that takes a parameter such as IEnumerable<string>, which abstracts a set of unparsed lines.
  3. Create and start one TPL task per MMF view, with Parse(IEnumerable<string>) as the task action.
  4. Each worker task adds its parsed data to a shared queue of type BlockingCollection.
  5. Another task listens to the BlockingCollection (via GetConsumingEnumerable()) and processes the data already parsed by the worker tasks.

See the Pipelines pattern on MSDN.

Note that this solution requires .NET Framework 4 or later.
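
The steps above could be sketched roughly as follows. This is only an illustration under several assumptions: the file is non-empty and UTF-8 or ASCII, the names ParallelLogReader, ParseRate, FindLineAlignedBounds and SumRates are hypothetical, ParseRate is a stand-in for the real per-line parser, and each worker reads its view through a StreamReader rather than receiving an IEnumerable<string>. Results arrive in the queue in no particular order, so the consumer here only sums them.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

static class ParallelLogReader
{
    // Hypothetical stand-in for the real parser: takes the last
    // space-separated field as an int, or 0 if it is not a number.
    static int ParseRate(string line)
    {
        int value;
        string tail = line.Substring(line.LastIndexOf(' ') + 1);
        return int.TryParse(tail, out value) ? value : 0;
    }

    // Returns increasing offsets from 0 to length; every interior offset sits
    // just past a '\n', so no line is split across two views.
    static List<long> FindLineAlignedBounds(MemoryMappedFile mmf, long length, int parts)
    {
        var bounds = new List<long> { 0 };
        using (var accessor = mmf.CreateViewAccessor(0, length, MemoryMappedFileAccess.Read))
        {
            for (int i = 1; i < parts; i++)
            {
                long pos = Math.Max(length * i / parts, bounds[bounds.Count - 1]);
                while (pos < length && accessor.ReadByte(pos) != (byte)'\n') pos++;
                if (pos < length) pos++; // step past the '\n'
                if (pos > bounds[bounds.Count - 1] && pos < length) bounds.Add(pos);
            }
        }
        bounds.Add(length);
        return bounds;
    }

    // Parses the file with one task per view and returns the sum of all parsed values.
    public static long SumRates(string path, int parts)
    {
        long length = new FileInfo(path).Length; // assumed to be > 0
        var parsed = new BlockingCollection<int>();

        using (var mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        {
            List<long> bounds = FindLineAlignedBounds(mmf, length, parts);

            // Consumer: drains the shared queue while the workers are still parsing.
            Task<long> consumer = Task.Factory.StartNew(() =>
            {
                long sum = 0;
                foreach (int rate in parsed.GetConsumingEnumerable()) sum += rate;
                return sum;
            });

            // Workers: one task per view, each reading only its own byte range.
            Task[] workers = Enumerable.Range(0, bounds.Count - 1).Select(i =>
                Task.Factory.StartNew(() =>
                {
                    using (var view = mmf.CreateViewStream(
                        bounds[i], bounds[i + 1] - bounds[i], MemoryMappedFileAccess.Read))
                    using (var reader = new StreamReader(view, Encoding.UTF8))
                    {
                        string line;
                        while ((line = reader.ReadLine()) != null)
                            parsed.Add(ParseRate(line));
                    }
                })).ToArray();

            try { Task.WaitAll(workers); }
            finally { parsed.CompleteAdding(); } // lets the consumer finish

            return consumer.Result;
        }
    }
}
```

A call such as ParallelLogReader.SumRates(inputFile, Environment.ProcessorCount) would use one view per core. Whether this beats a single-threaded loop depends on how expensive the per-line parsing is, so it is worth measuring both.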

sll

Right now, you recreate your Regex each time you call GetRateLine, which occurs every time you read a line.

If you create a Regex instance once in advance, and then use the non-static Match method, you will save on regex compilation time, which could potentially give you a speed gain.

That being said, it will likely not take you from minutes to seconds...

Reed Copsey

At a brief glance there are a few things I would try...

First, increase your file stream buffer to at least 64 KB:

using (var sr = new StreamReader(inputFile, Encoding.UTF8, true, 65536))

Second, construct the Regex once instead of passing a pattern string inside the loop:

static readonly Regex rateExpression = new Regex(@"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$", RegexOptions.IgnoreCase);
//In GetRateLine() change to:
Match match = rateExpression.Match(justALine);

Third, use a single list instance by having Responder.GetRate() return a list or array:

// replace: 'rp.GetRate(rate)', with:
rate = rp.GetRate();

I would preallocate the list to a 'reasonable' limit:

List<int> rate = new List<int>(10000);

You might also consider changing your encoding from UTF-8 to ASCII if available and applicable to your specific needs.

Comments

Generally, if getting the parse time down really is a requirement, you will want to build a tokenizer and skip Regex entirely. Since your input format looks to be all ASCII and fairly simple, this should be easy enough to do, though it will probably be a little more brittle than a regex. In the end you will need to weigh the need for speed against the reliability and maintainability of the code.

If you need an example of by-hand parsing, look at the answer to this question.
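
For the sample line in the question, a by-hand parser might look like the sketch below. The names AccessLogTokenizer and TryParseTail are hypothetical, and the sketch assumes the three values of interest are always the last three space-separated fields. Unlike the regex, it does not check that the request is a GET, which is the kind of brittleness mentioned above.

```csharp
using System;
using System.Globalization;

static class AccessLogTokenizer
{
    // Reads the last three space-separated fields of a line as integers,
    // scanning backwards from the end instead of running a regex.
    // Returns false if there are fewer than three such fields or any is not all digits.
    public static bool TryParseTail(string line, out int first, out int second, out int third)
    {
        first = second = third = 0;
        if (string.IsNullOrEmpty(line)) return false;

        int p3 = line.LastIndexOf(' ');          // space before the last field
        if (p3 <= 0) return false;
        int p2 = line.LastIndexOf(' ', p3 - 1);  // space before the second-to-last field
        if (p2 <= 0) return false;
        int p1 = line.LastIndexOf(' ', p2 - 1);  // space before the third-to-last field
        if (p1 < 0) return false;

        // NumberStyles.None accepts digits only: no sign, no whitespace.
        return int.TryParse(line.Substring(p1 + 1, p2 - p1 - 1),
                            NumberStyles.None, CultureInfo.InvariantCulture, out first)
            && int.TryParse(line.Substring(p2 + 1, p3 - p2 - 1),
                            NumberStyles.None, CultureInfo.InvariantCulture, out second)
            && int.TryParse(line.Substring(p3 + 1),
                            NumberStyles.None, CultureInfo.InvariantCulture, out third);
    }
}
```

For the example line, the three values come out as 200, 4926 and 789; the last one corresponds to Groups[3] in the original regex.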

Community
csharptest.net

Instead of recreating a regex for each call to GetRateLine, create it in advance, passing the RegexOptions.Compiled option to the Regex(String,RegexOptions) constructor.
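
A minimal sketch of that change, using the pattern from the question unchanged. The class name RateParser and the method name GetRate are placeholders rather than the asker's actual types; RegexOptions.Compiled trades a slower one-time construction for faster matching, so it tends to pay off only when the same instance is reused for many lines.

```csharp
using System;
using System.Text.RegularExpressions;

static class RateParser
{
    // Built once for the lifetime of the process and reused for every line.
    static readonly Regex RateExpression = new Regex(
        @"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the third captured group as an int, or 0 when the line does not match,
    // mirroring what GetRateLine in the question stores per line.
    public static int GetRate(string line)
    {
        Match match = RateExpression.Match(line);
        return match.Success ? Convert.ToInt32(match.Groups[3].Value) : 0;
    }
}
```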

You may also want to try reading the entire file into memory, but I doubt that's your bottleneck. It shouldn't take a minute to read ~100 MB from disk.
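
One way to check where the time goes is to time the read on its own, with no parsing at all. The sketch below is an assumption about how one might measure it; ReadTiming and CountLines are hypothetical names. If this finishes in a second or two on the real file, the regex work is the likely bottleneck rather than the disk.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

static class ReadTiming
{
    // Reads every line without parsing it and returns the line count.
    // 'elapsed' receives the time spent on I/O and line splitting alone.
    public static long CountLines(string path, out TimeSpan elapsed)
    {
        Stopwatch watch = Stopwatch.StartNew();
        long count = File.ReadLines(path).LongCount();
        watch.Stop();
        elapsed = watch.Elapsed;
        return count;
    }
}
```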

ceyko