Is there a fast way to parse through a large file with regex?

Question

Problem: I have a very large file that I need to parse line by line to get 3 values from each line. Everything works, but it takes a long time to parse the whole file. Is it possible to do this within seconds? It typically takes between 1 and 2 minutes.

Example file size is 148,208 KB.

I am using regex to parse through every line:

Here is my C# code:

private static void ReadTheLines(int max, Responder rp, string inputFile)
{
    List<int> rate = new List<int>();
    double counter = 1;
    try
    {
        using (var sr = new StreamReader(inputFile, Encoding.UTF8, true, 1024))
        {
            string line;
            Console.WriteLine("Reading....");
            while ((line = sr.ReadLine()) != null)
            {
                if (counter <= max)
                {
                    counter++;
                    rate = rp.GetRateLine(line);
                }
                else if (max == 0)
                {
                    counter++;
                    rate = rp.GetRateLine(line);
                }
            }
            rp.GetRate(rate);
            Console.ReadLine();
        }
    }
    catch (Exception e)
    {
        Console.WriteLine("The file could not be read:");
        Console.WriteLine(e.Message);
    }
}

Here is my regex:

public List<int> GetRateLine(string justALine)
{
    const string reg = @"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$";
    Match match = Regex.Match(justALine, reg,
                                RegexOptions.IgnoreCase);

    // Here we check the Match instance.
    if (match.Success)
    {
        // Finally, we get the Group value and display it.

        string theRate = match.Groups[3].Value;
        Ratestorage.Add(Convert.ToInt32(theRate));
    }
    else
    {
        Ratestorage.Add(0);
    }
    return Ratestorage;
}

Here is an example line to parse; the file usually has around 200,000 lines:

10.10.10.10 - - [27/Nov/2002:16:46:20 -0500] "GET /solr/ HTTP/1.1" 200 4926 789

I'm not really an expert, but I don't see anything out of place. — Almo, Dec 10 '12 at 22:54
short answer: no, you can't parse every line of 150 mb of data in just a few seconds — Sam I am says Reinstate Monica, Dec 10 '12 at 22:55
Yeah, that's what I thought too, but was not sure if I was just not smart enough to think of some big O notation to make this faster. — Rayshawn, Dec 10 '12 at 22:56
Try compiling your regex first? `Regex myregex= new Regex(@"....", RegexOptions.Compiled); myregex.Match(etc);` — Michael Dunlap, Dec 10 '12 at 22:58
You are reading the file line by line, which is slow. You should load the text all at once if possible; if not, load a big chunk at a time, say 10 MB. In many cases, the I/O is the bottleneck. — urlreader, Dec 10 '12 at 23:04
I will try a few suggestion below and check the best one for performance. — Rayshawn, Dec 10 '12 at 23:05
@RayEatmon, 150MB is not a very very big file, yes it's big but you will find much larger files. — Ash Burlaczenko, Dec 10 '12 at 23:14

sll · Accepted Answer · 2012-12-10T23:02:47.060

Use memory-mapped files (MMF) and the Task Parallel Library (TPL):

Create a persisted MMF with multiple random-access views, each corresponding to a particular part of the file.
Define a parsing method that takes a parameter such as IEnumerable<string>, which abstracts a set of unparsed lines.
Create and start one TPL task per MMF view, with Parse(IEnumerable<string>) as the task action.
Each worker task adds its parsed data to a shared queue of type BlockingCollection.
Another task listens to the BlockingCollection (via GetConsumingEnumerable()) and processes the data already parsed by the worker tasks.

See Pipelines pattern on MSDN

Note that this solution requires .NET Framework 4 or later.
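The worker/consumer part of this idea can be sketched roughly as follows. This is only a simplified illustration, not the full approach: it leaves out the memory-mapped views and instead splits an in-memory list of lines into index ranges, one per worker task. The names ParsePipeline, ParseLine, and SumRates are hypothetical, and ParseLine is a stand-in for whatever per-line parser you actually use.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ParsePipeline
{
    // Hypothetical stand-in for the real per-line parser: reads the last
    // space-separated field as an integer, or returns 0 if it is not a number.
    public static int ParseLine(string line)
    {
        int value;
        int lastSpace = line.LastIndexOf(' ');
        return int.TryParse(line.Substring(lastSpace + 1), out value) ? value : 0;
    }

    // Parses each range of lines on its own task and sums the results on a
    // separate consumer task. workerCount is assumed to be at least 1.
    public static long SumRates(IList<string> lines, int workerCount)
    {
        // Bounded queue so that fast producers cannot grow memory without limit.
        using (var parsed = new BlockingCollection<int>(10000))
        {
            // Consumer: drains the queue until CompleteAdding is called.
            Task<long> consumer = Task.Factory.StartNew(() =>
            {
                long total = 0;
                foreach (int rate in parsed.GetConsumingEnumerable())
                    total += rate;
                return total;
            }, TaskCreationOptions.LongRunning);

            int chunkSize = (lines.Count + workerCount - 1) / workerCount;
            Task[] workers = Enumerable.Range(0, workerCount)
                .Select(i => Task.Factory.StartNew(() =>
                {
                    // Each worker only reads its own index range.
                    int start = i * chunkSize;
                    int end = Math.Min(start + chunkSize, lines.Count);
                    for (int j = start; j < end; j++)
                        parsed.Add(ParseLine(lines[j]));
                }))
                .ToArray();

            try
            {
                Task.WaitAll(workers);
            }
            finally
            {
                // Lets the consumer's foreach loop finish.
                parsed.CompleteAdding();
            }
            return consumer.Result;
        }
    }
}
```

Calling ParsePipeline.SumRates(lines, Environment.ProcessorCount) would spread the parsing over one task per core. Whether this beats a single-threaded loop depends on how expensive the per-line parsing is compared to the queue overhead, so it is worth measuring both.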

score 5 · Answer 2 · answered Dec 10 '12 at 22:56

Right now, you recreate your Regex each time you call GetRateLine, which occurs every time you read a line.

If you create a Regex instance once in advance, and then use the non-static Match method, you will save on regex compilation time, which could potentially give you a speed gain.

That being said, it will likely not take you from minutes to seconds...

score 2 · Answer 3 · edited May 23 '17 at 10:29

At a brief glance there are a few things I would try...

First, increase your file stream buffer to at least 64 KB:

using (var sr = new StreamReader(inputFile, Encoding.UTF8, true, 65536))

Second, construct the Regex once instead of using a string inside the loop:

static readonly Regex rateExpression = new Regex(@"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$", RegexOptions.IgnoreCase);
//In GetRateLine() change to:
Match match = rateExpression.Match(justALine);

Third, use a single list instance by having Responder.GetRate() return a list or array.

// replace: 'rp.GetRate(rate)', with:
rate = rp.GetRate();

I would preallocate the list to a 'reasonable' limit:

List<int> rate = new List<int>(10000);

You might also consider changing your encoding from UTF-8 to ASCII if available and applicable to your specific needs.

Comments

Generally, if getting the parse time down really is a requirement, you are going to want to build a tokenizer and skip Regex entirely. Since your input format looks to be all ASCII and fairly simple, this should be easy enough to do, but it will probably be a little more brittle than Regex. In the end you will need to weigh the need for speed against the reliability and maintainability of the code.

If you need an example of by-hand parsing, look at the answer to this question
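As a rough illustration of what a hand-written parser for the sample line could look like, here is a minimal sketch. It assumes every line ends with two space-separated integers and contains one bracketed timestamp, as in the example from the question. LogLineParser and TryParse are hypothetical names, and unlike the question's regex this version returns the whole bracket content, including the zone offset.

```csharp
using System;
using System.Globalization;

public static class LogLineParser
{
    // Hypothetical helper: extracts the bracketed timestamp and the last two
    // numeric fields of a log line without using Regex.
    public static bool TryParse(string line, out string timestamp,
                                out int secondLast, out int last)
    {
        timestamp = null;
        secondLast = 0;
        last = 0;
        if (string.IsNullOrEmpty(line)) return false;

        // Timestamp: the text between '[' and the next ']'.
        int open = line.IndexOf('[');
        int close = open < 0 ? -1 : line.IndexOf(']', open + 1);
        if (close < 0) return false;

        // The last two fields are found by scanning backwards for spaces.
        int lastSpace = line.LastIndexOf(' ');
        if (lastSpace <= close) return false;
        int prevSpace = line.LastIndexOf(' ', lastSpace - 1);
        if (prevSpace <= close) return false;

        // NumberStyles.None accepts digits only (no sign, no whitespace).
        if (!int.TryParse(line.Substring(lastSpace + 1), NumberStyles.None,
                          CultureInfo.InvariantCulture, out last))
            return false;
        if (!int.TryParse(line.Substring(prevSpace + 1, lastSpace - prevSpace - 1),
                          NumberStyles.None, CultureInfo.InvariantCulture,
                          out secondLast))
            return false;

        timestamp = line.Substring(open + 1, close - open - 1);
        return true;
    }
}
```

For the example line in the question, this yields the timestamp "27/Nov/2002:16:46:20 -0500" and the values 4926 and 789. Lines that do not follow the assumed layout make the method return false, which is where the brittleness mentioned above shows up.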

score 1 · Answer 4 · answered Dec 10 '12 at 23:03

Instead of recreating a regex for each call to GetRateLine, create it in advance, passing the RegexOptions.Compiled option to the Regex(String,RegexOptions) constructor.

You may also want to try reading in the entire file to memory, but I doubt that's your bottleneck. It shouldn't take a minute to read in ~100MB from disk.
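A minimal sketch of the first suggestion, reusing the pattern from the question. The class name RateParser and the method name ParseRate are hypothetical; the fallback of 0 for non-matching lines mirrors what the question's GetRateLine does.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RateParser
{
    // Built once for the whole run. RegexOptions.Compiled costs more at
    // startup but usually makes each individual match faster.
    private static readonly Regex RateExpression = new Regex(
        @"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the third captured group as an integer, or 0 when the line
    // does not match the pattern.
    public static int ParseRate(string line)
    {
        Match match = RateExpression.Match(line);
        return match.Success ? Convert.ToInt32(match.Groups[3].Value) : 0;
    }
}
```

The compilation cost is paid once when the static field is initialized, so it only pays off when the same expression is applied to many lines, as it is here.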
