I have to develop a utility that accepts the path of a folder containing multiple log/text files of around 200 MB each, then traverses all of the files and picks four elements from the lines where they occur.

I have tried multiple solutions. All of them work perfectly fine for smaller files, but when I load a bigger file the Windows Form either hangs or throws an "OutOfMemory Exception". Please help.

Solution 1:

string textFile;
string re1 = "((?:2|1)\\d{3}(?:-|\\/)(?:(?:0[1-9])|(?:1[0-2]))(?:-|\\/)(?:(?:0[1-9])|(?:[1-2][0-9])|(?:3[0-1]))(?:T|\\s)(?:(?:[0-1][0-9])|(?:2[0-3])):(?:[0-5][0-9]):(?:[0-5][0-9]))";
FolderBrowserDialog fbd = new FolderBrowserDialog();
DialogResult result = fbd.ShowDialog();
if (!string.IsNullOrWhiteSpace(fbd.SelectedPath))
{
    string[] files = Directory.GetFiles(fbd.SelectedPath);

    System.Windows.Forms.MessageBox.Show("Files found: " + files.Length.ToString(), "Message");
    foreach (string fileName in files)
    {
        textFile = File.ReadAllText(fileName); 

        MatchCollection mc = Regex.Matches(textFile, re1);
        foreach (Match m in mc)
        {
            string a = m.ToString();
            Path.Text += a; //Temporary, Just to check the output
            Path.Text += Environment.NewLine;
        }

    }

}

Solution 2:

string re1 = "((?:2|1)\\d{3}(?:-|\\/)(?:(?:0[1-9])|(?:1[0-2]))(?:-|\\/)(?:(?:0[1-9])|(?:[1-2][0-9])|(?:3[0-1]))(?:T|\\s)(?:(?:[0-1][0-9])|(?:2[0-3])):(?:[0-5][0-9]):(?:[0-5][0-9]))";
FolderBrowserDialog fbd = new FolderBrowserDialog();
DialogResult result = fbd.ShowDialog();
foreach (string file in System.IO.Directory.GetFiles(fbd.SelectedPath))
{
    const Int32 BufferSize = 512;
    using (var fileStream = File.OpenRead(file))
    using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize))
    {
        String line;
        while ((line = streamReader.ReadLine()) != null)
        {
            MatchCollection mc = Regex.Matches(line, re1);
            foreach (Match m in mc)
            {
                string a = m.ToString();
                Path.Text += a; //Temporary, Just to check the output
                Path.Text += Environment.NewLine;
            }
        }
    }
}

Solution 3:

string re1 = "((?:2|1)\\d{3}(?:-|\\/)(?:(?:0[1-9])|(?:1[0-2]))(?:-|\\/)(?:(?:0[1-9])|(?:[1-2][0-9])|(?:3[0-1]))(?:T|\\s)(?:(?:[0-1][0-9])|(?:2[0-3])):(?:[0-5][0-9]):(?:[0-5][0-9]))";
FolderBrowserDialog fbd = new FolderBrowserDialog();
DialogResult result = fbd.ShowDialog();
using (StreamReader r = new StreamReader(file))
{

    try
    {
        string line = String.Empty;

        while (!r.EndOfStream)
        {
            line = r.ReadLine();
            MatchCollection mc = Regex.Matches(line, re1);
            foreach (Match m in mc)
            {
                string a = m.ToString();
                Path.Text += a; //Temporary, Just to check the output
                Path.Text += Environment.NewLine;
            }

        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}
Thomas Ayoub
Shahzad
  • Which Windows version (Vista/7/8/10), which architecture (32/64-bit), and how much RAM? – Michał M Jun 28 '16 at 10:43
  • Tested on: Windows 10 64-bit, 4 GB RAM, Core i5 – Shahzad Jun 28 '16 at 10:50
  • Maybe you will find a solution in this article: http://stackoverflow.com/questions/14186256/net-out-of-memory-exception-used-1-3gb-but-have-16gb-installed. – Michał M Jun 28 '16 at 11:01
  • @Michal Tried that as well, but the result is the same – Shahzad Jun 28 '16 at 11:16
  • .NET Framework has a hard limit of 2 GB for object size, minus the overhead consumed by the framework itself. Is there any possibility that you could split this large log file into a few smaller ones? – Michał M Jun 28 '16 at 11:35
  • I have experimented. It processes 2065 lines in 22 seconds, 3500 lines in 1:31:95, and 5000 lines in 3:38:80, which means the program works but is really slow. I don't think 100-200 MB is so big that it needs to be split? – Shahzad Jun 28 '16 at 11:59
  • Depends. For a file in general it is not very big, but for a TXT file it is pretty large. I don't want to play the big expert here; you will have to dig deeper if you want to solve it without splitting. Sorry for not being able to help solve this. – Michał M Jun 28 '16 at 12:14
  • If you want your code to run faster and you don't need to check whether the date is *somewhat* valid, you can change the regex to `\d{4}(?:-\d{2}){2}T\d{2}(?::\d{2}){2}`, and use a string builder instead of string concatenation – Thomas Ayoub Jun 28 '16 at 13:20
  • I think the code snippets that read the file line by line are OK. String concatenation is a problem when the string gets bigger and bigger. What exactly is `Path.Text`? As for the regex, declare it as private static readonly and use the RegexOptions.Compiled flag. As for the pattern itself, replace unnecessary non-capturing groups that match a single symbol with character classes. When working with large files, always read line by line, and mind what you extract and how many extracted elements you plan to get (more than 200,000 elements can crash the app). – Wiktor Stribiżew Jun 28 '16 at 18:38
  • The situation is much better now. It reads a 10,000-line file in 5.98 seconds; previously it took 8:16:29. I just changed some regex patterns. Is 5.98 seconds a good time for 10,000 lines? I actually have to process more than 5 lakh (500,000) lines – Shahzad Jun 28 '16 at 22:20
  • (1) The question says to extract 4 items, but that is not done. Are the 4 items relevant to the question you ask? (2) What is `Path.Text`? How big does it get? All those `Path.Text += ...` will consume memory and do a lot of text copying. (3) The regular expression is very complex. Choices between two characters are better done as classes, so change `(?:2|1)`, `(?:-|\/)` and `(?:T|\s)` to `[21]`, `[-/]` and `[T\s]`. – AdrianHHH Jun 29 '16 at 08:33
  • Thank you, everyone. Now I can process 6 lakh (600,000) records in 1 minute and 27 seconds, but I would like to speed it up further. Is that a good time? One of my friends says it should take 30 seconds at most for 6 lakh records. – Shahzad Jun 29 '16 at 10:54

2 Answers

A few things should be taken care of:

  1. Don't append to a string with Path.Text += .... I am assuming that is just test code that will be thrown away.
  2. You can simply use File.ReadLines; in your case it makes no practical difference to file-reading speed.
  3. You should compile your Regex.
  4. You can try to simplify your regex.
  5. You can add simple string-based pre-checks before doing regex matches.

Below is sample code that implements the above guidelines:

string re1 = "((?:2|1)\\d{3}(?:-|\\/)(?:(?:0[1-9])|(?:1[0-2]))(?:-|\\/)(?:(?:0[1-9])|(?:[1-2][0-9])|(?:3[0-1]))(?:T|\\s)(?:(?:[0-1][0-9])|(?:2[0-3])):(?:[0-5][0-9]):(?:[0-5][0-9]))";
var buf = new List<string>();
var re2 = new Regex(re1, RegexOptions.Compiled);

FolderBrowserDialog fbd = new FolderBrowserDialog();
DialogResult result = fbd.ShowDialog();
foreach (string file in System.IO.Directory.GetFiles(fbd.SelectedPath)) {

    foreach (var line in File.ReadLines(file)) {
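        // Cheap pre-check: a timestamp needs a date separator ('-' or '/', as the regex allows both) followed by a ':'
        int indx = line.IndexOfAny(new[] { '-', '/' });
        if (indx == -1 || line.IndexOf(':', indx + 1) == -1)
            continue;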

        MatchCollection mc = re2.Matches(line);
        foreach (Match m in mc) {
            string a = m.ToString();
            buf.Add(a + Environment.NewLine); //Temporary, Just to check the output
        }
    }
}
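
For guideline 4, a minimal sketch of what a simplified pattern might look like, assuming the character-class rewrite suggested in the comments. The names `TimestampFinder` and `Find` and the sample line in `Main` are made up for illustration and are not part of the original code. The pattern is intended to accept the same timestamps as `re1`, just without the single-character alternations and the outer capture group.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TimestampFinder // hypothetical name, for illustration only
{
    // Built once and reused for every line. Single-character alternations
    // such as (?:2|1) are written as character classes such as [12].
    private static readonly Regex Timestamp = new Regex(
        @"[12]\d{3}[-/](?:0[1-9]|1[0-2])[-/](?:0[1-9]|[12][0-9]|3[01])[T\s]" +
        @"(?:[01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]",
        RegexOptions.Compiled);

    // Yields every timestamp found in a single line of text.
    public static IEnumerable<string> Find(string line)
    {
        foreach (Match m in Timestamp.Matches(line))
            yield return m.Value;
    }

    static void Main()
    {
        // Illustrative input, not taken from a real log file.
        string sample = "2016-06-28 10:43:00 INFO started; prev 2016/06/27T23:59:59";
        foreach (string ts in Find(sample))
            Console.WriteLine(ts);
    }
}
```

Keeping the `Regex` in a static readonly field means the compilation cost is paid once rather than once per call. Whether this is measurably faster on your logs is something to verify with a timing run.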
Vikhram

Your "Path" debug output may be concatenating a huge number of strings. Change it to a StringBuilder instead of += concatenation to see whether that is the cause of your memory issue.

Have you looked at MS Log Parser 2.2 as an alternative approach?
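
As a rough sketch of the StringBuilder suggestion (the class name `MatchCollector` and the method `Collect` are hypothetical, and the pattern is passed in rather than assumed), the matches can be accumulated in memory and handed to the UI once:

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

static class MatchCollector // hypothetical helper, for illustration only
{
    // Reads the file lazily, line by line, and returns every match of
    // `pattern`, one per line, as a single string.
    public static string Collect(string path, Regex pattern)
    {
        var sb = new StringBuilder();
        foreach (string line in File.ReadLines(path))
        {
            foreach (Match m in pattern.Matches(line))
                sb.AppendLine(m.Value); // appends in place, no full-string copy per match
        }
        return sb.ToString();
    }
}
```

With something like this, the text box would be assigned once per file, for example `Path.Text = MatchCollector.Collect(fileName, regex);`, instead of once per match. Each `Path.Text += ...` copies the entire existing text, so the cost grows with every match found.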

Steve