
I'm trying to read a huge log file in C#: approximately 300 MB of raw text data. I've been testing my program on smaller files (around 1 MB) by storing all log messages in a string[] array and searching with Contains.

However, that approach is too slow and takes up too much memory; I will never be able to process the 300 MB log file that way. I need a way to grep the file: quickly filter through it, find the useful data, and print the line of log information corresponding to each search hit.

The big question is scale. I think 300 MB will be my maximum, but my program needs to handle it. What functions, data structures, and search techniques can I use that will scale well, in both speed and memory, to read a log file that big?

Teddy
  • I'm sure there will be ways to run the grep faster, however before you do that, are you able to pre-filter the logs with faster string comparison checks before grepping? – Ben Graham Oct 05 '12 at 03:52
  • Are you using grep, or are you writing a program to do so? You may consider processing line by line instead of reading the whole file. A more complicated approach that operates independently of line length is to read a fixed number of characters at a time and process them (tricky implementation, though). – nhahtdh Oct 05 '12 at 03:52
  • If you can use .NET 4, see [this question](http://stackoverflow.com/questions/4273699/how-to-read-a-large-1-gb-txt-file-in-net) recommending StreamReader or MemoryMappedFile (a StreamReader sketch follows these comments). – Sumo Oct 05 '12 at 03:52
  • Did you measure what takes most of the time? Is it reading the file? Searching? Garbage collection? – Alexei Levenkov Oct 05 '12 at 04:49
  • The log file is so big that I want to read it and search it for the tags I'm looking for, displaying a line on a hit and ignoring it otherwise. I was storing the lines in a ListBox view (an array). I have to find a better way to get the info into C# that doesn't use as much memory, because I think a ListBox is inherently an array. – Teddy Oct 05 '12 at 14:09
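
A minimal sketch of the line-by-line approach suggested in the comments above, using StreamReader directly (MemoryMappedFile is the heavier alternative mentioned). The file name and keyword are placeholders, not part of the original question:

using System;
using System.IO;

class LogGrep {
    static void Main() {
        const string keyword = "ERROR"; // hypothetical search tag

        // StreamReader pulls the file through an internal buffer,
        // so only the current line is held in memory at a time.
        using (var reader = new StreamReader("myLargeFile.txt")) {
            string line;
            while ((line = reader.ReadLine()) != null) {
                if (line.Contains(keyword)) {
                    Console.WriteLine(line);
                }
            }
        }
    }
}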

1 Answer


File.ReadLines is probably your best bet, as it gives you an IEnumerable of the lines of the text file and reads them lazily as you iterate over the IEnumerable. You can then use whatever method you'd like to search each line (Regex, Contains, etc.) and do something with it. My example below spawns a thread per line to search it and output it to the console, but you can do just about anything. Of course, TEST, TEST, TEST on large files to see what performance mileage you get. I imagine that if each individual thread spawned below takes too long, you can run into a thread limit.

IEnumerable<string> lines = File.ReadLines("myLargeFile.txt");
foreach (string line in lines) {
    // Copy the loop variable so each closure captures its own line;
    // in C# 4 the foreach variable is shared across iterations.
    string lineInt = line;
    (new Thread(() => {
        if (lineInt.Contains(keyword)) {
            Console.WriteLine(lineInt);
        }
    })).Start();
}

EDIT: Through my own testing, this is obviously faster:

// Streams the file lazily; Where filters lines as they are read.
foreach (string lineInt in File.ReadLines("myLargeFile.txt").Where(lineInt => lineInt.Contains(keyword))) {
    Console.WriteLine(lineInt);
}
Sumo
  • Sumo, so File.ReadLines() does not read the entire file into memory at once? – Jonathan Henson Oct 05 '12 at 04:19
  • No, it simply gives you an iterator which will yield a single line from the file as you iterate the IEnumerable. – Sumo Oct 05 '12 at 04:19
  • Yeah, I think my big problem was storing them all in arrays, in memory, even when I didn't have to. I just want to be able to search for what I want, then store only the important info in memory. – Teddy Oct 05 '12 at 04:28
  • @JonathanHenson That might be true if you were implementing your own IO. In this case, you are simply using a feature of the .NET 4 framework in System.IO that presents you with a simple way to work with a file of almost any size. How it performs for you is only proven through testing. – Sumo Oct 05 '12 at 04:38
  • +1, especially for test and measure. Obviously one thread per line is shown for pure entertainment; in reality, creating more than a few threads (especially an unbounded number of threads, as here) will kill the performance of pretty much any task. – Alexei Levenkov Oct 05 '12 at 04:42
  • Thanks Sumo! That was very helpful. I'll try using these methods for reading large log files. – Teddy Oct 05 '12 at 13:54
  • Hm, it only gets the first match of the keyword. How do you keep matching afterwards until the end of the file? – Teddy Oct 08 '12 at 19:04
  • @Teddy Of course. It's using Contains, which will only tell you whether the keyword is present on each line you process, allowing you to do something with that specific line. If you need to know how many times that keyword appears on each line, you'll need to use Regex (see the sketch after these comments). – Sumo Oct 10 '12 at 12:56
  • @JonathanHenson MS didn't invent anything here. Any shock that ReadLines does buffering behind the scenes derives from some pretty basic ignorance and misunderstanding of how libraries, operating systems, and disks work. It isn't possible to read a line from the disk without reading all the blocks containing any part of the line. Sequential reading necessitates that the block containing the end of the line be either read twice or cached. Windows of course maintains an in-memory disk block cache, and every program that does disk I/O goes through the cache unless intentionally doing raw I/O. – Jim Balter Aug 10 '13 at 05:28
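
A minimal sketch of the Regex-counting idea from the last exchange. The keyword and file name are placeholders, and Regex.Escape guards against special characters in the keyword:

using System;
using System.IO;
using System.Text.RegularExpressions;

class KeywordCounter {
    static void Main() {
        const string keyword = "ERROR"; // hypothetical search tag
        var pattern = new Regex(Regex.Escape(keyword));

        foreach (string line in File.ReadLines("myLargeFile.txt")) {
            // Matches returns every non-overlapping occurrence on the line,
            // not just the first one.
            int count = pattern.Matches(line).Count;
            if (count > 0) {
                Console.WriteLine("{0} occurrence(s): {1}", count, line);
            }
        }
    }
}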