
I'd like to ask if you could give me some alternatives for my problem.

Basically, I'm reading a .txt log file that averages around 8 million lines, about 600 MB of pure raw text.

I'm currently using a StreamReader to make 2 passes over those 8 million lines, sorting and filtering out the important parts of the log file, but my computer takes ~50 seconds to do one complete run.

One way I could optimize this is to make the first pass start reading near the end, because the most important data is located in approximately the final 200k lines. Unfortunately, from what I've found, StreamReader can't do this. Any ideas?

Some general restrictions:

  • The number of lines varies
  • The size of the file varies
  • The location of the important data varies, but it is approximately within the final 200k lines

Here's the loop code for the first pass of the log file, just to give you an idea:

    Do Until sr.EndOfStream                                                  'Read the whole file
        Dim streambuff As String = sr.ReadLine
        Dim CombatLogNames() As String                                       'Array to store CombatLogNames
        Dim searcher As String

        If streambuff.Contains("CombatLogNames flags:0x1") Then              'Keyword to filter CombatLogNames packets in the .txt

            Dim check As String = streambuff                                 'Duplicate of the line being read
            Dim index1 As Char = check.Substring(check.IndexOf("(") + 1)
            Dim index2 As Char = check.Substring(check.IndexOf("(") + 2)     'Used to bypass the first CombatLogNames packet that contains only 1 entry

            If check.IndexOf("(") <> -1 And index1 <> "" And index2 <> " " Then 'Stricter filters for CombatLogNames

                Dim endCLN As Integer = 0                                    'Signals the end of the CombatLogNames packet
                Dim x As Integer = 0                                         'Counter for the array

                While endCLN = 0 And streambuff <> "---- CNETMsg_Tick"       'Loops until the end keyword for CombatLogNames is seen

                    streambuff = sr.ReadLine                                 'Reads a new line to flush out "CombatLogNames flags:0x1", which is not needed
                    If streambuff.Contains("---- CNETMsg_Tick") Or streambuff.Contains("ResponseKeys flags:0x0 ") Then

                        endCLN = 1                                           'Marks the end of the CombatLogNames packet

                    Else

                        ReDim Preserve CombatLogNames(x)                     'Resizes the array while preserving the values
                        searcher = streambuff.Trim.Remove(streambuff.IndexOf("(") - 5).Remove(0, _
                            streambuff.Trim.Remove(streambuff.IndexOf("(")).IndexOf("'")) 'Additional filtering to get only the valuable data
                        CombatLogNames(x) = search(searcher)
                        x += 1                                               '+1 to the array counter

                    End If
                End While
            Else
                'MsgBox("Something went wrong, Flame the coder of this program!!") 'Disabled debugging code
            End If
        End If

        If sr.EndOfStream Then

            ReDim GlobalArr(CombatLogNames.Length - 1)                       'Resize the global array to prime it for copying data
            Array.Copy(CombatLogNames, GlobalArr, CombatLogNames.Length)     'Copy the array to make it global

        End If
    Loop
MDuh
  • See http://stackoverflow.com/questions/452902/how-to-read-a-text-file-reversely-with-iterator-in-c-sharp - it's in C#, but you can build that into a library and use it from VB. – Jon Skeet Nov 29 '12 at 08:05
  • @OP: Instead of using `ReDim` (in a LOOP!) I'd suggest using a List(Of T) and only converting to an array (with .ToArray) when you are done - if at all. – igrimpe Nov 29 '12 at 08:23
  • Could you post some example data from this huge file? It seems you have blocks of text you are interested in. Perhaps we could devise some form of indexing that would help with subsequent reads. – Steve Nov 29 '12 at 09:18
  • [Here's a sample of 500 lines of the log file](http://pastebin.com/2S8fYtUX). But I don't think this will help me optimize the code. The first pass reads and collects the data from LINES 217-445. The second pass reads and interprets the data from LINES 1-215. LINES 450-530 are garbage for me but are read by a different program. – MDuh Nov 29 '12 at 21:38
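
As a rough sketch of the List(Of T) approach igrimpe suggests in the comments above, the question's inner collection loop could look something like this (streambuff, sr, search and GlobalArr are the question's own identifiers, and the keywords are copied from the question's code):

    ' Sketch only: same inner loop as the question, but collecting into a
    ' List(Of String) instead of calling ReDim Preserve on every line.
    Dim names As New List(Of String)
    Dim endCLN As Integer = 0

    While endCLN = 0 And streambuff <> "---- CNETMsg_Tick"
        streambuff = sr.ReadLine
        If streambuff.Contains("---- CNETMsg_Tick") Or streambuff.Contains("ResponseKeys flags:0x0 ") Then
            endCLN = 1
        Else
            Dim searcher As String = streambuff.Trim.Remove(streambuff.IndexOf("(") - 5).Remove(0, _
                streambuff.Trim.Remove(streambuff.IndexOf("(")).IndexOf("'"))
            names.Add(search(searcher))              ' no per-line array resizing
        End If
    End While

    GlobalArr = names.ToArray()                      ' one allocation, after the loop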

2 Answers


You CAN set the BaseStream to the desired reading position, you just can't set it to a specific LINE (because counting lines requires reading the complete file).

    Using sw As New StreamWriter("foo.txt", False, System.Text.Encoding.ASCII)
        For i = 1 To 100
            sw.WriteLine("the quick brown fox jumps ovr the lazy dog")
        Next

    End Using
    Using sr As New StreamReader("foo.txt", System.Text.Encoding.ASCII)
        sr.BaseStream.Seek(-100, SeekOrigin.End)
        Dim garbage = sr.ReadLine ' cannot be used, because it is very likely not a COMPLETE line
        While Not sr.EndOfStream
            Dim line = sr.ReadLine
            Console.WriteLine(line)
        End While
    End Using

For any later read attempt on the same file, you could simply save the final position (of the BaseStream) and, on the next read, advance to that position before you start reading lines.
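
For example, a minimal sketch of that idea (not from the original answer; the "foo.txt.pos" sidecar file is made up for illustration, and it needs Imports System.IO):

    ' Sketch only: remembers where the last read ended (as a byte offset) so the
    ' next run can seek straight to the newly appended data.
    Dim positionFile As String = "foo.txt.pos"

    Using sr As New StreamReader("foo.txt", System.Text.Encoding.ASCII)
        If File.Exists(positionFile) Then
            Dim lastPos As Long = Long.Parse(File.ReadAllText(positionFile))
            sr.BaseStream.Seek(lastPos, SeekOrigin.Begin)   ' jump to where the last run stopped
            sr.DiscardBufferedData()                        ' keep the reader's buffer in sync with the seek
        End If

        While Not sr.EndOfStream
            Dim line = sr.ReadLine
            ' ... process only the lines appended since the last run ...
        End While

        File.WriteAllText(positionFile, sr.BaseStream.Position.ToString()) ' save for the next run
    End Using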

igrimpe
  • I'm gonna try this and set the offset to 3/4 of average file size – MDuh Nov 29 '12 at 21:33
  • I just tried this and I'm not getting a performance boost. Can you tell me if I'm using it improperly? http://pastebin.com/raw.php?i=6rTXRQR6 – MDuh Nov 30 '12 at 04:24
  • @user1862452: I don't know which kind of "boost" you expect. You are working on 75% instead of 100%. Depending on your input, you MIGHT save 25% of the time ... or less. – igrimpe Nov 30 '12 at 08:26
  • Yes, I was expecting 25% less time. I benchmarked my old code against this seeking code but got the same running time for both. – MDuh Nov 30 '12 at 12:28
  • @user1862452: Make sure to be aware of this: http://stackoverflow.com/questions/478340/clear-file-cache-to-repeat-performance-testing – igrimpe Nov 30 '12 at 14:12
  • Alright, I will try that. I doubt it will change the time though. I did 10 tests for those 2 versions. (I have an execution timer coded into the program) – MDuh Nov 30 '12 at 19:07

What worked for me was skipping the first 4M lines (just a simple `If counter > 4M` check surrounding everything inside the loop) and then adding background workers that did the filtering and, if a line was important, added it to an array, while the main thread continued reading lines. This saved about a third of the time at the end of the day.
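
A rough sketch of that approach (not the answer's actual code; the file name, the keyword filter and the 4,000,000 threshold are placeholders) could look like this:

    ' Sketch only: the main thread reads and skips the first ~4M lines,
    ' while a background task does the filtering.
    Imports System.IO
    Imports System.Collections.Concurrent
    Imports System.Threading.Tasks

    Module FilterSketch
        Sub Main()
            Dim pendingLines As New BlockingCollection(Of String)
            Dim importantLines As New ConcurrentBag(Of String)

            ' Background worker: filters lines while the main thread keeps reading.
            Dim filterTask As Task = Task.Run(
                Sub()
                    For Each rawLine In pendingLines.GetConsumingEnumerable()
                        If rawLine.Contains("CombatLogNames flags:0x1") Then   ' placeholder filter
                            importantLines.Add(rawLine)
                        End If
                    Next
                End Sub)

            Dim lineCounter As Long = 0
            Using sr As New StreamReader("foo.txt")
                While Not sr.EndOfStream
                    Dim logLine = sr.ReadLine
                    lineCounter += 1
                    If lineCounter > 4000000 Then            ' skip the uninteresting first ~4M lines
                        pendingLines.Add(logLine)
                    End If
                End While
            End Using

            pendingLines.CompleteAdding()                    ' tell the worker no more lines are coming
            filterTask.Wait()

            Console.WriteLine("Important lines kept: " & importantLines.Count)
        End Sub
    End Module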

Scheep