2

Good Day – I have a routine that searches through tens of thousands of text files to capture relevant file names resulting in just under 10,000 of them remaining. These filenames provide the starting point for the next routine to open each of these remaining text files to search for specific data. Each text file may have anywhere between 50 and 50,000 lines of data. Looping through each file the routine first finds a serial number and then finds all occurrences of FAILED which it then captures the date/time stamp near the beginning of that line of text and populates another worksheet. This all works great for the various failure occurrences. Except for one.

(See Beginning of area of concern in code sample below) I have now come to the area where I would appreciate some guidance. There is one instance of FAILED that I need to verify if PASSED occurs three lines after the line containing FAILED. If PASSED is there, it will always be the third line below FAILED. I cannot use the date/time stamps such as PASSED occurring within a certain number of seconds after FAILED as it varies too much and may produce false results. I thought the best way to capture the first PASSED after the FAILED would be to capture the FAILED’s FirstIndex location and begin a search for PASSED from there. But, I have no idea how to do that or if it is even possible. To be honest, I don’t know if this can be done using RegEx or something in VBA that I haven’t thought of. I am just learning RegEx, so I am very weak in this area, even after everything that I have read. An explanation with an answer would be most appreciated. VBA I am very comfy with. I am using Excel 2010 Professional with Microsoft VBScript Regular Expression 5.5 referenced.

I thought that this answer might help, but if it would I do not understand it. How to get the position of submatches in VBA? Any help or guidance would be most appreciated. Thank You in advance.

Kind Regards, Mark

Sample sanitized search data

LOG: 00::01:11:03.129 [XXX_##] XXX: 3390, 3412, 3401, 3400, 3401, 3398, 3402, 3409 0090123101000172

LOG: 00::01:11:15.576 [XXX_###] XXX: 3393, 3399, 3393, 3395, 3394, 3396, 3397, 3395 0090123101000200

LOG: 00::01:11:23.568 [XXX_##] XXX: 3390, 3411, 3401, 3400, 3401, 3398, 3402, 3409 0090123101000173

LOG: 00::01:11:37.049 [XXX_###] XXX: 3393, 3400, 3393, 3394, 3394, 3396, 3396, 3395 0090123101000201

LOG: 00::01:11:53.265 [XXX_##] XXX: 3388, 3409, 3399, 3397, 3399, 3396, 3400, 3406 0090123101000129

LOG: 00::01:11:56.361 [XXX_###] XXX: 3393, 3399, 3392, 3394, 3394, 3396, 3396, 3395 0090123101000202

LOG: 00::01:12:14.596 [XXX_##] XXXX Xxxxxxxxxxxxx Xxxxxxxxxxx FAILED, Xxxxxxxxxxx: A:1, 0090123101000130

LOG: 00::01:12:16.432 [XXX_##] XXXX ADC 3401, 3402, 3401, 3399, 3399, 3401, 3399, 3401,

LOG: 00::01:12:16.502 [XXX_##] XXXX DAC 1477, 1301, 1405, 1229, 1406, 1473, 1770, 1543,

LOG: 00::01:12:16.581 [XXX_##] XXXX Xxxxxxxxxxxxx Xxxxxxxxxxx PASSED, Xxxxxxxxxxx: 1

LOG: 00::01:12:16.846 [XXX_##] XXX: 3407, 3408, 3406, 3405, 3405, 3406, 3404, 3405 0090123101000130

LOG: 00::01:12:17.406 [XXX_###] XXX: 3398, 3403, 3397, 3400, 3399, 3401, 3402, 3399 0090123101000203

LOG: 00::01:12:37.508 [XXX_##] XXX: 3402, 3402, 3400, 3398, 3400, 3401, 3400, 3401 0090123101000131

LOG: 00::01:12:38.511 [XXX_###] XXX: 3386, 3393, 3386, 3386, 3387, 3389, 3389, 3387 0090123101000204

LOG: 00::01:13:02.619 [XXX_##] XXX: 3403, 3402, 3400, 3397, 3400, 3401, 3399, 3401 0090123101000132

    Dim bFound              As Boolean          'Used to identify if sFile <> "".
    Dim dHr                 As Double    'Test  'Number of hours in dEndTime
    Dim dMin                As Double    'Test  'Number of minutes in dEndTime.
    Dim dSec                As Double    'Test  'Number of seconds in dEndTime.
    Dim dStartTime          As Double    'Test  'Time routine starts.
    Dim dEndTime            As Double    'Test  'Time routine completes.
    Dim i                   As Integer          'Array variable for rows.
    Dim iCurrentRow         As Integer          'Variable used in centering filename cells.
    Dim iNextRow            As Integer          'Used to find last row in column to add new data.
    Dim j                   As Integer          'Array variable for columns.
    Dim LastRow             As Integer          'Last row used by any column in current range.
    Dim NextRow             As Integer          'Last row of current column.
    Dim z                   As Integer          'Counter for files > 200 bytes.
    Dim lFileLen            As Long             'Length of text file.
    Dim oM                  As Object           'Single match.
    Dim oMtch               As Object           'Match collection.
    Dim oS                  As Object           'Number of matches found.
    Dim LastCol             As String           'Identify last column used.
    Dim LastColLetter       As String           'Last Column letter.
    Dim s1LastCol           As String           'Identify last column in Row 1 used.
    Dim s1LastColLetter     As String           'Last column in Row 1 letter.
    Dim sCurrCol            As String           'Numerical value of current column.
    Dim sCurrColLetter      As String           'Alphabetical value of current column.
    Dim sFile               As String           'File name to search in.
    Dim sFn                 As String           'Combined path and file to search in.
    Dim sPath               As String           'Path of file to search in.
    Dim sTxt                As String           'Variable to hold scripting.filesystemobject.
    Dim vArr                As Variant          'Array containing all finlenames.

'   Turn the following activity off to increase program speed.
    With Application
        .StatusBar = True
        .EnableEvents = False
        .DisplayAlerts = False
        .ScreenUpdating = False
    End With

    dStartTime = Now()                                   'For testing purposes ONLY.

    Sheets("Failures").Activate
    LastCol = ActiveSheet.UsedRange.SpecialCells(xlLastCell).Column
    If LastCol > 26 Then
        LastColLetter = Chr(Int((LastCol - 1) / 26) + 64) & Chr(((LastCol - 1) Mod 26) + 65)
    Else
        LastColLetter = Chr(LastCol + 64)
    End If

'   Get last row used by any column in current range.
    LastRow = ActiveSheet.UsedRange.Rows.Count

'   Set range values.
    vArr = Range("B1:" & LastColLetter & LastRow).Value
    Columns("B:" & LastColLetter).Delete Shift:=xlToLeft    'Delete previous data.

    sPath = "U:\Serial_Server_Data\"                        'Get path name.

    Sheets("Log Files").Activate

'   Will provide the last used column letter.
    LastCol = ActiveSheet.UsedRange.SpecialCells(xlLastCell).Column
    If LastCol > 26 Then
        LastColLetter = Chr(Int((LastCol - 1) / 26) + 64) & Chr(((LastCol - 1) Mod 26) + 65)
    Else
        LastColLetter = Chr(LastCol + 64)
    End If

'   Get last row used by any column in current range.
    LastRow = ActiveSheet.UsedRange.Rows.Count

'   Set range values.
    vArr = Range("C2:" & LastColLetter & LastRow).Value

'   Initialize variables.
    z = 1
    bFound = False

'   Step through files to apply Pattern to.
    For i = LBound(vArr, 1) To UBound(vArr, 1)          'Step through rows to apply Pattern to.
        For j = LBound(vArr, 2) To UBound(vArr, 2)      'Step through columns to apply Pattern to.

            If vArr(i, j) = "" Then GoTo SkipAll        'Skip cell if empty.

            sFile = vArr(i, j)                          'Get file name.
            lFileLen = GetDirOrFileSize(sPath, sFile)   'Get the file size for later use.

            If lFileLen > 200 Then          'Only search files that are over 200 bytes in length.
                Application.StatusBar = "Processing file " & z & " - " & sFile

'               Create full path with filename.
                sFn = sPath & sFile

'               Determine the next file number available for use by the FileOpen function
                sTxt = FreeFile

                sTxt = CreateObject("scripting.filesystemobject").OpenTextFile(sFn).ReadAll

'                i = 0
                With CreateObject("vbscript.regexp")    'Search for serial number.
                    .Global = False                     'Search for first instance.
                    .IgnoreCase = True                  'Select either upper or lowercase.
                    .Pattern = "Serial\sNo.\s\d{4}"
                    Set oMtch = .Execute(sTxt)
                    For Each oM In oMtch
                        For Each oS In .Execute(oM.Value)
'                            Debug.Print oS.Value

                            If oS <> vbNull Then    'Continue on only if serial number found.
                                Sheets("Failures").Activate
                                Range("A1").Activate
                                Do While ActiveCell.Value <> ""
                                    ActiveCell.Offset(0, 1).Activate

'                                   sFile already exists.
                                    If ActiveCell.Value = Right(oS.Value, 4) Then
                                        sCurrCol = ActiveCell.Column
                                        Do While ActiveCell.Value <> ""
                                            ActiveCell.Offset(1, 0).Activate
                                        Loop
                                        ActiveCell.Value = sFile

'                                       Get column letter from column number.
                                        If sCurrCol > 26 Then
                                            sCurrColLetter = Chr(Int((sCurrCol - 1) / 26) + 64) _
                                                & Chr(((sCurrCol - 1) Mod 26) + 65)
                                        Else
                                            sCurrColLetter = Chr(sCurrCol + 64)
                                        End If

'                                       Center cell.
                                        iCurrentRow = Application.WorksheetFunction.CountA(Range _
                                            (sCurrColLetter & ":" & sCurrColLetter))
                                        Range(sCurrColLetter & iCurrentRow).HorizontalAlignment _
                                            = xlCenter

'                                       Adjust the column to fit file name.
                                        Columns(sCurrColLetter & ":" & _
                                            sCurrColLetter).ColumnWidth = 35
                                        bFound = True
                                        z = z + 1
                                        Exit Do
                                    End If
                                Loop

'                               sFile doesn't exist.
                                If ActiveCell.Value = "" And bFound = False Then
                                    ActiveCell.Value = Right(oS.Value, 4)
                                    ActiveCell.Offset(1, 0).Value = sFile
                                    sCurrCol = ActiveCell.Column

'                                   Get column letter from column number.
                                    If sCurrCol > 26 Then
                                        sCurrColLetter = Chr(Int((sCurrCol - 1) / 26) + 64) _
                                            & Chr(((sCurrCol - 1) Mod 26) + 65)
                                    Else
                                        sCurrColLetter = Chr(sCurrCol + 64)
                                    End If

'                                   Center cell.
                                    iCurrentRow = Application.WorksheetFunction.CountA(Range _
                                        (sCurrColLetter & ":" & sCurrColLetter))
                                    Range(sCurrColLetter & iCurrentRow).HorizontalAlignment _
                                        = xlCenter

'                                   Adjust the column to fit file name.
                                    Columns(sCurrColLetter & ":" & sCurrColLetter).ColumnWidth _
                                        = 35
                                    z = z + 1
                                End If
                            End If
                        Next
                    Next

'   >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
'   >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
'   >>>>>>>>>>  Beginning of area of concern.
'   >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
'   >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

                    .Global = True                      'Search for instance.
                    .IgnoreCase = True                  'Select either upper or lowercase.

'                   Search for 'failed' with any amount of text on either side.
                    .Pattern = ".*failed.*"
                    Set oMtch = .Execute(sTxt)
                    For Each oM In oMtch
                        For Each oS In .Execute(oM.Value)
                            iNextRow = Application.WorksheetFunction.CountA(Range _
                                (sCurrColLetter & ":" & sCurrColLetter)) + 1
                            If Left(oS.Value, 4) = "LOG:" Then

'                               Ignore FLR-x PeakDetector Dash failure.
                                If UCase(Mid(oS.Value, 32, 3)) <> "FLR" Then

'                                   Print all other "Failed" occurances.
                                    Range(sCurrColLetter & iNextRow).Activate
                                    ActiveCell.Value = Mid(oS.Value, 6, 16)
                                End If
                            End If
                            If Mid(oS.Value, 4, 4) = "LOG:" Then
                                Range(sCurrColLetter & iNextRow).Activate
                                ActiveCell.Value = Mid(oS.Value, 9, 16)
                            End If
                        Next
                    Next
                End With
            End If

'   >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
'   >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
'   >>>>>>>>>>  End of area of concern.
'   >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
'   >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

SkipAll:
            bFound = False
            Sheets("Log Files").Activate
        Next j
    Next i

'   Cleanup: Add borders, heading background fill, remove gridlines.
GoodMD
  • 21
  • 2
  • I wouldn't even attempt to process the entire file at once. Use `Split(sTxt, vbCrLf)` to get an array of lines. The loop through the resulting array line by line looking for "FAILED". If you find that, scan forward in the array up to 3 lines looking for "PASSED". – Comintern Oct 24 '18 at 12:44
  • I would follow the lead in PeterT's answer below and read through the data with `.readline` instead of `.readall`, although that would require restructuring the code somewhat. – Egalth Oct 24 '18 at 12:46
  • @Egalth Disagree. We're talking about "tens of thousands" of files - reading them line by line is going to be slooooooow. – Comintern Oct 24 '18 at 12:49
  • (@Comintern, se my comment in PeterT's answer) – Egalth Oct 24 '18 at 12:53

1 Answers1

3

This is an indirect answer to your question: at the risk of starting a flamewar, I'm not a fan of regex. I've used it before, mostly in bash scripting in Perl, but have (almost) always been able to work around it when developing text parsing code. Below is an example of how to approach your specific problem. Clearly, adapting my approach will cause a refactoring of your code. Please consider this as an alternative.

My approach ingests the log file into a VBA Collection where each Item in the collection is a separate line. I'm using a collection instead of a String() array because the collection readily expands for an unknown number of lines, whereas an array must be ReDim-ed with prior knowledge of exactly how many lines there are (probably causing a double loop, double read of the same file).

Private Function GetFileByLines(ByVal filePath As String) As Collection
    Dim fso As FileSystemObject
    Set fso = New FileSystemObject

    Dim txtStream As Object
    Set txtStream = fso.OpenTextFile(filePath, ForReading, False)

    Dim lines As Collection
    Set lines = New Collection

    Do While Not txtStream.AtEndOfStream
        Dim line As String
        lines.Add txtStream.ReadLine
    Loop
    txtStream.Close
    Set GetFileByLines = lines
End Function

Once you have the collection of lines from the text file, it's then an easy loop with indexing to check if "PASSED" exists three lines after a "FAILED" line.

Private Sub ScanInputFile(ByVal filename As String)
    Dim fileLines As Collection
    Set fileLines = GetFileByLines(filename)

    Dim i As Long
    For i = 1 To fileLines.Count
        If LCase(fileLines(i)) Like "*failed*" Then
            '--- check to make sure we're not near the end of the file
            If i + 3 < fileLines.Count Then
                If LCase(fileLines(i + 3)) Like "*passed*" Then
                    Debug.Print "found a PASSED line three lines after FAILED, " & _
                                "lines " & i & " and " & i + 3
                End If
            End If
        End If
    Next i
End Sub

Obviously, this does not integrate directly into your whole solution as I'm not addressing parts that you have already coded and debugged. As a quick code review though, please read up on avoiding the use of Activate and Select.

Here's the whole test module I used. The "testlog.txt" file is a direct copypasta from your data above.

Option Explicit

Public Sub test()
    ScanInputFile "C:\Temp\testlog.txt"
End Sub

Private Sub ScanInputFile(ByVal filename As String)
    Dim fileLines As Collection
    Set fileLines = GetFileByLines(filename)

    Dim i As Long
    For i = 1 To fileLines.Count
        If LCase(fileLines(i)) Like "*failed*" Then
            '--- check to make sure we're not near the end of the file
            If i + 3 < fileLines.Count Then
                If LCase(fileLines(i + 3)) Like "*passed*" Then
                    Debug.Print "found a PASSED line three lines after FAILED, " & _
                                "lines " & i & " and " & i + 3
                End If
            End If
        End If
    Next i
End Sub

Private Function GetFileByLines(ByVal filePath As String) As Collection
    Dim fso As FileSystemObject
    Set fso = New FileSystemObject

    Dim txtStream As Object
    Set txtStream = fso.OpenTextFile(filePath, ForReading, False)

    Dim lines As Collection
    Set lines = New Collection

    Do While Not txtStream.AtEndOfStream
        Dim line As String
        lines.Add txtStream.ReadLine
    Loop
    txtStream.Close
    Set GetFileByLines = lines
End Function
PeterT
  • 8,232
  • 1
  • 17
  • 38
  • 2
    The approach is spot on, but I'd consider reading the entire file at once and splitting it into an array. Using a `Collection` and reading individual lines from "tens of thousands of text files" is likely to be considerably slower. – Comintern Oct 24 '18 at 12:48
  • This approach is attractive. I find that VBA's `Like` operator often does the job of `regex` simpler and faster. – Egalth Oct 24 '18 at 12:50
  • @Comintern, good point, but are you sure about that or is it a hypothesis? I have some code that uses basically the same idea and I find that reading line by line is typically very fast, less than a second for tens of thousands of lines in a bunch of files, but I haven't tried it against tens of thousands of *files*... – Egalth Oct 24 '18 at 12:51
  • 1
    @Egalth Positive. Assuming sufficient memory to hold the entire file, limiting the IO operations will remove a significant bottleneck. For individual files it probably isn't that big a deal, but it adds up. – Comintern Oct 24 '18 at 12:55
  • Are streams in VBA implemented such that files less than a certain size would be read into memory (internal to the stream) and then `ReadLine` would then perform memory accesses? (I seem to remember I/O streams in C++ could use this approach for performance improvements, addressing your exact concerns.) – PeterT Oct 24 '18 at 13:07
  • OMG, Thank You everyone for your input. I am trying to digest it now and will come back in a little while with my findings. Thank You SO much for everything. – GoodMD Oct 24 '18 at 13:15
  • @PeterT Think memory copies. The internal implementation of `ReadLine` isn't as relevant as the fact that you perform N memory allocations and copies as opposed to 1. – Comintern Oct 24 '18 at 13:19
  • PeterT, this is working perfectly. I've added a few additional lines to do exactly what is needed and this can be added to what I already have to clean up the final worksheet. Thank You so very much. I do have a question as I don't completely understand the ScanInputFile routine. I paused the code to take a snapshot of what was happening and I noticed that in the Locals window the filelines collection only has 256 lines whereas one of the files that I tested resulted in the Debug.Print displaying a PASSED find on line 24536. Could you explain how that happens? – GoodMD Oct 24 '18 at 14:44
  • If the `Debug.Print` said that PASSED was found on line 24536, then the Collection has to have more than 256 entries. I've seen collections in the locals window that have had many more than 256 items, so I'm not sure what's happening. But it appears to be working fine. You can also consider following @Comintern's advice to read the whole file in one operation and returning a string array by using `Split`, which would give you a bit better performance. – PeterT Oct 24 '18 at 16:35
  • Just out of curiosity I tested my own script, which is performing similar work using `.readline` (but no regex). Parsing one file with 3300 lines takes less than a second, while parsing 100 such files increases execution time to 10 secs. Point taken, thanks @Comintern :) – Egalth Oct 26 '18 at 09:14