1

I have a simple .txt log file to which an application adds lines as it does its work. The lines consist of a timestamp and a variable-length text:

17-06-25 06:37:43 xxxxxxxxxxxxxxx
17-06-25 06:37:46 yyyyyyy
17-06-25 06:37:50 zzzzzzzzzzzzzzzzzzzzzzzzzzzz
...

I need to extract all lines with a timestamp greater than a certain date-time. This typically is about the last, say, 20-40 log entries (lines).

The problem is, that the file is large and growing.

If all lengths would be equal, I'd invoke a binary search. But they aren't, and so I end up using something like:

Private Sub ExtractNewestLogs(dEarliest As Date)
    Dim sLine As String = ""
    Dim oSRLog As New StreamReader(gsFilLog)

    sLine = oSRLog.ReadLine()
    Do While Not (sLine Is Nothing)
        Debug.Print(sLine)
        sLine = oSRLog.ReadLine()
    Loop
End Sub

which, well, isn't really fast.

Is there a method with which I can read such files "backwards", i.e., last line first? If not, what other option do I have?

  • Could you make your logging framework create a file per day? – Steve Jun 25 '17 at 08:27
  • @Steve, thanks for the suggestion, but no, that's not feasible: other applications would be in need to be modified. –  Jun 25 '17 at 10:38
  • There isn't too much to do then. I suggest to write some kind of service that at midnight (or whenever you like) reads that log file and split it by day in a separate folder. Then your app will have a lot easier work to do. (It could also reset it to zero after that to allow a clean restart of the log after some days) – Steve Jun 25 '17 at 10:41
  • Here an example for a service running at specific time: https://stackoverflow.com/questions/19151363/windows-service-to-run-a-function-at-specified-time – Steve Jun 25 '17 at 10:43
  • @Steve, thanks again. You know, I'm actually thinking of treating this file as a binary file, doing a binary search anyway, starting at the file's mid length position, then searching for the next CRLF+1 to obtain the (fixed-sized) timestamp, and as soon as the two are identical... –  Jun 25 '17 at 11:41
  • As you like. But it could a more complex work. You need to weight well your options. And sooner or later that file size will become a problem – Steve Jun 25 '17 at 12:18
  • @Steve, well NTFS allows for file sizes up to 16 EB, with Win8/S2012 still 256 TB, that's quite many entries for my log file. –  Jun 25 '17 at 14:31
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/147583/discussion-between-herb-and-steve). –  Jun 26 '17 at 03:10

2 Answers2

1

The function below will return the last x number of characters from a file as an array of strings using a binary reader. You can then pull the last records that you want much more quickly than reading the entire log file. You can fine tune the number of bytes to read according to a rough approximation of how many bytes are taken by the last 20-40 log entries. On my pc - it took <10ms to read the last 10,000 characters of a 17mb text file.

Of course this code assumes that your log file is plain ascii text.

Private Function ReadLastbytes(filePath As String, x As Long) As String()
    Dim fileData(x - 1) As Byte
    Dim tempString As New StringBuilder
    Dim oFileStream As New FileStream(filePath, FileMode.Open, FileAccess.Read)
    Dim oBinaryReader As New BinaryReader(oFileStream)
    Dim lBytes As Long
    If oFileStream.Length > x Then
        lBytes = oFileStream.Length - x
    Else
        lBytes = oFileStream.Length
    End If
    oBinaryReader.BaseStream.Seek(lBytes, SeekOrigin.Begin)
    fileData = oBinaryReader.ReadBytes(lBytes)
    oBinaryReader.Close()
    oFileStream.Close()
    For i As Integer = 0 To fileData.Length - 1 
        If fileData(i)=0 Then i+=1
        tempString.Append(Chr(fileData(i)))
    Next
    Return tempString.ToString.Split(vbCrLf)
End Function
David Wilson
  • 4,369
  • 3
  • 18
  • 31
0

I attempted a binary search anyway, eventhough the file has not static line lengths.

First some considerations, then the code:


Sometimes it is needed, that the last n lines of a log file are extracted, based on an ascending sort key at the beginning of the line. The key really could be anything, but in log files typically represents a date-time, usually in the format YYMMDDHHNNSS (possibly with some interpunction).

Log files typically are text based files, consisting of multiple lines, at times millions of them. Often log files feature fixed-length line widths, in which case a specific key is quite easy to access with a binary search. However, probably also as often, log files have a variable line width. To access these, one can use an estimate of an average line width in order to calculate a file position from the end, and then process from there sequentially to the EOF.

But one can employ a binary approach also for this type of files, as demonstrated here. The advantage comes in, as soon as file sizes grow. A log file's maximum size is determined by the file system: NTFS allows for 16 EiB (16 x 2^60 B), theoretically; in practice under Windows 8 or Server 2012, it's 256 TiB (256 x 2^40 B).

(What 256 TiB actually means: a typical log file is designed to be readable by a human and rarely exceeds many more than 80 characters per line. Let's assume your log file logs along happily and completely uninterrupted for astonishing 12 years for a total of 4,383 days at 86,400 seconds each, then your application is allowed to write 9 entries per millisecond into said log file, to eventually meet the 256 TiB limit in its 13th year.)

The great advantage of the binary approach is, that n comparisons suffice for a log file consisting of 2^n bytes, rapidly gaining advantage as the file size becomes larger: whereas 10 comparisons are required for file sizes of 1 KiB (1 per 102.4 B), there are only 20 comparisons needed for 1 MiB (1 per 50 KiB), 30 for 1 GiB (1 per 33⅓ MiB), and a mere 40 comparisons for files sized 1 TiB (1 per 25 GiB).

To the function. These assumptions are made: the log file is encoded in UTF8, the log lines are separated by a CR/LF sequence, and the timestamp is located at the beginning of each line in ascending order, probably in the format [YY]YYMMDDHHNNSS, possibly with some interpunction in between. (All of these assumptions could easily be modified and cared for by overloaded function calls.)

In an outer loop, binary narrowing is done by comparing the provided earliest date-time to match. As soon as a new position within the stream has been found binarily, an independent forward search is made in an inner loop to locate the next CR/LF-sequence. The byte after this sequence marks the start of the record's key being compared. If this key is larger or equal the one we are in search for, it is ignored. Only if the found key is smaller than the one we are in search for its position is treated as a possible condidate for the record just before the one we want. We end up with the last record of the largest key being smaller than the searched key.

In the end, all log records except the ultimate candidate are returned to the caller as a string array.

The function requires the import of System.IO.

Imports System.IO

'This function expects a log file which is organized in lines of varying
'lengths, delimited by CR/LF. At the start of each line is a sort criterion
'of any kind (in log files typically YYMMDD HHMMSS), by which the lines are
'sorted in ascending order (newest log line at the end of the file). The
'earliest match allowed to be returned must be provided. From this the sort
'key's length is inferred. It needs not to exist neccessarily. If it does,
'it can occur multiple times, as all other sort keys. The returned string
'array contains all these lines, which are larger than the last one found to 
'be smaller than the provided sort key.
Public Shared Function ExtractLogLines(sLogFile As String,
    sEarliest As String) As String()

    Dim oFS As New FileStream(sLogFile, FileMode.Open, FileAccess.Read,
        FileShare.Read)             'The log file as file stream.
    Dim lMin, lPos, lMax As Long    'Examined stream window.
    Dim i As Long                   'Iterator to find CR/LF.
    Dim abEOL(0 To 1) As Byte       'Bytes to find CR/LF.
    Dim abCRLF() As Byte = {13, 10} 'Search for CR/LF.
    Dim bFound As Boolean           'CR/LF found.
    Dim iKeyLen As Integer = sEarliest.Length      'Length of sort key.
    Dim sActKey As String           'Key of examined log record.
    Dim abKey() As Byte             'Reading the current key.
    Dim lCandidate As Long          'File position of promising candidate.
    Dim sRecords As String          'All wanted records.

    'The byte array accepting the records' keys is as long as the provided
    'key.
    ReDim abKey(0 To iKeyLen - 1)   '0-based!

    'We search the last log line, whose sort key is smaller than the sort
    'provided in sEarliest.
    lMin = 0                        'Start at stream start
    lMax = oFS.Length - 1 - 2       '0-based, and without terminal CRLF.
    Do
        lPos = (lMax - lMin) \ 2 + lMin     'Position to examine now.

        'Although the key to be compared with sEarliest is located after
        'lPos, it is important, that lPos itself is not modified when
        'searching for the key.
        i = lPos                    'Iterator for the CR/LF search.
        bFound = False
        Do While i < lMax
            oFS.Seek(i, SeekOrigin.Begin)
            oFS.Read(abEOL, 0, 2)
            If abEOL.SequenceEqual(abCRLF) Then    'CR/LF found.
                bFound = True
                Exit Do
            End If
            i += 1
        Loop
        If Not bFound Then
            'Between lPos and lMax no more CR/LF could be found. This means,
            'that the search is over.
            Exit Do
        End If
        i += 2                              'Skip CR/LF.
        oFS.Seek(i, SeekOrigin.Begin)       'Read the key after the CR/LF
        oFS.Read(abKey, 0, iKeyLen)         'into a string.
        sActKey = System.Text.Encoding.UTF8.GetString(abKey)

        'Compare the actual key with the earliest key. We want to find the
        'largest key just before the earliest key.
        If sActKey >= sEarliest Then
            'Not interested in this one, look for an earlier key.
            lMax = lPos
        Else
            'Possibly interesting, remember this.
            lCandidate = i
            lMin = lPos
        End If
    Loop While lMin < lMax - 1

    'lCandidate is the position of the first record to be taken into account.
    'Note, that we need the final CR/LF here, so that the search for the 
    'next CR/LF sequence following below will match a valid first entry even
    'in case there are no entries to be returned (sEarliest being larger than
    'the last log line). 
    ReDim abKey(CInt(oFS.Length - lCandidate - 1))  '0-based.
    oFS.Seek(lCandidate, SeekOrigin.Begin)
    oFS.Read(abKey, 0, CInt(oFS.Length - lCandidate))

    'We're done with the stream.
    oFS.Close()

    'Convert into a string, but omit the first line, then return as a
    'string array split at CR/LF, without the empty last entry.
    sRecords = (System.Text.Encoding.UTF8.GetString(abKey))
    sRecords = sRecords.Substring(sRecords.IndexOf(Chr(10)) + 1)

    Return sRecords.Split(ControlChars.CrLf.ToCharArray(),
        StringSplitOptions.RemoveEmptyEntries)
End Function