1

I am experiencing an unexpected vb limitation on the string max size, as explained in this post: VBA unexpected reach of string size limit

While I was expecting to be able to load files up to 2GB (2^31 char) using open path for binary and get function, I get an out of string space error when I try to load a string larger than 255,918,061 characters.

I managed to work around this issue buffering the input stream of get. The problem is that I need to load the file as an array of string by splitting the buffer on vbCrLf characters.

This requires then to build the array line by line. Moreover, since I cannot be sure whether the buffer is ending on a break line or not I need additional operations. This solution is Time and Memory consuming. Loading a file of 300MB with this code costs 900MB (!) use of memory by excel. Is there a better solution ?

Here bellow is my code:

Function Load_File(path As String) As Variant
Dim MyData As String, FNum As Integer
Dim LenRemainingBytes As Long
Dim BufferSizeCurrent As Long
Dim FileByLines() As String
Dim CuttedLine As Boolean
Dim tmpSplit() As String
Dim FinalSplit() As String
Dim NbOfLines As Long
Dim LastLine As String
Dim count As Long, i As Long
Const BufferSizeMax As Long = 100000

FNum = FreeFile()
Open path For Binary As #FNum

LenRemainingBytes = LOF(FNum)
NbOfLines = FileNbOfLines(path)
ReDim FinalSplit(NbOfLines)
CuttedLine = False

Do While LenRemainingBytes > 0
    MyData = ""
    If LenRemainingBytes > BufferSizeMax Then
        BufferSizeCurrent = BufferSizeMax
    Else
        BufferSizeCurrent = LenRemainingBytes
    End If
    MyData = Space$(BufferSizeCurrent)
    Get #FNum, , MyData

    tmpSplit = Split(MyData, vbCrLf)
    If CuttedLine Then
        count = count - 1
        tmpSplit(0) = LastLine & tmpSplit(0)
        For i = 0 To UBound(tmpSplit)
            If count > NbOfLines Then Exit For
            FinalSplit(count) = tmpSplit(i)
            count = count + 1
        Next i
    Else
        For i = 0 To UBound(tmpSplit)
            If count > NbOfLines Then Exit For
            FinalSplit(count) = tmpSplit(i)
            count = count + 1
        Next i
    End If
    Erase tmpSplit

    LastLine = Right(MyData, Len(MyData) - InStrRev(MyData, vbCrLf) - 1)
    CuttedLine = Len(LastLine) > 1
    LenRemainingBytes = LenRemainingBytes - BufferSizeCurrent
Loop
Close FNum
Load_File = FinalSplit
Erase FinalSplit
End Function

Where the function FileNbOfLines is efficiently returning the number of line break characters.

Edit:

My Needs are:

  1. To look for a specific string within the file
  2. To get a specific number of lines coming after this string
Community
  • 1
  • 1
JiB
  • 41
  • 7
  • 1
    Do you actually need the entire contents of the file, or are you looking for something within the file in particular? What is the end goal here? – SierraOscar Jun 18 '15 at 07:36
  • Ok, so I figured out that I cannot avoid the increase of memory use since I'm using the split function... @SO: At the end of my routine, each line will be read. I am actually looking for a typical string in the file, reading all the lines bellow, until I reach the next typical string. Do you think a 'ReadLine' would be more efficient ? – JiB Jun 18 '15 at 07:55
  • 1
    I think using windows scripting (cmd) would be faster to search for the string you want and get the line number(s) - you can create the `WScript.Shell` object via late binding and work with that. Once you have the line numbers you can use ReadLine and target specific areas of the file to save time and memory – SierraOscar Jun 18 '15 at 07:57
  • Thank you, I will test this solution. – JiB Jun 18 '15 at 08:10
  • 1
    For file i/o, in VBA you can use early binding with the `Microsoft Scripting Runtime` and the `FileSystemObject`. Early binding gives you the advantage that intellisense works. More info can be found [here](http://stackoverflow.com/questions/3233203/how-do-i-use-filesystemobject-in-vba/3236348#3236348) – ChipsLetten Jun 18 '15 at 08:55
  • MyData = 300MB after the file read. tmpSplit = 300MB after the split. FinalSplit = up to 300MB through the course of the Function. Finally, functions return arrays by value, not by reference, so you eat up to another 300MB when you assign the return value. 900MB of memory usage sounds about right (depending on what you filter from the file). – Comintern Jun 18 '15 at 23:31
  • Erase should free memory usage, isn't it ? – JiB Jun 19 '15 at 06:53
  • @SO: Your first suggestion was obsiously the better. It is quite rare the time you really need to load 300MB in memory. I updated my code in order to know precisely the amount of data I was looking for. I post my solution here bellow. – JiB Jun 19 '15 at 13:46

2 Answers2

0

Here you go, not pretty but should give you the general concept:

Sub GetLines()

Const fileName      As String = "C:\Users\bloggsj\desktop\testfile.txt"
Const wordToFind    As String = "FindMe"
Dim lineStart       As String
Dim lineCount       As String
Dim linesAfterWord  As Long


With CreateObject("WScript.Shell")
    lineCount = .Exec("CMD /C FIND /V /C """" """ & fileName & """").StdOut.ReadAll
    lineStart = Split(.Exec("CMD /C FIND /N """ & wordToFind & """ """ & fileName & """").StdOut.ReadAll, vbCrLf)(2)
End With

linesAfterWord = CLng(Trim(Mid(lineCount, InStrRev(lineCount, ":") + 1))) - CLng(Trim(Mid(lineStart, 2, InStr(lineStart, "]") - 2)))

Debug.Print linesAfterWord

End Sub

Uses CMD to count the number of lines, then find the line at which the word appears, then subtract one from the other to give you the amount of lines after the word has been found.

SierraOscar
  • 17,507
  • 6
  • 40
  • 68
  • OK, quite shorter than what I wrote ! I will test this on Monday. – JiB Jun 19 '15 at 15:58
  • I'll need to tweak this, I mis-read your requirements and this will only return the _number_ of lines after the word is found, not the _actual_ lines. Will update when I get a chance – SierraOscar Jun 19 '15 at 16:12
0

Answer: Yes, using ReadAll from FSO should do the job.

Best answer: Just avoid it !

My needs were:

  1. Identify a specific string within the file
  2. Extract a certain number of lines after this string

As far as you precisely know the exact amout of data you want to extract, and assuming this amount of data is below vba string size limit (!), here is what it does the job the faster.

Decrease of computation time is improved using binary comparison of strings. My code is as follows:

Function GetFileLines(path As String, str As String, NbOfLines As Long) As String()
    Const BUFSIZE As Long = 100000
    Dim StringFound As Boolean
    Dim lfAnsi As String
    Dim strAnsi As String
    Dim F As Integer
    Dim BytesLeft As Long
    Dim Buffer() As Byte
    Dim strBuffer As String
    Dim BufferOverlap As String
    Dim PrevPos As Long
    Dim NextPos As Long
    Dim LineCount As Long
    Dim data As String

    F = FreeFile(0)
    strAnsi = StrConv(str, vbFromUnicode) 'Looked String
    lfAnsi = StrConv(vbLf, vbFromUnicode) 'LineBreak character

    Open path For Binary Access Read As #F

    BytesLeft = LOF(F)
    ReDim Buffer(BUFSIZE - 1)
    'Overlapping buffer is 3/2 times the size of strBuffer 
    '(two bytes per character)
    BufferOverlap = Space$(Int(3 * BUFSIZE / 4)) 

    StringFound = False
    Do Until BytesLeft = 0
        If BytesLeft < BUFSIZE Then ReDim Buffer(BytesLeft - 1)
        Get #F, , Buffer
        strBuffer = Buffer 'Binary copy of bytes.
        BytesLeft = BytesLeft - LenB(strBuffer)
        Mid$(BufferOverlap, Int(BUFSIZE / 4) + 1) = strBuffer 'Overlapping Buffer

        If Not StringFound Then 'Looking for the the string
            PrevPos = InStrB(BufferOverlap, strAnsi) 'Position of the looked string within the buffer
            StringFound = PrevPos <> 0
            If StringFound Then strBuffer = BufferOverlap
        End If
        If StringFound Then 'When string is found, loop until NbOfLines
            Do Until LineCount = NbOfLines
                NextPos = InStrB(PrevPos, strBuffer, lfAnsi)
                If NextPos = 0 And LineCount < NbOfLines Then 'Buffer end reached, NbOfLines not reached
                    'Adding end of buffer to data 
                    data = data & Mid$(StrConv(strBuffer, vbUnicode), PrevPos)
                    PrevPos = 1
                    Exit Do
                Else
                    'Adding New Line to data 
                    data = data & Mid$(StrConv(strBuffer, vbUnicode), PrevPos, NextPos - PrevPos + 1)
                End If
                PrevPos = NextPos + 1
                LineCount = LineCount + 1
                If LineCount = NbOfLines Then Exit Do
            Loop
        End If
        If LineCount = NbOfLines then Exit Do
        Mid$(BufferOverlap, 1, Int(BUFSIZE / 4)) = Mid$(strBuffer, Int(BUFSIZE / 4))
    Loop
    Close F
    GetFileLines = Split(data, vbCrLf)
End Function

To crunch even more computation time, it is highly advised to use fast string concatenation as explained here.

For instance the following function can be used:

Sub FastConcat(ByRef Dest As String, ByVal Source As String, ByRef ccOffset)
Dim L As Long, Buffer As Long
    Buffer = 50000
    L = Len(Source)
    If (ccOffset + L) >= Len(Dest) Then
        If L > Buffer Then
            Dest = Dest & Space$(L)
        Else
            Dest = Dest & Space$(Buffer)
        End If
    End If
    Mid$(Dest, ccOffset + 1, L) = Source
    ccOffset = ccOffset + L
End Sub

And then use the function as follows:

NbOfChars = 0
Do until...    
     FastConcat MyString, AddedString, NbOfChars
Loop
MyString = Left$(MyString,NbOfChars)
Community
  • 1
  • 1
JiB
  • 41
  • 7