
I have a .txt file with around 6 GB of data, fields delimited by semicolons.

I need to check one of the fields line by line against a prebuilt dictionary and, if there is a match, copy all the fields of the respective line into a two-dimensional array.

Currently this is the relevant part of the code (declarations and functions omitted; they are not in the scope of this question):

Set hbDict = dict_HB(hb) '--this returns a Dictionary from a function, used for the comparison

Set FSO = CreateObject("scripting.filesystemobject")
Set myFile = FSO.OpenTextFile(sPath & sFilename, ForReading)

'--This counts how many matches occur between txt and dictionary to redim the array:
Do While myFile.AtEndOfStream <> True
    textline = myFile.ReadLine
    arrLine = Split(textline, ";")
    If hbDict.exists(arrLine(3)) Then
        arrLimit = arrLimit + 1
    End If
Loop

ReDim MyArray(1 To arrLimit, 1 To 31)

'--Loop again through the file, now actually adding to the redimmed array
'  (the file is closed and reopened so the stream starts from the beginning again):
myFile.Close
Set myFile = FSO.OpenTextFile(sPath & sFilename, ForReading)
L = 1
Do While myFile.AtEndOfStream <> True
    textline = myFile.ReadLine
    arrLine = Split(textline, ";")
    If hbDict.exists(arrLine(3)) Then
        For c = 1 To 31
            MyArray(L, c) = arrLine(c - 1)
        Next
        L = L + 1
    End If
Loop
myFile.Close
Set FSO = Nothing

'code continues...

The first loop takes around 19 minutes, the second a little more.

I already tried opening the file for Append, but it crashes, maybe because I'm running on 4 GB of RAM. Any way of loading the entire file at once seems to crash the machine. Open for Input does not read the entire file, so data is lost. Using a collection in the first loop, to avoid looping through the txt twice, would be great if it could handle more than 256 entries... And of course, dynamically redimming the array inside the loop is out of the question, since it is a performance killer.
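For reference, by "loading the entire file at once" I mean something along these lines (just a sketch; with a 6 GB file and 4 GB of RAM this runs out of memory):

'--Sketch of the "read everything at once" attempt: ReadAll pulls the whole file
'  into a single string (assumes vbCrLf line endings), which exhausts memory here.
Dim allText As String, allLines() As String
Set myFile = FSO.OpenTextFile(sPath & sFilename, ForReading)
allText = myFile.ReadAll
myFile.Close
allLines = Split(allText, vbCrLf)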

Is there any way to do it faster than this?

Musicodelic
  • Have a look here https://stackoverflow.com/questions/11635526/return-csv-file-as-recordset (a rough sketch of that approach follows these comments) – Sam Apr 11 '18 at 17:50
  • My idea would be to read the file only once. Read the line as you do now and add the whole line to a collection in case the condition is true. – Storax Apr 11 '18 at 17:51
  • Use vb.net instead of vba. It's much faster. – braX Apr 11 '18 at 17:55
  • Hello braX, I can't use .NET since I'm in a company environment. Unfortunately, Excel VBA is the only tool I have at hand. – Musicodelic Apr 11 '18 at 17:57
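Following Sam's comment, reading the file as a recordset would look roughly like this (a sketch only, not tested against this file; it assumes a schema.ini next to the data file containing "Format=Delimited(;)" so the text driver splits on semicolons, and 64-bit Office would need the ACE provider instead of Jet):

'--Sketch of the "CSV as recordset" idea from the comments. Assumptions: a schema.ini
'  sits in sPath and declares Format=Delimited(;); swap the provider for
'  "Microsoft.ACE.OLEDB.12.0" on 64-bit Office.
Dim cn As Object, rs As Object
Set cn = CreateObject("ADODB.Connection")
cn.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & sPath & ";" & _
        "Extended Properties=""text;HDR=No;FMT=Delimited"""
Set rs = CreateObject("ADODB.Recordset")
rs.Open "SELECT * FROM [" & sFilename & "]", cn
'...filter or loop over rs here, then clean up:
rs.Close
cn.Close

Whether the text driver copes with a 6 GB file is a separate question, so the streaming approach below may still be the safer bet.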

1 Answer


Change the first loop to

Dim colLines As Collection
Set colLines = New Collection

Do While Not myFile.AtEndOfStream
    textline = myFile.ReadLine
    arrLine = Split(textline, ";")
    If hbDict.exists(arrLine(3)) Then
        'arrLimit = arrLimit + 1
        colLines.Add textline
    End If
Loop

And the second loop

Dim i As Long
ReDim MyArray(1 To colLines.Count, 1 To 31)

L = 1   'reset the row counter before filling the new array
For i = 1 To colLines.Count
    textline = colLines(i)
    arrLine = Split(textline, ";")
    If hbDict.exists(arrLine(3)) Then
        For c = 1 To 31
            MyArray(L, c) = arrLine(c - 1)
        Next
        L = L + 1
    End If
Next i

This way you only need to read the text file once. Because the file is so big, you won't be able to read it into memory completely anyway.
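A small variation on the two loops above (a sketch along the same lines, not benchmarked): store the already-split field array in the collection instead of the raw line, so Split only runs once per line.

'--Variation: keep the split fields instead of the raw line, so each line is split once.
Do While Not myFile.AtEndOfStream
    textline = myFile.ReadLine
    arrLine = Split(textline, ";")
    If hbDict.exists(arrLine(3)) Then
        colLines.Add arrLine          'store the zero-based field array itself
    End If
Loop

ReDim MyArray(1 To colLines.Count, 1 To 31)
For i = 1 To colLines.Count
    For c = 1 To 31
        MyArray(i, c) = colLines(i)(c - 1)
    Next
Next i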

Storax
  • I already tried it myself and was confused, since the debugger only shows 256 entries in the collection, but a little research suggests that it is a debugger limitation, not the collection's... is that right? Will test again and come back – Musicodelic Apr 11 '18 at 18:19
  • Yes, it seems that the debugger has a limitation. I just read an 800 MB file line by line into a collection and I can only see 256 entries in the debugger. The collection itself does have more than 600000 entries and I can access items with an index greater than 256. You always learn something. – Storax Apr 11 '18 at 18:28
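For anyone who wants to verify this themselves, a minimal test (the CheckCollectionSize routine below is just an illustration, not from the thread):

'--A Collection happily holds far more than 256 items; only the Locals/Watch
'  window stops displaying entries after the first 256.
Sub CheckCollectionSize()
    Dim colTest As Collection, n As Long
    Set colTest = New Collection
    For n = 1 To 1000
        colTest.Add "line " & n
    Next n
    Debug.Print colTest.Count        'prints 1000
    Debug.Print colTest(500)         'items past index 256 remain accessible
End Sub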