0

I have a huge .csv file with over 4.5 million lines. As this is to big for Excel I need to search the .csv file for any entries from each user and then sum them but the sum needs to be done for a specific month.

Excel
USER Month total value
AAH Febuary 2010 1014

CSV
"USER","DATE_TIME","NUMBER"
"AAH","2010-03-18T17:35:01.000Z","410.0"
"ABH","2011-01-24T09:43:01.000Z","336.0"
"AAH","2010-03-18T19:25:01.000Z","114.0"
"BhC","2012-06-24T03:45:01.000Z","336.0"
"AAH","2010-03-20T19:30:01.000Z","490.0"

Can you help me with a solution ?

pnuts
  • 58,317
  • 11
  • 87
  • 139
Tupolev
  • 11
  • 2
  • 1
    This can be done much more easily using [MS Query](http://www.exceluser.com/explore/msquery2_1.htm) on the CSV instead (assuming you know SQL, it should be faster than any VBA solution). – jmac Aug 21 '13 at 01:17

1 Answers1

0

You can do it with (tweak to taste) the below. It works on your test data (duplicated to 5.5 million rows or around 230MB it takes about 30 secs on my laptop. No doubt, if performance is vital, it can be improved but it is probably sufficiently fast for your purposes).

Option Explicit

Sub GetData()

    Dim fso As Object
    Dim fs As Object
    Dim results As Collection
    Dim arr
    Dim i As Long
    Dim monthOfInterest As Integer
    Dim recordMonth As Date
    Dim recordUser As String
    Dim recordValue As Variant
    Dim recordKey As String
    Dim result As Variant
    Dim str As String, splitStr() As String
    Dim ws As Worksheet

    Set fso = CreateObject("Scripting.FileSystemObject")

    Set results = New Collection

    'enter your path here or use something like FileDialog
    Set fs = fso.OpenTextFile("C:\test.csv", ForReading, False, TristateFalse) 'TristateTrue if Unicode
    monthOfInterest = 3

    If not fs.AtEndOfStream Then fs.SkipLine 'skip past headers at top of CSV

    Do While Not fs.AtEndOfStream
        splitStr = Split(fs.ReadLine, ",")
        If fs.Line Mod 10000 = 0 Then
            Application.StatusBar = "Line " & fs.Line
            DoEvents
        End If
        recordMonth = DateSerial( _
                        Mid(splitStr(1), 2, 4), _
                        Mid(splitStr(1), 7, 2), 1)
        If month(recordMonth) = monthOfInterest Then
            recordUser = Mid(splitStr(0), 2, Len(splitStr(0)) - 2)
            recordValue = CDec(Mid(splitStr(2), 2, Len(splitStr(2)) - 2))
            recordKey = recordUser & "|" & Format(recordMonth, "YYYY-MM")
            On Error Resume Next
            result = results(recordKey)
            If Err.Number <> 5 Then 'key exists
                results.Remove recordKey
                recordValue = recordValue + result(2)
            End If
            On Error GoTo 0
            results.Add Array(recordUser, recordMonth, recordValue), recordKey
        End If
    Loop
    fs.Close

    Application.StatusBar = "Outputting..."

    'Process results and dump to worksheet
    If results.Count > 0 Then
        Set ws = ActiveWorkbook.Worksheets.Add
        ReDim arr(0 To results.Count, 0 To 2)
        arr(0, 0) = "User"
        arr(0, 1) = "Month"
        arr(0, 2) = "Total"
        For i = 1 To UBound(arr, 1)
            arr(i, 0) = results(i)(0)
            arr(i, 1) = results(i)(1)
            arr(i, 2) = results(i)(2)
        Next i
        ws.Range(ws.Cells(1, 1), ws.Cells(1 + UBound(arr, 1), 1 + UBound(arr, 2))).Value = arr
    End If

    Application.StatusBar = ""

End Sub
Cor_Blimey
  • 3,260
  • 1
  • 14
  • 20
  • My excel down at the moment but I have a request for you. This project im working with is suppose to be don and I still have problems with it. Would you consider helping me with the rest of the solution. I will pay you for your time? – Tupolev Aug 20 '13 at 20:24
  • @Tupolev sorry, but I am employed full-time. A bit of coding is just for fun. Good luck though. There are various professional Microsoft Excel MVPs out there who freelance -> google them – Cor_Blimey Aug 20 '13 at 20:26
  • Well I will explain it shortly and maybe you find this alot easier then I do. I have the large CSV file with user points earned and I also have two smaller CSV files, one with points rewarded and on with points being spend. No the hard part is that points rewarede or earne ather april 1 2011 will expire after thee years. So I need to find out how many points are being used and hwo many are looking to expire. – Tupolev Aug 20 '13 at 20:33
  • @Tupolev with all due respect but that is next to useless. What error? What line? With what data does the error occur? I assure you the sub runs fine with your test data duplicated to 5 million rows so either the test data doesn't fully represent the structure of your input data or there is some other issue, both of which need more information than the banal and uninformative "I get an error" to have any idea what is the actual issue...! – Cor_Blimey Aug 21 '13 at 23:03