0

I have a file that looks like the following;

- Visitor ID,Revenue,Channel,Flight
- 1234,100,Email,BA123
- 2345,200,PPC,BA112
- 456,150,Email,BA456

I need to produce a file that contains;

The count of distinct Visitor IDs (3)
The total revenue (450)
The count of each Channel
Email 2
PPC 2
The count of each Flight
BA123 1
BA112 1
BA456 1

So far I have the following code, however when executing this on the 350MB file, it takes too long and in some cases breaks the memory limit. As I have to run this function on multiple columns, it is going through the file many times. I ideally need to do this in one file pass.

$file = 'log.txt'

function GroupBy($columnName)
{
    $objects = Import-Csv -Delimiter "`t" $file | Group-Object $columnName |
       Select-Object @{n=$columnName;e={$_.Group[0].$columnName}}, Count

      for($i=0;$i -lt $objects.count;$I++) {
     $line += $columnName +"|"+$objects[$I]."$columnName" +"|Count|"+ $objects[$I].'Count' + $OFS

    }
    return $line
}

$finalOutput += GroupBy "Channel"
$finalOutput += GroupBy "Flight"


Write-Host $finalOutput

Any help would be much appreciated.

Thanks,

Craig

Dee_wab
  • 1,171
  • 1
  • 10
  • 23
  • Have you looked at ? https://stackoverflow.com/questions/13050408/import-csv-groupby-sum?rq=1 – JRB May 23 '18 at 08:10

2 Answers2

2

The fact that your are importing the CSV again for each column is what is killing your script. Try to do the loading once, then re-use the data. For example:

$data = Import-Csv .\data.csv

$flights = $data | Group-Object Flight -NoElement | ForEach-Object {[PsCustomObject]@{Flight=$_.Name;Count=$_.Count}}
$visitors = ($data | Group-Object "Visitor ID" | Measure-Object).Count
$revenue = ($data | Measure-Object Revenue -Sum).Sum
$channel = $data | Group-Object Channel -NoElement | ForEach-Object {[PsCustomObject]@{Channel=$_.Name;Count=$_.Count}}

You can display the data like this:

"Revenue : $revenue"
"Visitors: $visitors"
$flights | Format-Table -AutoSize
$channel | Format-Table -AutoSize
boxdog
  • 7,894
  • 2
  • 18
  • 27
  • Thanks for the reply - I did try your code out on a 350MB file. There is now 12 GB of memory being used and what I can only imagine to be a lot of paging going on. Has Powershell always been this awful ?! – Craig Fitches May 23 '18 at 09:37
  • PowerShell is based on .NET Framework, which has it's own garbage collection that runs in the background. Objects are not always immediately disposed of (thus freeing memory). You can force .NET to garbage collect now by using `[GC]::Collect()`, though it [isn't usually recommended](https://stackoverflow.com/questions/478167/when-is-it-acceptable-to-call-gc-collect). – boxdog May 23 '18 at 09:42
  • When I hit greater amounts of data I find that Group-By can be slow and memory consuming. In those cases I usually just use hash maps to get exactly what I am looking for. – Zerqent May 23 '18 at 09:59
  • Exception of type 'System.OutOfMemoryException' was thrown. At C:\Users\fitches\Documents\Clients\TUI\GMP Data Summary\gmp.ps1:19 char:1 + $visitors = ($data | Group-Object "CMid Visitor" | Measure-Object).Co ... + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + CategoryInfo : OperationStopped: (:) [], OutOfMemoryException + FullyQualifiedErrorId : System.OutOfMemoryException – Craig Fitches May 23 '18 at 10:01
0

This will probably work - using hashmaps.

  • Pros: It will be faster/use less memory.
  • Cons: It is less readable by far than Group-Object, and requires more code.
  • Make it even less memory-hungry: Read the CSV-file line by line

    $data = Import-CSV -Path "C:\temp\data.csv" -Delimiter ","
    $DistinctVisitors = @{}
    $TotalRevenue = 0
    $ChannelCount = @{}
    $FlightCount = @{}
    
    $data | ForEach-Object {
        $DistinctVisitors[$_.'Visitor ID'] = $true
        $TotalRevenue += $_.Revenue
    
        if (-not $ChannelCount.ContainsKey($_.Channel)) {
            $ChannelCount[$_.Channel] = 0
        }
        $ChannelCount[$_.Channel] += 1
    
        if (-not $FlightCount.ContainsKey($_.Flight)) {
            $FlightCount[$_.Flight] = 0
        }
        $FlightCount[$_.Flight] += 1
    }
    
    $DistinctVisitorsCount = $DistinctVisitors.Keys | Measure-Object | Select-Object -ExpandProperty Count
    
    Write-Output "The count of distinc Visitor IDs $DistinctVisitorsCount"
    Write-Output "The total revenue $TotalRevenue"
    Write-Output "The Count of each Channel"
    $ChannelCount.Keys | ForEach-Object {
        Write-Output "$_ $($ChannelCount[$_])"
    }
    Write-Output "The count of each Flight"
    $FlightCount.Keys | ForEach-Object {
        Write-Output "$_ $($FlightCount[$_])"
    }
    
Zerqent
  • 483
  • 3
  • 9