
Let's say that I have several CSV files and I need to check a specific column and find values that exist in one file but not in any of the others. I'm having a bit of trouble coming up with the best way to go about it, as I want to use Compare-Object and ideally keep all columns, not just the one that contains the values I'm checking.

So I do indeed have several CSV files, and they all have a Service Code column. I'm trying to create, for each file, a list of the Service Codes that appear only in that file. So I would have "Service Codes only in CSV1", "Service Codes only in CSV2", etc.

Based on some testing and a semi-related question, I've come up with a workable solution, but with all of the nesting and For loops, I'm wondering if there is a more elegant method out there.

Here's what I do have:

# -Filter rather than -Include here: -Include only applies with -Recurse or a wildcard path
$files = Get-ChildItem -LiteralPath "C:\temp\ItemCompare" -Filter "*.csv"
$HashList = [System.Collections.Generic.List[System.Collections.Generic.HashSet[String]]]::New()
For ($i = 0; $i -lt $files.Count; $i++){
    # Collect each file's Service Codes into its own HashSet
    $TempHashSet = [System.Collections.Generic.HashSet[String]]::New([String[]](Import-Csv $files[$i])."Service Code")
    $HashList.Add($TempHashSet)
}

# Reduce each file's set to the codes that appear in no other file
$FinalHashList = [System.Collections.Generic.List[System.Collections.Generic.HashSet[String]]]::New()
For ($i = 0; $i -lt $HashList.Count; $i++){
    $UniqueHS = [System.Collections.Generic.HashSet[String]]::New($HashList[$i])
    For ($j = 0; $j -lt $HashList.Count; $j++){
        #Skip the check when the HashSet would be compared to itself
        If ($j -eq $i){Continue}
        $UniqueHS.ExceptWith($HashList[$j])
    }
    $FinalHashList.Add($UniqueHS)
}

It seems a bit messy to me using so many fully qualified .NET type names, and I know I could make it cleaner with a using namespace System.Collections.Generic statement (sketched below), but I'm wondering if there is a way to make it work using Compare-Object, which was my first attempt, or even just a simpler/more efficient way to filter each file.
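
For reference, a trimmed sketch of what the first loop looks like with that statement in place (using namespace has to appear at the very top of the script):

using namespace System.Collections.Generic

$HashList = [List[HashSet[String]]]::New()
For ($i = 0; $i -lt $files.Count; $i++){
    $HashList.Add([HashSet[String]]::New([String[]](Import-Csv $files[$i])."Service Code"))
}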

immobile2
  • Which one of the CSVs you have is the "Reference CSV", or are you looking to compare all of them and find unique values? Also, are you looking for efficiency or elegance? This is personal opinion, but a `hashset` is a lot more elegant and efficient than `Compare-Object` (again, personal preference) – Santiago Squarzon Jan 25 '22 at 20:30
  • Can there be duplicates of service codes within a single file? – zett42 Jan 25 '22 at 20:30
  • There most likely won't be duplicates within a single file, 99.9% of the time. @SantiagoSquarzon - I'm trying to compare each one to all the others. When I tried `Compare-Object`, it got pretty nested and probably doesn't qualify as elegant by any definition – immobile2 Jan 25 '22 at 20:50
  • I don't see you having `[System.StringComparer]::OrdinalIgnoreCase` on your `hashset`; is that something you should worry about? (having two codes that have the same value except for a difference in case sensitivity) – Santiago Squarzon Jan 25 '22 at 21:00
  • I could toss case logic in as a safety check, but everything should be converted to upper case in the source files, so that isn't a big concern. The `HashSet` process isn't too bad and is rather quick, but now I'm starting to think I might need to export the results including columns other than "Service Code", so before I converted everything to `HashTables` with `PSCustomObject`s as the keys, I thought I would check if there is a better approach entirely, as I couldn't get `Compare-Object` to be happy – immobile2 Jan 25 '22 at 21:07

4 Answers


I believe I found an "elegant" solution based on Group-Object, using only a single pipeline:

# Import all CSV files. 
Get-ChildItem $PSScriptRoot\csv\*.csv -File -PipelineVariable file | Import-Csv | 

    # Add new column "FileName" to distinguish the files.
    Select-Object *, @{ label = 'FileName'; expression = { $file.Name } } |

    # Group by ServiceCode to get a list of files per distinct value. 
    Group-Object ServiceCode |

    # Filter by ServiceCode values that exist only in a single file.
    # Sort-Object -Unique takes care of possible duplicates within a single file.
    Where-Object { ( $_.Group.FileName | Sort-Object -Unique ).Count -eq 1 } |

    # Expand the groups so we get the original object structure back.
    ForEach-Object Group |

    # Format-Table requires sorting by FileName, for -GroupBy.
    Sort-Object FileName |

    # Finally pretty-print the result.
    Format-Table -Property ServiceCode, Foo -GroupBy FileName 

Test Input

a.csv:

ServiceCode,Foo
1,fop
2,fip
3,fap

b.csv:

ServiceCode,Foo
6,bar
6,baz
3,bam
2,bir
4,biz

c.csv:

ServiceCode,Foo
2,bla
5,blu
1,bli

Output

   FileName: b.csv    

ServiceCode Foo       
----------- ---       
4           biz       
6           bar       
6           baz       

   FileName: c.csv    

ServiceCode Foo       
----------- ---       
5           blu  

Looks correct to me. The values 1, 2 and 3 appear in multiple files, so they are excluded. 4, 5 and 6 each exist in only a single file; 6 is duplicated, but only within a single file.

Understanding the code

Maybe it is easier to understand how this code works by looking at the intermediate output of the pipeline, as produced by the Group-Object line:

Count Name                      Group
----- ----                      -----
    2 1                         {@{ServiceCode=1; Foo=fop; FileName=a.csv}, @{ServiceCode=1; Foo=bli; FileName=c.csv}}
    3 2                         {@{ServiceCode=2; Foo=fip; FileName=a.csv}, @{ServiceCode=2; Foo=bir; FileName=b.csv}, @{ServiceCode=2; Foo=bla; FileName=c.csv}}
    2 3                         {@{ServiceCode=3; Foo=fap; FileName=a.csv}, @{ServiceCode=3; Foo=bam; FileName=b.csv}}
    1 4                         {@{ServiceCode=4; Foo=biz; FileName=b.csv}}
    1 5                         {@{ServiceCode=5; Foo=blu; FileName=c.csv}}
    2 6                         {@{ServiceCode=6; Foo=bar; FileName=b.csv}, @{ServiceCode=6; Foo=baz; FileName=b.csv}}

Here the Name contains the unique ServiceCode values, while Group "links" the data to the files.

From here it should already be clear how to find values that exist only in a single file. If duplicate ServiceCode values within a single file weren't allowed, we could even simplify the filter to Where-Object Count -eq 1. Since it was stated that dupes within a single file may exist, we need the Sort-Object -Unique to count multiple equal file names within a group as only one.

zett42
  • `Group-Object` - I had originally thought about doing something similar to this when I came across `Compare-Object`. By the time I gave up on getting `Compare` to work last night though, I went straight for a `hash set` and completely forgot about `Group-Object`! I'll give this a shot, but it looks promising and doesn't loop through all the files multiple times – immobile2 Jan 25 '22 at 21:33
  • @immobile2 This was an interesting exercise for me. I hope it works out for you. – zett42 Jan 25 '22 at 21:40
  • Yeah this is perfect actually - plenty to expand on if needed, but a great approach. I usually don't even try to keep things in a single pipeline which probably explains why I've never even used `-PipelineVariable` before. Your explanation on the intermediary output is extremely helpful as well. `Group-Object` is definitely a bit underrated considering how powerful it can be – immobile2 Jan 25 '22 at 21:52
  • This is a bit off topic, but you obviously know what you're doing and I'm not sure how to move this to a chat. 1) I always seem to have trouble with `Add-Member` and whether I need to use `ForEach-Object{$_|Add-Member}` or I can just pipe directly into it. Why did you use `Select-Object` for a calculated property vs. `Add-Member`? – immobile2 Jan 27 '22 at 18:23
  • @immobile2 A big difference is that `Add-Member` modifies the object in-place, whereas `Select-Object` creates a copy before adding the member(s); see the sketch below. In our case it probably wouldn't matter much, but in general I like to keep the original objects unmodified, which makes the program easier to reason about. So I rarely use `Add-Member` at all. – zett42 Jan 27 '22 at 19:05
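
A minimal sketch of that difference, using a throwaway object (names purely illustrative):

$original = [pscustomobject]@{ ServiceCode = '1' }

# Add-Member mutates the existing object in-place
$original | Add-Member -NotePropertyName FileName -NotePropertyValue 'a.csv'

# Select-Object leaves its input untouched and emits a new object with the extra column
$copy = $original | Select-Object *, @{ label = 'Source'; expression = { 'b.csv' } }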

It is not completely clear what you expect as output.
If it is just the ServiceCode values themselves, this is essentially a duplicate of a few earlier questions.

But taking it that you actually want the related objects and file names, you might use this approach:

$HashTable = @{}
ForEach ($File in Get-ChildItem .\*.csv) {
    ForEach ($Object in (Import-Csv $File)) {
        # Overwrite the entry per ServiceCode while counting how often the code has been seen
        $HashTable[$Object.ServiceCode] = $Object | Select-Object *,
            @{ n='File'; e={ $File.Name } },
            @{ n='Count'; e={ $HashTable[$Object.ServiceCode].Count + 1 } }
    }
}
$HashTable.Values | Where-Object Count -eq 1
iRon
  • This looks nice and succinct and is probably faster than my long pipeline. The code currently does not output a ServiceCode value when it is duplicated only in a single file. This could be solved by storing the file names in a `HashSet` instead of incrementing a counter, then using `HashSet.Count` in the `Where-Object` condition (see the sketch below). – zett42 Jan 26 '22 at 12:21
  • Unfortunately I didn't find those related questions when I was looking for a method to the madness, but I do think my exercise is essentially a duplicate of those, since what I wanted as output wasn't even fully clear to me when I started the project. At first I just wanted the list of intersecting values (or those that don't intersect, I suppose), but realized that returning the full objects and finishing by writing them to a file would be more helpful than just the values. Thanks for linking those related answers, which are superb, and for tackling my question! – immobile2 Jan 27 '22 at 16:52
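
A rough sketch of zett42's suggested fix, keeping a second hashtable (a hypothetical $FilesPerCode helper) that tracks the distinct file names seen per ServiceCode:

$HashTable = @{}
$FilesPerCode = @{}
ForEach ($File in Get-ChildItem .\*.csv) {
    ForEach ($Object in (Import-Csv $File)) {
        if (-not $FilesPerCode.ContainsKey($Object.ServiceCode)) {
            $FilesPerCode[$Object.ServiceCode] = [System.Collections.Generic.HashSet[String]]::New()
        }
        [void]$FilesPerCode[$Object.ServiceCode].Add($File.Name)
        $HashTable[$Object.ServiceCode] = $Object | Select-Object *, @{ n='File'; e={ $File.Name } }
    }
}
# A code duplicated only within one file still qualifies, because it has a single distinct file name
$HashTable.Values | Where-Object { $FilesPerCode[$_.ServiceCode].Count -eq 1 }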

Here is my take on this fun exercise. I'm using a similar approach to yours with the HashSet, but adding [System.StringComparer]::OrdinalIgnoreCase and leveraging the .Contains(..) method:

using namespace System.Collections.Generic

# Generate Random CSVs:
$charset = 'abABcdCD0123xXyYzZ'
$ran = [random]::new()
$csvs = @{}
foreach($i in 1..50) # Create 50 CSVs for testing
{
    $csvs["csv$i"] = foreach($z in 1..50) # With 50 Rows
    {
        $index = (0..2).ForEach({ $ran.Next($charset.Length) })
        
        [pscustomobject]@{
            ServiceCode = [string]::new($charset[$index])
            Data = $ran.Next()
        }
    }
}

# Get Unique 'ServiceCode' per CSV:
$result = @{}
foreach($key in $csvs.Keys)
{
    # Get all unique `ServiceCode` from the other CSVs
    $tempHash = [HashSet[string]]::new(
        [string[]]($csvs[$csvs.Keys -ne $key].ServiceCode),
        [System.StringComparer]::OrdinalIgnoreCase
    )
    # Filter the unique `ServiceCode`
    $result[$key] = foreach($line in $csvs[$key])
    {
        if(-not $tempHash.Contains($line.ServiceCode))
        {
            $line
        }
    }
}

# Test that the code worked:
# if anything is returned from here, it means it didn't work
foreach($key in $result.Keys)
{
    $tmp = $result[$result.Keys -ne $key].ServiceCode
    foreach($val in $result[$key])
    {
        if($val.ServiceCode -in $tmp)
        {
            $val
        }
    }
}
Santiago Squarzon
  • @immobile2 omg you're right, not sure why `$result = @{}` is not here. I might have missed it when copy-pasting... I have updated. Sorry for that. – Santiago Squarzon Jan 26 '22 at 18:19

I was able to get the unique items as follows:

# Get all items of the CSVs in a single variable, adding the file name as the last column
$CSVs = Get-ChildItem "C:\temp\ItemCompare\*.csv" | ForEach-Object {
    $CSV = Import-Csv -Path $_.FullName
    $FileName = $_.Name
    $CSV | Select-Object *, @{ N='Filename'; E={ $FileName } }
}
foreach ($line in $CSVs) {
    $ServiceCode = $line.ServiceCode
    $file = $line.Filename
    # Emit the row only if no other file contains the same ServiceCode
    if (!($CSVs | Where-Object { $_.ServiceCode -eq $ServiceCode -and $_.Filename -ne $file })) {
        $line
    }
}
Mahmoud Moawad
  • Thanks! This approach definitely works, but it seems to be quite a bit slower than what I was clocking with `Compare-Object`, `Group-Object`, or using a ton of hash tables and hash sets. Which makes sense, because I was leaning toward a dictionary/hash table of some sort for the speed of lookups and compares; with the number of files that I have in the folder, the `Where-Object` filtering here seems to bring things to a near standstill (a dictionary-based variant is sketched below) – immobile2 Jan 25 '22 at 21:56
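
For completeness, a rough sketch of the dictionary-style lookup that comment alludes to, reusing the $CSVs variable from this answer; grouping once up front replaces the repeated full-table scans:

# Group all rows once, keyed by ServiceCode (string keys)
$byCode = $CSVs | Group-Object ServiceCode -AsHashTable -AsString
foreach ($line in $CSVs) {
    # Look up only the rows that share this code instead of rescanning everything
    $others = $byCode[$line.ServiceCode] | Where-Object Filename -ne $line.Filename
    if (-not $others) { $line }
}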