To add to iRon's great answer, I've done a speed comparison of 5 different ways to do it, plus iRon's original as a baseline, using 250k lines of the OP's example.
Reading with `Get-Content -Raw` and writing with `Set-Content` is the fastest way to do it, at least in these examples: at 2.4s against 7.2s, it is roughly 3x faster than plain `Get-Content` and `Set-Content`.
I was curious to see how the `HashSet` method stacked up against the `System.Collections.ArrayList` one. As the results below show (2.7s in Example 3 against 2.9s in Example 5), the two are not too dissimilar.
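To make the difference between the two collection types concrete, here is a minimal sketch. The sample keys are hypothetical and not taken from the OP's file. It shows that `HashSet.Add` reports whether a value is new, which is what the de-duplication check relies on, while `ArrayList.Add` returns an index and keeps duplicates.

```powershell
# Hypothetical sample keys, not taken from the OP's file
$keys = 'a', 'b', 'a', 'c', 'b'

$set  = [System.Collections.Generic.HashSet[string]]::new()
$list = [System.Collections.ArrayList]::new()

foreach ($key in $keys) {
    # HashSet.Add returns $true only the first time a value is seen
    $isNew = $set.Add($key)
    # ArrayList.Add returns the index of the new item, so discard it
    [void]$list.Add($key)
    "{0}: new = {1}" -f $key, $isNew
}

"HashSet count:   $($set.Count)"    # 3 (duplicates dropped)
"ArrayList count: $($list.Count)"   # 5 (duplicates kept)
```

One thing to keep in mind if a `HashSet` is used as the output buffer: `HashSet<T>` is documented as an unordered collection, so an `ArrayList` or `List[string]` is the safer choice when the output line order matters.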
Edit note: I got the `-Raw` switch to work. Its output needed splitting on newlines, because `-Raw` returns the whole file as a single string.
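As a quick illustration of why the split is needed, the sketch below writes a hypothetical three-line temp file and reads it both ways. Note that it splits with the regex-based `-split '\r?\n'` operator, which is a substitute I chose for the illustration because it handles both CRLF and LF; it is not the `.Split([Environment]::NewLine, ...)` call used in the examples below.

```powershell
# Hypothetical temp file, used only for this illustration
$tmp = [IO.Path]::GetTempFileName()
[IO.File]::WriteAllText($tmp, "one`r`ntwo`r`nthree")

$perLine = Get-Content $tmp        # one string per line
$raw     = Get-Content -Raw $tmp   # the whole file as a single string

# Regex split that accepts both CRLF and LF line endings
$split = $raw -split '\r?\n'

"Without -Raw: $(@($perLine).Count) items"
"With -Raw:    $(@($raw).Count) item"
"After split:  $($split.Count) items"

Remove-Item $tmp
```

If the file ends with a trailing newline, the split produces one extra empty string at the end, so it may be worth filtering out empty lines before processing.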
$fileIn = "C:\Users\user\Desktop\infile.txt"
$fileOut = "C:\Users\user\Desktop\outfile.txt"
# All examples below tested with 250,000 lines
# In order from fastest to slowest
#
# EXAMPLE 1 (Fastest)
#
# [Finished in 2.4s]
# Using the -raw switch only with Get-Content
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$fileInSplit = (Get-Content -raw $fileIn).Split([Environment]::NewLine,[StringSplitOptions]::None)
$fileInSplit | ForEach-Object {
    if ($Unique.Add(($_.Split('/'))[-2])) { $_ }
} | Set-Content $fileOut
#
# EXAMPLE 2 (2nd fastest)
#
# [Finished in 2.5s]
# Using the -raw switch with Get-Content
# Using [IO.File] for write only
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$fileInSplit = (Get-Content -raw $fileIn).Split([Environment]::NewLine,[StringSplitOptions]::None)
$contentToWriteArr = New-Object System.Collections.ArrayList
$fileInSplit | ForEach-Object {
    if ($Unique.Add(($_.Split('/'))[-2])) { [void]$contentToWriteArr.Add($_) }
}
[IO.File]::WriteAllLines($fileOut, $contentToWriteArr)
#
# EXAMPLE 3 (3rd fastest example)
#
# [Finished in 2.7s]
# Using [IO.File] for the read and write
$Unique = [System.Collections.Generic.HashSet[string]]::new()
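# NOTE: ReadAllLines already returns one string per line, so the .Split() here
# is effectively a per-line no-op (the same applies to Examples 4 and 5)
$fileInSplit = ([IO.File]::ReadAllLines($fileIn)).Split([Environment]::NewLine,[StringSplitOptions]::None)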
$contentToWriteArr = [Collections.Generic.HashSet[string]]::new()
$fileInSplit | ForEach-Object {
    if ($Unique.Add(($_.Split('/'))[-2])) { $contentToWriteArr.Add($_) | Out-Null }
}
[IO.File]::WriteAllLines($fileOut, $contentToWriteArr)
#
# EXAMPLE 4 (4th fastest example)
#
# [Finished in 2.8s]
# Using [IO.File] for the read only
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$fileInSplit = ([IO.File]::ReadAllLines($fileIn)).Split([Environment]::NewLine,[StringSplitOptions]::None)
$fileInSplit | ForEach-Object {
    if ($Unique.Add(($_.Split('/'))[-2])) { $_ }
} | Set-Content $fileOut
#
# EXAMPLE 5 (5th fastest example)
#
# [Finished in 2.9s]
# Using [IO.File] for the read and write
# This is using a System.Collections.ArrayList instead of a HashSet
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$fileInSplit = ([IO.File]::ReadAllLines($fileIn)).Split([Environment]::NewLine,[StringSplitOptions]::None)
$contentToWriteArr = New-Object System.Collections.ArrayList
$fileInSplit | ForEach-Object {
    if ($Unique.Add(($_.Split('/'))[-2])) { $contentToWriteArr.Add($_) | Out-Null }
}
[IO.File]::WriteAllLines($fileOut, $contentToWriteArr)
#
# EXAMPLE 6 (Slowest example) - As per iRon's answer
#
# [Finished in 7.2s]
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$fileInSplit = Get-Content $fileIn
$fileInSplit | ForEach-Object {
    if ($Unique.Add(($_.Split('/'))[-2])) { $_ }
} | Set-Content $fileOut
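If you want to reproduce the comparison on your own machine, `Measure-Command` is one option. The sketch below is only an assumption about how such a test could be set up: it times a `-Raw` variant against the plain `Get-Content` variant on a small, hypothetical generated file (the URL shape, file names, and line count are invented for the illustration), and it uses `-split '\r?\n'` rather than the `.Split()` call from Example 1. The absolute numbers will therefore not match the ones above.

```powershell
# Hypothetical input: 20,000 lines in which every key appears twice
$fileIn  = Join-Path ([IO.Path]::GetTempPath()) 'infile-demo.txt'
$fileOut = Join-Path ([IO.Path]::GetTempPath()) 'outfile-demo.txt'
$lines = foreach ($i in 1..20000) { "https://example.com/item/$($i % 10000)/page" }
[IO.File]::WriteAllLines($fileIn, [string[]]$lines)

# Variant A: read the whole file at once, then split into lines
$rawTime = Measure-Command {
    $Unique = [System.Collections.Generic.HashSet[string]]::new()
    (Get-Content -Raw $fileIn) -split '\r?\n' | ForEach-Object {
        # Skip the empty string left behind by the trailing newline
        if ($_ -and $Unique.Add(($_.Split('/'))[-2])) { $_ }
    } | Set-Content $fileOut
}

# Variant B: let Get-Content emit one line at a time
$perLineTime = Measure-Command {
    $Unique = [System.Collections.Generic.HashSet[string]]::new()
    Get-Content $fileIn | ForEach-Object {
        if ($Unique.Add(($_.Split('/'))[-2])) { $_ }
    } | Set-Content $fileOut
}

"Get-Content -Raw: {0:N2}s" -f $rawTime.TotalSeconds
"Get-Content:      {0:N2}s" -f $perLineTime.TotalSeconds

# Sanity check: one output line per unique key
$written = @(Get-Content $fileOut).Count
"Lines written: $written"

Remove-Item $fileIn, $fileOut
```

Timings from a single run can vary quite a bit, so it is worth running each variant several times and comparing averages before drawing conclusions.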