
I have a TXT file with thousands of lines. The number after the first slash is the image ID. I want to delete lines so that only one line remains for every ID. Which of the lines gets removed doesn't matter.

I tried piping the TXT to a CSV with PowerShell and working with the unique parameter, but it didn't work. Any ideas how I can iterate through the TXT and remove lines so that only one line per unique ID remains? :/

Status Today

thumbnails/4000896042746/2021-08-17_4000896042746_small.jpg
thumbnails/4000896042746/2021-08-17_4000896042746_smallX.jpg
thumbnails/4000896042333/2021-08-17_4000896042746_medium.jpg
thumbnails/4000896042444/2021-08-17_4000896042746_hugex.jpg
thumbnails/4000896042333/2021-08-17_4000896042746_tiny.jpg

After the script

thumbnails/4000896042746/2021-08-17_4000896042746_small.jpg
thumbnails/4000896042333/2021-08-17_4000896042746_medium.jpg
thumbnails/4000896042444/2021-08-17_4000896042746_hugex.jpg
Julian

4 Answers


If it concerns a "TXT file with thousands of lines", I would use the PowerShell pipeline for this because (if correctly set up) it will perform about the same but use far less memory. Performance improvements might actually be leveraged from using a Hashtable (or a HashSet), which is based on hashed lookups (and is therefore much faster than e.g. grouping).
(I am pleading to get an accelerated HashSet #16003 into PowerShell)

$Unique = [System.Collections.Generic.HashSet[string]]::new() 
Get-Content .\InFile.txt |ForEach-Object {
    if ($Unique.Add(($_.Split('/'))[-2])) { $_ }
} | Set-Content .\OutFile.txt
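Two things make this one-liner work: the ID is the second-to-last path segment, and the HashSet's Add() method returns $true only the first time a value is added, so only the first line per ID is passed on to Set-Content. A quick illustration with one of the sample lines from the question:

$line = 'thumbnails/4000896042746/2021-08-17_4000896042746_small.jpg'
($line.Split('/'))[-2]        # -> 4000896042746 (the ID segment)

$Unique = [System.Collections.Generic.HashSet[string]]::new()
$Unique.Add('4000896042746')  # -> True  (first occurrence, line is kept)
$Unique.Add('4000896042746')  # -> False (duplicate ID, line is dropped)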
iRon
  • Thanks, @iRon. I did some testing for the reading of the file. Using `Get-Content -raw $fileIn` will speed it up even more. But `[IO.File]::ReadAllText($fileIn)` is twice as fast as using `Get-Content $fileIn`. Not that you didn't already know this. – Ste Aug 29 '22 at 21:52
  • It was wrong of me to suggest the `-raw` switch as it doesn't work as-is. I'll post a speed test with some examples below. – Ste Aug 29 '22 at 23:16
  • Ignore the last comment as I got the `-raw` switch to work. It now comes in at 2.4 seconds vs. this method at 7.2 seconds for 250k lines. I've updated my answer below. But thanks iRon for the initial code. – Ste Aug 30 '22 at 21:32

To add to iRon's great answer, I've done a speed comparison of five different ways to do it (plus iRon's original as a baseline), using 250k lines of the OP's example.

Reading with Get-Content -raw and writing with Set-Content is the fastest way to do it, at least in these examples, as it is nearly 3x faster than plain Get-Content with Set-Content.

I was also curious to see how the HashSet method stacked up against the System.Collections.ArrayList one, and as you can see from the results below, they're not too dissimilar.

Edit note: I got the -raw switch to work; the raw content just needed splitting on newlines.

$fileIn = "C:\Users\user\Desktop\infile.txt"
$fileOut = "C:\Users\user\Desktop\outfile.txt"

# All examples below tested with 250,000 lines
# In order from fastest to slowest

#
# EXAMPLE 1 (Fastest)
#
# [Finished in 2.4s]
# Using the -raw switch only with Get-Content
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$fileInSplit = (Get-Content -raw $fileIn).Split([Environment]::NewLine,[StringSplitOptions]::None)

$fileInSplit |ForEach-Object {
  if ($Unique.Add(($_.Split('/'))[-2])) { $_ }
} | Set-Content $fileOut

#
# EXAMPLE 2 (2nd fastest)
#
# [Finished in 2.5s]
# Using the -raw switch with Get-Content
# Using [IO.File] for write only
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$fileInSplit = (Get-Content -raw $fileIn).Split([Environment]::NewLine,[StringSplitOptions]::None)
$contentToWriteArr = New-Object System.Collections.ArrayList

$fileInSplit |ForEach-Object {
  if ($Unique.Add(($_.Split('/'))[-2])) { [void]$contentToWriteArr.Add($_) }
}
[IO.File]::WriteAllLines($fileOut, $contentToWriteArr)

#
# EXAMPLE 3 (3rd fastest example)
#
# [Finished in 2.7s]
# Using [IO.File] for the read and write
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$fileInSplit = ([IO.File]::ReadAllLines($fileIn)).Split([Environment]::NewLine,[StringSplitOptions]::None)
$contentToWriteArr = [Collections.Generic.HashSet[string]]::new()

$fileInSplit |ForEach-Object {
  if ($Unique.Add(($_.Split('/'))[-2])) { $contentToWriteArr.Add($_) | out-null }
}
[IO.File]::WriteAllLines($fileOut, $contentToWriteArr)

#
# EXAMPLE 4 (4th fastest example)
#
# [Finished in 2.8s]
# Using [IO.File] for the read only
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$fileInSplit = ([IO.File]::ReadAllLines($fileIn)).Split([Environment]::NewLine,[StringSplitOptions]::None)

$fileInSplit |ForEach-Object {
  if ($Unique.Add(($_.Split('/'))[-2])) { $_ }
} | Set-Content $fileOut

#
# EXAMPLE 5 (5th fastest example)
#
# [Finished in 2.9s]
# Using [IO.File] for the read and write
# This is using a System.Collections.ArrayList instead of a HashSet
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$fileInSplit = ([IO.File]::ReadAllLines($fileIn)).Split([Environment]::NewLine,[StringSplitOptions]::None)
$contentToWriteArr = New-Object System.Collections.ArrayList

$fileInSplit |ForEach-Object {
  if ($Unique.Add(($_.Split('/'))[-2])) { $contentToWriteArr.Add($_) | out-null }
}
[IO.File]::WriteAllLines($fileOut, $contentToWriteArr)

#
# EXAMPLE 6 (Slowest example) - As per iRon's answer
#
# [Finished in 7.2s]
$Unique = [System.Collections.Generic.HashSet[string]]::new()
$fileInSplit = Get-Content $fileIn

$fileInSplit |ForEach-Object {
  if ($Unique.Add(($_.Split('/'))[-2])) { $_ }
} | Set-Content $fileOut
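If you want to time any of these variants end-to-end, here is a minimal sketch (assuming PowerShell 7's pwsh is on the PATH; with Windows PowerShell use powershell.exe instead). It runs each test in a fresh session, as iRon suggests in the comments below, and wraps the whole solution in Measure-Command rather than just the file read:

# Hypothetical timing harness: run the whole pipeline in a fresh process
# and report the elapsed seconds for the complete solution.
pwsh -NoProfile -Command {
    Measure-Command {
        $Unique = [System.Collections.Generic.HashSet[string]]::new()
        Get-Content 'C:\Users\user\Desktop\infile.txt' | ForEach-Object {
            if ($Unique.Add(($_.Split('/'))[-2])) { $_ }
        } | Set-Content 'C:\Users\user\Desktop\outfile.txt'
    } | Select-Object -ExpandProperty TotalSeconds
}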
Ste
  • Wow, great work Ste! – Julian Aug 31 '22 at 04:52
  • A few comments on your performance testing. **1.** Due to (disk) caching it is important that you open a new PowerShell session for each test. **2.** If you're testing native PowerShell, you should *not* assign the contents of the file to a variable (`$fileInSplit =`) as that will choke the pipeline (see my [actual answer](https://stackoverflow.com/a/69674862), you might also check the -low!- memory usage here) **3.** You're not showing which part you test, but to compare against PowerShell you should measure the *whole* solution, see also [this answer](https://stackoverflow.com/a/59437162/1701026). – iRon Aug 31 '22 at 06:12
  • The native PowerShell solution might still be slower but mind the **Note** in [**PowerShell scripting performance considerations**](https://learn.microsoft.com/powershell/scripting/dev-cross-plat/performance/script-authoring-considerations): *Many of the techniques described here are not idiomatic PowerShell and may reduce the readability of a PowerShell script. Script authors are advised to use idiomatic PowerShell unless performance dictates otherwise.* – iRon Aug 31 '22 at 06:18
  • The native PowerShell function is the fastest, and that's with the added `-raw` switch. I was only trying different methods. I test in Sublime Text and it fires up a brand-new instance every time. – Ste Sep 01 '22 at 20:13

You can group by a custom property. If you know which part is your ID, you just have to group by that and then take the first element of each group:

$content = Get-Content "path_to_your_file"

# Group the lines by the ID (the segment after the first slash)
# and keep only the first line of each group.
$content = $content | Group-Object { ($_ -split "/")[1] } | ForEach-Object { $_.Group[0] }

$content | Out-File "path_to_your_result_file"
Adrian Kokot

Here is a solution that uses a calculated property to create an object containing the ID and the FileName. I then group the result by ID, iterate over each group, and select the first FileName:

$yourFileList = @(
    'thumbnails/4000896042746/2021-08-17_4000896042746_small.jpg',
    'thumbnails/4000896042746/2021-08-17_4000896042746_smallX.jpg',
    'thumbnails/4000896042333/2021-08-17_4000896042746_medium.jpg',
    'thumbnails/4000896042444/2021-08-17_4000896042746_hugex.jpg',
    'thumbnails/4000896042333/2021-08-17_4000896042746_tiny.jpg'
)

$yourFileList |
Select-Object @{ Name = 'Id'; Expression = { ($_ -split '/')[1] } }, @{ Name = 'FileName'; Expression = { $_ } } |
Group-Object Id |
ForEach-Object { $_.Group[0].FileName }
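For the sample list above this leaves exactly one line per ID, which matches the desired result from the question:

thumbnails/4000896042746/2021-08-17_4000896042746_small.jpg
thumbnails/4000896042333/2021-08-17_4000896042746_medium.jpg
thumbnails/4000896042444/2021-08-17_4000896042746_hugex.jpg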
Martin Brandl