
If possible, I need to remove duplicate lines, in place, from multiple text files in a directory, in PowerShell.

I've found a way to get the list of duplicate lines:

Get-Content "$path\*.*" | Group-Object | Where-Object { $_.Count -gt 1 } | Select -ExpandProperty Name

Now I think a foreach loop will be useful, but I don't know how to handle the remove action in place...

Can someone help me please?

EDIT: I've changed the title of the question to avoid misunderstanding!

EDIT 2 (based on Olaf hint):

PS C:\Users\Robbi> $mypath = "F:\DATA\Urls_CP"
PS C:\Users\Robbi> Get-ChildItem -Path $mypath -Filter * |
>>     ForEach-Object{
>>         $Content =
>>         Get-Content -Path $_.FullName | Sort-Object -Unique
>>         $Content | Out-File -FilePath $_.FullName
>>     }

PS C:\Users\Robbi> Get-Content $mypath\* | Select-String "https://httpd.apache.org/docs/2.4/mod/mod_md.html"

https://httpd.apache.org/docs/2.4/mod/mod_md.html
https://httpd.apache.org/docs/2.4/mod/mod_md.html

But something has changed: I copied the original folder named "Urls" and ran your code on the copied folder "Urls_CP"; "Urls_CP" is about 200 KB bigger than the original "Urls"!

Just for info, each file is a PowerShell-manipulated "access.log" from a Squid proxy on a Linux VM, but I've checked the encoding and the presence of "strange" chars with Notepad++. (I don't have access to the Linux shell.)
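The size growth is most likely an encoding effect: in Windows PowerShell, Out-File writes UTF-16LE ("Unicode") by default, which roughly doubles the size of plain-ASCII text. A minimal variant of the EDIT 2 loop that pins the output encoding (utf8 here is just an example; -Encoding ascii or default would match the original size even more closely):

Get-ChildItem -Path $mypath -Filter * |
    ForEach-Object {
        # Per-file dedupe as in EDIT 2.
        $Content = Get-Content -Path $_.FullName | Sort-Object -Unique
        # Pin the encoding so the rewritten file does not balloon to UTF-16.
        $Content | Out-File -FilePath $_.FullName -Encoding utf8
    }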

This is an extract of one of the files inside the "Urls" folder:

https://community.checkpoint.com/t5/API-CLI-Discussion-and-Samples/can-anybody-let-me-know-how-can-we-import-policy-rules-via-csv/td-p/20839
https://community.checkpoint.com/t5/API-CLI-Discussion-and-Samples/Python-tool-for-exporting-importing-a-policy-package-or-parts-of/td-p/41100
https://community.checkpoint.com/t5/General-Management-Topics/R80-10-API-bug-fallback-to-quot-SmartCenter-Only-quot-after/m-p/5074
https://github.com/CheckPointSW/cp_mgmt_api_python_sdk
https://github.com/CheckPointSW/cpAnsible/issues/2
https://github.com/CheckPointSW/ExportImportPolicyPackage/issues
https://stackoverflow.com/questions/15031694/installing-python-packages-from-local-file-system-folder-to-virtualenv-with-pip
https://stackoverflow.com/questions/24627525/fatal-error-in-launcher-unable-to-create-process-using-c-program-files-x86
https://stackoverflow.com/questions/25749621/whats-the-difference-between-pip-install-and-python-m-pip-install
https://stackoverflow.com/questions/42494229/how-to-pip-install-a-local-python-package

EDIT 3:

Please forgive me, I'll try to explain myself better!

I would like to maintain the structure of the "Urls" folder, which contains multiple files; I would like to remove the duplicates (or replace them with "$null") on an all-files basis, while preserving each file in the folder, i.e. not one big file with all the HTTP addresses inside! In EDIT 2 I showed Olaf that the string "https://httpd.apache.org/docs/2.4/mod/mod_md.html" is still duplicated, because it is present in both "$mypath\file1.txt" and "$mypath\file512.txt"! I now understand that Olaf's code checks for duplicates on a per-file basis (thanks to @Lee_Dailey I see what was unclear in my question!).

EDIT 4:

$SourcePath = 'F:\DATA\Urls_CP'
$TargetPath = 'F:\DATA\Urls_CP\DeDupe'

# Collect every line from every *.txt file, keeping track of which file
# it came from and that file's original LastWriteTime.
$UrlList = Get-ChildItem -Path $SourcePath -Filter *.txt |
    ForEach-Object {
        $FileName = $_.BaseName
        $FileLWT = (Get-ItemProperty $_.FullName).LastWriteTime
        Get-Content -Path $_.FullName -Encoding default |
            ForEach-Object {
                [PSCustomObject]@{
                    URL  = $_
                    File = $FileName
                    LWT  = $FileLWT
                }
            }
    }

# Keep one instance of each URL across all files, write each surviving
# line back to a file of the same name under $TargetPath, and restore
# the original timestamp.
$UrlList |
    Sort-Object -Property URL -Unique |
        ForEach-Object {
            $TargetFile = Join-Path -Path $TargetPath -ChildPath ($_.File + '.txt')
            $_.URL | Out-File -FilePath $TargetFile -Append -Encoding default
            Set-ItemProperty $TargetFile -Name LastWriteTime -Value $_.LWT
        }
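A quick sanity check on the result, reusing the Group-Object one-liner from the top of the question: any output here would be a URL that still occurs more than once across the deduplicated files.

Get-Content "$TargetPath\*.txt" | Group-Object | Where-Object { $_.Count -gt 1 } | Select -ExpandProperty Name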
ilRobby
  • Could you please share some samples of the data you're dealing with and the expected result? Did you try to search for a solution? [Powershell remove duplicate lines in text files](https://stackoverflow.com/search?q=[powershell]+remove+duplicate+lines+in+text+files) – Olaf May 15 '20 at 23:27
  • what do you mean by "duplicate lines in a path"? do you mean dupe lines in a _file_ found in a directory path? do you mean any line that exists more than once in the lines found in ALL the files on a directory path? – Lee_Dailey May 15 '20 at 23:52
  • @Olaf Hi Olaf, all the files contain http/s URLs, one per line. I've searched, but I've only found solutions that send the contents of all the files to one file... my goal is to remove, or replace with an empty line, in place, the duplicates in each file. – ilRobby May 16 '20 at 00:15
  • @Lee_Dailey Hi Lee, yes, as I wrote in the example code, by path I mean a directory that contains a lot of text files, and I would like to catch the duplicate lines and remove/replace them, in place, so that only one unique line remains. – ilRobby May 16 '20 at 00:20
  • @ilRobby - you want to remove dupe lines _in one file at a time_, not lines that are duplicated across all the files as a group? – Lee_Dailey May 16 '20 at 01:00
  • @Lee_Dailey Lee, my approach was just a try, it doesn't mean it's the right one! :) – ilRobby May 16 '20 at 02:10
  • @ilRobby - i'm sorry that i failed to make myself clear. your description - to me - is unclear and can mean "remove dupes on a per-file basis" OR "remove dupes on an all-files basis". the two require different approaches. – Lee_Dailey May 16 '20 at 06:20
  • @Lee_Dailey - Forgive me Lee, the fault is mine; my English is perhaps worse than my PowerShell scripting!! – ilRobby May 16 '20 at 18:17
  • @ALL - In EDIT 3 I've tried to explain myself better than I have so far! – ilRobby May 16 '20 at 18:18
  • @ilRobby - neat! thank you for the clarification. [*grin*] it looks like `Olaf` has the Answer to your Question now. glad to know that you got it working as needed. – Lee_Dailey May 16 '20 at 22:16

1 Answer


Your explanation from Edit #3 makes even less sense, I think. What is this task actually for?

$SourcePath = 'F:\DATA\Urls_CP'
$TargetPath = 'F:\DATA\Urls_CP\DeDupe'

# Read every line of every *.log file and pair it with the name of the
# file it came from.
$UrlList = Get-ChildItem -Path $SourcePath -Filter *.log |
    ForEach-Object {
        $FileName = $_.BaseName
        Get-Content -Path $_.FullName -Encoding default |
            ForEach-Object {
                [PSCustomObject]@{
                    URL  = $_
                    File = $FileName
                }
            }
    }

# Keep exactly one instance of each URL across all files and append it
# to a file of the same name in the target folder.
$UrlList |
    Sort-Object -Property URL -Unique |
        ForEach-Object {
            $TargetFile = Join-Path -Path $TargetPath -ChildPath ($_.File + '.log')
            $_.URL | Out-File -FilePath $TargetFile -Append -Encoding default
        }

The target folder has to exist in advance.
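If it might not, a small guard like this can create it first:

# Create the target folder if it is missing.
if (-not (Test-Path -Path $TargetPath)) {
    New-Item -Path $TargetPath -ItemType Directory | Out-Null
}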

Olaf
  • Thanks Olaf! Please, see EDIT 2 of question. – ilRobby May 16 '20 at 01:55
  • @ilRobby Then, if this is not what you want (uniqify the lines per file), please edit your question and show us what your desired output will be. A new text file containing all lines from all *.log files deduped perhaps? – Theo May 16 '20 at 10:21
  • Changed the code. Try it now and *play* a little bit with the encoding. – Olaf May 16 '20 at 11:41
  • @Olaf - Sorry Olaf, I explained myself badly; your first version was closer to my needs! Please look at the "EDIT 3" note! – ilRobby May 16 '20 at 18:25
  • Hi Olaf, I think this is the answer! The audit dep. asked me to make the Squid logs "more readable", with output formatted to certain characteristics, to be subsequently imported into a big data engine... I don't know the real purpose! I've tried to maintain the original "last write time" of each file and it seems to work; I followed your example code, but can you check EDIT 4 to see if I've chosen the correct way? Just one last tip Olaf, is there a method to keep the removed URLs? Just to be sure that the script works properly, without comparing each file individually... – ilRobby May 21 '20 at 12:10
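As for keeping the removed URLs: one possible sketch, building on the $UrlList collected in EDIT 4. Since Sort-Object -Unique keeps only one occurrence per URL, the "removed" set is approximated here as every occurrence beyond the first in each group; the CSV path is just a placeholder.

# Every occurrence of a URL beyond the first is one the dedupe pass drops.
$Removed = $UrlList |
    Group-Object -Property URL |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group | Select-Object -Skip 1 }
$Removed | Export-Csv -Path 'F:\DATA\Urls_CP\removed_urls.csv' -NoTypeInformation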