1

file.txt:

Hello How are you 
what are you are doing?
This is great

Final file output:

Hello How are you
This is great

Here, I want to remove the whole line when a word is repeated twice or more within that line (in line 2 "are" is repeated twice, so I want to remove that line), using either a batch script or a PowerShell script.

pppery
  • my understanding was that regular expressions count words in the whole file and not by line – reach4thesky Jan 31 '20 at 18:14
  • Does this answer your question? [How do I find and remove duplicate lines from a file using Regular Expressions?](https://stackoverflow.com/questions/1573361/how-do-i-find-and-remove-duplicate-lines-from-a-file-using-regular-expressions) – Andre Nuechter Jan 31 '20 at 18:17
  • 1
    no, I want to delete a line when a word appears more than once in one line. In the above example, the word "are" is repeated twice, so I want to delete the whole line. The link you gave me deletes a line only if the whole line is a duplicate. – reach4thesky Jan 31 '20 at 18:20
  • Then maybe [this](https://stackoverflow.com/questions/18768727/notepad-deleting-lines-containing-duplicate-words) will – Andre Nuechter Jan 31 '20 at 18:22
  • @AndreNuechter: Your second link too pertains to a different problem (and it's not a PowerShell / cmd question). – mklement0 Feb 01 '20 at 17:10

5 Answers

1

Using PowerShell's switch statement with the -Regex option enables a concise solution:

# Create a sample file
@'
Hello How are you 
what are you are doing?
This is great
'@ > file.txt

switch -Regex -File file.txt {
  '\b(\w+)\b.+\b\1\b' { continue } # line with duplicate words -> skip
  default { $_ } # duplicate-free line -> output
}

To send the above to a file, wrap the entire switch statement in & { ... } and pipe to Set-Content.
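
For example (filtered.txt below is just a placeholder output name):

& {
  switch -Regex -File file.txt {
    '\b(\w+)\b.+\b\1\b' { continue } # line with duplicate words -> skip
    default { $_ }                   # duplicate-free line -> output
  }
} | Set-Content filtered.txt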

The regex (regular expression) above uses a backreference (\1) to the first capture group ((...)) to match a previously matched word (\w+) again (and uses word-boundary assertions (\b) to make sure that only whole words are matched again).

PowerShell uses .NET's System.Text.RegularExpressions.Regex type behind the scenes - for the supported constructs, see the .NET regex-language quick reference.
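
For instance, a quick sanity check with the -match operator (which uses the same engine) shows which of the sample lines the pattern flags:

'what are you are doing?' -match '\b(\w+)\b.+\b\1\b'   # True  -> line is skipped
'Hello How are you'       -match '\b(\w+)\b.+\b\1\b'   # False -> line is kept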

mklement0
0

EDIT: I misread the question as asking for either bash or PowerShell, instead of batch, but I'm leaving my answer anyway for those who might need it. Sorry for the confusion.

Not the most elegant solution, but it uses plain bash string comparisons, no regex:

#!/bin/bash

sentences=""                                  # words from all lines processed so far
while read -r line; do
  found=0
  for word in $line; do                       # each word of the current line
    for scan in $sentences; do                # each word from earlier lines
      [[ $word == "$scan" ]] && found=1
    done
  done
  [[ $found -eq 0 ]] && echo "$line" >> output.txt
  sentences="${sentences} $line"              # remember this line's words
done < file.txt

So basically, read every line in file.txt:

Set found to 0.

For each word in the line, and for each word seen in earlier lines, check if there's a match; if yes, set found to 1.

If found is still 0, output the line; otherwise do nothing.

EDIT: Here is a more verbose version showing you what's happening:

#!/bin/bash

sentences=""
while read -r line; do
  found=0
  echo "Scanning line : $line"
  for word in $line; do
    echo "Scanning word : $word"
    for scan in $sentences; do
      [[ $word == "$scan" ]] && found=1
    done
  done
  [[ $found -eq 0 ]] && echo "$line" >> output.txt
  sentences="${sentences} $line"
  echo "Words to check : $sentences"

done < file.txt
Dexirian
0

There is probably a more elegant way to do this. It builds a hashtable counting each word in the line; if every word occurs only once, the line is output.

Get-Content './dupfile.txt' |
    ForEach-Object {
        $words = $_ -split ' '
        $allUnique = $true
        $wordhash = @{}
        foreach ($word in $words) {
            if (($word -ne '') -and ($wordhash[$word] -gt 0)) {
                $allUnique = $false
                break;
            }
            $wordhash[$word]++
        }

        if ($allUnique) { "$_" }
    }
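
To illustrate the counting idea on the duplicate-word sample line (a minimal standalone sketch, not part of the pipeline above):

$wordhash = @{}
foreach ($word in 'what are you are doing?' -split ' ') {
    $wordhash[$word]++
}
$wordhash   # 'are' ends up with a count of 2, so this line would be dropped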
lit
0

This is a PowerShell way that is not that elegant. It relies on Group-Object to count how many times each word occurs in each line.

Get-Content file.txt | ForEach-Object {
  if (([regex]::Matches($_,'\w+').Value | Group-Object | Select-Object -Expand Count | Measure-Object -Maximum).Maximum -eq 1) {
    $_
  }
}
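
To see what Group-Object produces for the duplicate-word line (just an illustration, not part of the pipeline above):

[regex]::Matches('what are you are doing?', '\w+').Value |
    Group-Object |
    Select-Object Name, Count   # 'are' has Count 2, so the maximum is not 1 and the line is dropped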
AdminOfThings
0

No regex required. And, when you look at this code 6 months from now, you'll quickly figure out how it works :-)

All you need to do is compare the line's sorted list of words with a deduplicated version of that same list. If they match, there are no duplicate words, so output the line. Otherwise, don't output the line.

Code

cls

$fileContent = Get-Content -LiteralPath "C:\temp\file.txt" 
$out = ""

# Step through each line. Build a sorted list of the line's words and a deduplicated version of that list.
# Output the line only if the deduplicated list matches the full list.

foreach ($line in $fileContent)
{
    #trim leading and trailing spaces; change to lower case so that Select-Object -Unique treats differently-cased words as duplicates
    $line = $line.Trim().ToLower()

    #not sure if Select-Object -Unique requires a sorted list - sort it to make sure
    $lineWordsSorted = @($line.Split(" ") | Sort-Object)
    $uniqueLineWordsSorted = @($lineWordsSorted | Select-Object -Unique)

    if (($lineWordsSorted -join "") -eq ($uniqueLineWordsSorted -join ""))
    {
        $out += $line + [Environment]::NewLine
    }
}

Set-Content -LiteralPath "C:\temp\fileOut.txt" -Force -Value $out
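
As a quick check of the sorted-vs-deduped comparison on the duplicate-word line (a standalone sketch whose variable names simply mirror the code above):

$lineWordsSorted       = @('what are you are doing?'.Trim().ToLower().Split(" ") | Sort-Object)
$uniqueLineWordsSorted = @($lineWordsSorted | Select-Object -Unique)
($lineWordsSorted -join "") -eq ($uniqueLineWordsSorted -join "")   # False -> the line is not output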

Input File

(screenshot of file.txt, the input shown in the question)

Output File

(screenshot of fileOut.txt, with the duplicate-word line removed)

VA systems engineer