15

Is it possible to remove duplicate rows from a text file? If yes, how?

aschipfl
Rocshy

8 Answers

13

Sure can, but like most text file processing with batch, it is not pretty, and it is not particularly fast.

This solution ignores case when looking for duplicates, and it sorts the lines. The name of the file is passed in as the 1st and only argument to the batch script.

@echo off
setlocal disableDelayedExpansion
set "file=%~1"
set "sorted=%file%.sorted"
set "deduped=%file%.deduped"
::Define a variable containing a linefeed character
set LF=^


::The 2 blank lines above are critical, do not remove
sort "%file%" >"%sorted%"
>"%deduped%" (
  set "prev="
  for /f usebackq^ eol^=^%LF%%LF%^ delims^= %%A in ("%sorted%") do (
    set "ln=%%A"
    setlocal enableDelayedExpansion
    if /i "!ln!" neq "!prev!" (
      endlocal
      (echo %%A)
      set "prev=%%A"
    ) else endlocal
  )
)
>nul move /y "%deduped%" "%file%"
del "%sorted%"

This solution is case sensitive and it leaves the lines in the original order (except for duplicates of course). Again the name of the file is passed in as the 1st and only argument.

@echo off
setlocal disableDelayedExpansion
set "file=%~1"
set "line=%file%.line"
set "deduped=%file%.deduped"
::Define a variable containing a linefeed character
set LF=^


::The 2 blank lines above are critical, do not remove
>"%deduped%" (
  for /f usebackq^ eol^=^%LF%%LF%^ delims^= %%A in ("%file%") do (
    set "ln=%%A"
    setlocal enableDelayedExpansion
    >"%line%" (echo !ln:\=\\!)
    >nul findstr /xlg:"%line%" "%deduped%" || (echo !ln!)
    endlocal
  )
)
>nul move /y "%deduped%" "%file%"
2>nul del "%line%"


EDIT

Both solutions above strip blank lines. I didn't think blank lines were worth preserving when talking about distinct values.

I've modified both solutions to disable the FOR /F "EOL" option so that all non-blank lines are preserved, regardless of what the 1st character is. The modified code sets the EOL option to a linefeed character.
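
To illustrate why the EOL option matters, here is a small demonstration sketch. By default FOR /F treats a semicolon as the EOL character, so any line beginning with ";" is silently skipped. The temp file name eol_demo.txt is made up for this illustration.

```batch
@echo off
setlocal
rem Hypothetical scratch file, used only for this demonstration
set "demo=%temp%\eol_demo.txt"
>"%demo%" (
  echo ;first
  echo second
)
rem With the default EOL character (;) the line ";first" is skipped
for /f "usebackq delims=" %%A in ("%demo%") do echo default EOL: %%A
del "%demo%"
```

With the default settings only the line "second" should be echoed, which is why both solutions above override EOL.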


New solution 2016-04-13: JSORT.BAT

You can use my JSORT.BAT hybrid JScript/batch utility to efficiently sort and remove duplicate lines with a simple one liner (plus a MOVE to overwrite the original file with the final result). JSORT is pure script that runs natively on any Windows machine from XP onward.

@jsort file.txt /u >file.txt.new
@move /y file.txt.new file.txt >nul
dbenham
  • Ran into findstr search string is too long. – Dreaded semicolon Mar 28 '16 at 07:47
  • @Dreadedsemicolon - Yes, I didn't think to mention that the 2nd option fails if any lines exceed length 511 (127 on XP) due to FINDSTR limits. – dbenham Mar 28 '16 at 12:49
  • @dbenham Does it change the order of the lines in the output file? I want the order to be the same as in the input file. Can this be forced with some parameter? – Vicky Dev Aug 02 '22 at 00:32
  • @VickyDev That info is already in the answer - only the 2nd option preserves order. And no, the other 2 cannot be modified to preserve the order because they only work with sorted lines. – dbenham Aug 03 '22 at 02:27
9

You may use uniq (http://en.wikipedia.org/wiki/Uniq) from UnxUtils (http://sourceforge.net/projects/unxutils/). Note that uniq only removes adjacent duplicate lines, so the input needs to be sorted first.

PA.
5

Some time ago I found an unexpectedly simple solution, but unfortunately it only works on Windows 10: the sort command features some undocumented options that can be used:

  • /UNIQ[UE] to output only unique lines;
  • /C[ASE_SENSITIVE] to sort case-sensitively;

So use the following line of code to remove duplicate lines (remove /C to do that in a case-insensitive manner):

sort /C /UNIQUE "incoming.txt" /O "outgoing.txt"

This removes duplicate lines from the text in incoming.txt and writes the result to outgoing.txt. Note that the original order is of course not preserved (because, well, sorting is the main purpose of sort).

However, you should use these options with care, as there might be some (un)known issues with them; there is possibly a good reason why they are not documented (so far).
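
To overwrite the original file instead of writing a second one, the command could be wrapped roughly as follows. This is only a sketch that relies on the same undocumented options; taking the file name from the first argument and the ".tmp" suffix are assumptions made for illustration.

```batch
@echo off
rem Sketch only: relies on the undocumented /UNIQUE option described above.
rem "%~1" (the file to clean) and the ".tmp" suffix are illustrative assumptions.
if "%~1"=="" exit /b 1
rem Write the de-duplicated lines to a temporary file first
sort /C /UNIQUE "%~1" /O "%~1.tmp" || exit /b 1
rem Replace the original only after sort has finished
move /y "%~1.tmp" "%~1" >nul
```

Writing to a temporary file first keeps the original untouched if sort fails.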

aschipfl
4

The Batch file below does what you want, provided the file is already sorted, because it only compares each line with the previous one:

@echo off
setlocal EnableDelayedExpansion
set "prevLine="
for /F "delims=" %%a in (theFile.txt) do (
   if "%%a" neq "!prevLine!" (
      echo %%a
      set "prevLine=%%a"
   )
)

If you need a more efficient method, try this Batch-JScript hybrid script, which is written as a filter, that is, it works like the Unix uniq program. Save it with a .bat extension, for example uniq.bat:

@if (@CodeSection == @Batch) @then

@CScript //nologo //E:JScript "%~F0" & goto :EOF

@end

var line, prevLine = "";
while ( ! WScript.Stdin.AtEndOfStream ) {
   line = WScript.Stdin.ReadLine();
   if ( line != prevLine ) {
      WScript.Stdout.WriteLine(line);
      prevLine = line;
   }
}

Both programs were copied from this post.
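
Because the filter reads standard input and only drops a line when it equals the previous one, it would typically be fed sorted input. A usage sketch, assuming the script was saved as uniq.bat in the current directory and that theFile.unique.txt (a made-up name) is an acceptable output file:

```batch
@echo off
rem Assumes uniq.bat (the hybrid script above) is in the current directory.
rem theFile.unique.txt is a hypothetical output name.
sort theFile.txt | uniq.bat > theFile.unique.txt
```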

Aacini
3
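set "file=%CD%\%1"
sort "%file%">"%file%.sorted"
del /q "%file%"
rem Enable delayed expansion once, outside the loop, because SETLOCAL nests at most 32 levels deep
SETLOCAL EnableDelayedExpansion
set "ln="
FOR /F "usebackq tokens=*" %%A IN ("%file%.sorted") DO (
rem Quotes keep the comparison safe for lines that contain spaces
if not "%%A"=="!ln!" (
set "ln=%%A"
>>"%file%" echo(%%A
)
)
ENDLOCAL
del /q "%file%.sorted"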

This should work the same way. The dbenham example seemed way too hardcore for me, so I wrote and tested my own solution. Usage example: filedup.cmd filename.ext

Mark
genetix
  • Just an FYI: The first `set` statement won't always work. I've seen the %CD% fail and/or get overwritten many times! You should use this instead `set "file=%~dpnx1"`. The letters in the %1 are defined as: d=drive, p=path, n=filename (without extension), x=extension. This works for the first argument even when you only pass in the filename (without path). – kodybrown Sep 24 '13 at 19:17
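
The modifiers mentioned in the comment above can be inspected with a short script. This is an illustrative sketch; demo.cmd is a hypothetical script name, and it would be run as demo.cmd some\file.txt.

```batch
@echo off
rem Prints the parts of the first argument, expanded by the ~ modifiers
echo drive     : %~d1
echo path      : %~p1
echo name      : %~n1
echo extension : %~x1
echo full path : %~dpnx1
```

The combined form %~dpnx1 resolves a bare file name against the current directory, which is why it can stand in for the %CD%\%1 concatenation.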
2

Pure batch - 3 effective lines.

@ECHO OFF
SETLOCAL
:: remove variables starting $
FOR  /F "delims==" %%a In ('set $ 2^>Nul') DO SET "%%a="

FOR /f "delims=" %%a IN (q34223624.txt) DO SET $%%a=Y
(FOR  /F "delims=$=" %%a In ('set $ 2^>Nul') DO ECHO %%a)>u:\resultfile.txt

GOTO :EOF

Works happily if the data does not contain characters to which batch has a sensitivity.

"q34223624.txt" because question 34223624 contained this data

1.1.1.1
1.1.1.1
1.1.1.1
1.2.1.2
1.2.1.2
1.2.1.2
1.3.1.3
1.3.1.3
1.3.1.3

on which it works perfectly.

Magoo
1

I came across this issue and had to resolve it myself because the use case was particular to my needs. I needed to find duplicate URLs, and the order of lines was relevant, so it needed to be preserved. The lines of text must not contain any double quotes and must not be very long, and sorting cannot be used.

Thus I did this:

setlocal enabledelayedexpansion
type nul>unique.txt
for /F "tokens=*" %%i in (list.txt) do (
    find "%%i" unique.txt 1>nul
    if !errorlevel! NEQ 0 (
        echo %%i>>unique.txt
    )
)

Auxiliary: if the text does contain double quotes then the FIND needs to use a filtered set variable as described in this post: Escape double quotes in parameter

So instead of:

find "%%i" unique.txt 1>nul

it would be more like:

set test=%%i
set test=!test:"=""!
find "!test!" unique.txt 1>nul

Thus find will look like find """what""" file and %%i will be unchanged.

JasonXA
1

I have used a fake "array" to accomplish this

@echo off
:: filter out all duplicate ip addresses
REM your file takes the place of the first argument
if "%~1"=="" goto :EOF
set "file=%~1"
setlocal EnableDelayedExpansion
set size=0
set cond=false
for /F %%a IN ('type "%file%"') do (
    if [!size!]==[0] (
        rem the first value is always unique
        set cond=true
        set /a size+=1
        set "arr[!size!]=%%a"
    ) ELSE (
        call :inner
        if [!cond!]==[true] (
            set /a size+=1
            set "arr[!size!]=%%a"
        )
    )
)
:: destroys old output
break>"%file%"
for /L %%b in (1,1,!size!) do >>"%file%" echo(!arr[%%b]!
endlocal
goto :eof

:inner
rem cond ends up false as soon as the current value matches a stored one
for /L %%b in (1,1,!size!) do (
    if "%%a" neq "!arr[%%b]!" (set cond=true) ELSE (set cond=false&&goto :break)
)
:break

The use of a label for the inner loop is something specific to cmd.exe, and it is the only way I have been successful in nesting for loops within each other. Basically, this compares each new value (the first token of each line, split on the default delimiters) with the values already stored, and if there is no match the program adds the value to memory. When it is done, it destroys the target file's contents and replaces them with the unique strings.