1

I have a batch file that checks for duplicate lines in a TXT file (over one million lines, 13 MB). It takes over 2 hours to run. How can I speed it up? Thank you!

TXT file

11
22
33
44
.
.
.
44 (over one million lines)

Existing Batch

setlocal
set var1=*
sort original.txt>sort.txt
for /f %%a in ('type sort.txt') do (call :run %%a)
goto :end
:run
if %1==%var1% echo %1>>duplicate.txt
set var1=%1
goto :eof
:end
  • Use PowerShell? – Roger Lipscombe Mar 03 '17 at 09:06
  • @RogerLipscombe Or no CLI at all – NullDev Mar 03 '17 at 09:11
  • I have only tried running it as a BAT file. Could you show me the PowerShell code for that? – Alfred Suen Work Mar 03 '17 at 09:16
  • I am testing with this PowerShell code: `$lines = @(); Get-Content 1.txt | %{ if (($lines -eq $_).length -eq 0) {$lines = $lines + $_}}; $lines > done.txt`. It has been running for over 45 minutes and is not done yet. – Alfred Suen Work Mar 03 '17 at 10:07
  • `Get-Content .\example.txt | Group-Object | Where { $_.Count -ne 1 }` – Roger Lipscombe Mar 03 '17 at 13:09
  • Reported times of some solutions: Original code: 2 hours 40 minutes. [aschipfl's code](https://stackoverflow.com/a/42575264/778560): 12 hours. [Magoo's code](https://stackoverflow.com/a/42575261/778560): 2 hours. [PowerShell solution](https://stackoverflow.com/questions/42574625/windows-batch-for-loop-improvement#comment72284862_42574625): 45+ minutes. [Aacini's code](https://stackoverflow.com/a/42576258/778560): 1 minute... – Aacini Feb 01 '19 at 12:57
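
To compare timings like the ones reported above on the same data file, each script can be wrapped with timestamps. This is only a minimal sketch: the wrapper name `timeit.bat` and the script name passed to it are hypothetical, and the format of `%time%` depends on the regional settings.

```batch
@echo off
rem Hypothetical wrapper, saved for example as timeit.bat
rem The script to measure is passed as the first argument.
echo Start: %time%
call "%~1"
echo End:   %time%
```

Called as `timeit.bat dedupe.bat` (where `dedupe.bat` stands for whichever solution is being tested), it prints the wall-clock time before and after the run.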

4 Answers

2

This should be the fastest method using a Batch file:

@echo off
setlocal EnableDelayedExpansion

set var1=*
sort original.txt>sort.txt
(for /f %%a in (sort.txt) do (
   if "%%a" == "!var1!" (
      echo %%a
   ) else (
      set "var1=%%a"
   )
)) >duplicate.txt
Aacini
  • Since `sort` works case-insensitively, there might be some duplicates not detected: imagine three lines `duplicate`, `Duplicate`, `duplicate`; your script is not going to report duplicates, unless you add `/I` to your `if` query; if the OP wants a case-sensitive approach, `sort` will not help... _(this is not a [revenge comment](http://stackoverflow.com/questions/42574625/windows-batch-for-loop-improvement/42575264#comment72285883_42575264) ;-))_ – aschipfl Mar 03 '17 at 11:14
  • @aschipfl: I suppose you are right, although the original code does not have the `/I` switch and the example data are just numbers... Only the OP can clarify this point. And talking about _revenges_, I invite you to review [my new solution](http://stackoverflow.com/a/42578073/778560)! **`;)`** – Aacini Mar 03 '17 at 11:52
  • Just to clear this point: are you saying that your original method took over 2 hr, the PowerShell method took over 45 mins, and my solution took 1 min? Using _the same_ data file? **`:)`** I'll appreciate it if you post here the times in `HH:MM:SS` format of _all_ methods posted here that you have tested... – Aacini Mar 05 '17 at 15:23
  • Ah! And please do an additional test, replacing this line: `sort original.txt>sort.txt` with this one: `sort original.txt /O sort.txt` – Aacini Mar 05 '17 at 15:26
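
The two suggestions from the comments above (`/I` on the comparison and `/O` on `sort`) could be combined roughly as follows. This is a sketch under the same assumptions as the answer (file names `original.txt`, `sort.txt` and `duplicate.txt`, one token per line, no `!` characters in the data), and it has not been timed against the million-line file.

```batch
@echo off
setlocal EnableDelayedExpansion

set "var1=*"
rem Let sort write the output file itself instead of using cmd redirection.
sort original.txt /O sort.txt
(for /f %%a in (sort.txt) do (
   rem Compare case-insensitively, as sort itself orders lines that way.
   if /I "%%a" == "!var1!" (
      echo %%a
   ) else (
      set "var1=%%a"
   )
)) >duplicate.txt
```

With `/I`, lines that differ only in letter case are reported as duplicates of each other; if a case-sensitive check is wanted, this variant does not apply.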
2

This method uses the findstr command as in aschipfl's answer, but here each line and its duplicates are removed from the file once findstr has processed them. It could be faster if the file contains many duplicates; otherwise it will be slower because of the large volume of data handled on each pass. Only a test can confirm this...

@echo off
setlocal EnableDelayedExpansion

del duplicate.txt 2>NUL
copy /Y original.txt input.txt > NUL

:nextTurn
for %%a in (input.txt) do if %%~Za equ 0 goto end

< input.txt (
   set /P "line="
   findstr /X /C:"!line!"
   find /V "!line!" > output.txt
) >> duplicate.txt

move /Y output.txt input.txt > NUL
goto nextTurn

:end
Aacini
  • Although I am not sure whether `find /V "!line!"` should be replaced by `findstr /V /X /C:"!line!"`, I like this method because it does not loop through the text file line by line; +1... – aschipfl Mar 03 '17 at 13:27
  • @aschipfl: The `findstr` command gets the duplicates and outputs them to `duplicate.txt`. The `find` command deletes the duplicates and stores the remaining lines in `output.txt`. Further details [here](http://stackoverflow.com/questions/8844868/what-are-the-undocumented-features-and-limitations-of-the-windows-findstr-comman/28278628#28278628) – Aacini Mar 03 '17 at 15:41
  • Since you want to handle whole lines, `findstr /X` is needed; `find` also matches in case the search string is found in the middle of a line... – aschipfl Mar 06 '17 at 08:05
0
@echo off
setlocal enabledelayedexpansion
set var1=*
(
for /f %%a in ('sort q42574625.txt') do (
 if "%%a"=="!var1!" echo %%a
 set "var1=%%a"
)
)>"u:\q42574625_2.txt"

GOTO :EOF

This may be faster - I don't have your file to test against

I used a file named q42574625.txt containing some dummy data for my testing.

It's not clear whether you want only one instance of a duplicate line or not. Your code would produce 5 "duplicate" lines if there were 6 identical lines in the source file.

Here's a version which will report each duplicated line only once:

@echo off
setlocal enabledelayedexpansion
set var1=*
set var2=*
(
for /f %%a in ('sort q42574625.txt') do (
 if "%%a"=="!var1!" IF "!var2!" neq "%%a" echo %%a&SET "var2=%%a"
 set "var1=%%a"
)
)>"u:\q42574625_2.txt"

GOTO :EOF
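
To see the difference between the two versions on a small input, the sketch below builds a hypothetical `demo.txt` with six identical lines plus one unique line and counts what each rule would report. The file name and the counter variables `every` and `once` are made up for illustration.

```batch
@echo off
setlocal EnableDelayedExpansion

rem Hypothetical demo file: six identical lines plus one unique line.
(for /L %%i in (1,1,6) do echo 44) > demo.txt
>> demo.txt echo 55

rem every counts each repeated line, once counts each duplicated value a single time.
set /a every=0, once=0
set "var1=*"
set "var2=*"
for /f %%a in ('sort demo.txt') do (
   if "%%a"=="!var1!" (
      set /a every+=1
      if "!var2!" neq "%%a" set /a once+=1 & set "var2=%%a"
   )
   set "var1=%%a"
)
echo every repeat: !every!   once per value: !once!
```

For this input the first counter should end at 5 and the second at 1, matching the remark about six identical lines.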
Magoo
0

Supposing you provide the text file as the first command line argument, you could try the following:

@echo off
for /F "usebackq delims=" %%L in ("%~1") do (
    for /F "delims=" %%K in ('
        findstr /X /C:"%%L" "%~1" ^| find /C /V ""
    ') do (
        if %%K GTR 1 echo %%L
    )
)

This returns all duplicate lines, but multiple times each, namely as often as each occurs in the file.

aschipfl
  • Thank you for your code. I am trying it and will report back once it is done! – Alfred Suen Work Mar 03 '17 at 10:03
  • I am pretty sure that this method will be _slower_ than the original. You are running _three copies_ of cmd.exe (one for the nested `for /F` command and one more for each side of the pipe) plus `findstr.exe` (that processes the _entire_ file) plus `find.exe`, for _each line_ of the file! – Aacini Mar 03 '17 at 10:32
  • @Aacini, yes, you might be right, I guess. I did not test it, but my thought was that the `findstr` command might be faster than `for /F` containing `if` comparisons and sub-routine `call`s. – aschipfl Mar 03 '17 at 11:07