I'm looking for a script or program that can delete specific lines from a set of text files (input.001.log.....input.log.1900). There are around 2,000 files of about 50 MB each. Every line contains a single string. I want to delete every line that contains a doubled character ("aa", "bb", and so on), more than 5 digits, any special character other than @, # and &, or more than 2 special characters (for example, the line a@bcd#38s# needs to be deleted).

Note that I don't have any programming skills, just a little experience with batch scripting.

So far, I'm using this code:

@ECHO OFF 
SETLOCAL 
FOR %%i IN (input.txt) DO ( 
 TYPE "%%i"|FINDstr /l /v "aa bb cc dd ff gg hh ii jj kk ll mm nn pp qq rr ss tt uu vv xx yy zz" >"input_1.txt" 
) 
GOTO :EOF
Magoo
  • Possible duplicate of [Batch File to Delete File](https://stackoverflow.com/questions/43013802/batch-file-to-delete-file) – sayan Dec 17 '17 at 13:59
  • It's going to be less "removing unwanted lines" and more "copying everything to a temporary file except for the unwanted lines." Chain some `findstr`s together and use the `/v` flag to say "not that". – SomethingDark Dec 17 '17 at 14:00
  • Not a duplicate of that at all. – SomethingDark Dec 17 '17 at 14:00
  • Is there any shortcut for duplicate characters like "aa" "bb", or do I need to define them all? – Alin Draghici Dec 17 '17 at 14:36
  • So far I did this, but my skills end here :(. I'm not sure how to handle the "more than 5 numbers" rule. – Alin Draghici Dec 17 '17 at 14:43
  • @ECHO OFF SETLOCAL FOR %%i IN (input.txt) DO ( TYPE "%%i"|FINDstr /l /v "aa bb cc dd ff gg hh ii jj kk ll mm nn pp qq rr ss tt uu vv xx yy zz" >"input_1.txt" ) GOTO :EOF – Alin Draghici Dec 17 '17 at 14:45

1 Answer

This would be easy if batch had a decent regular expression utility, but FINDSTR is extremely limited and buggy. Even so, FINDSTR can solve this particular problem fairly efficiently and without too much difficulty.

You aren't very clear as to what you mean by "special character". My interpretation is you only want to accept alpha characters a-z and A-Z, digits 0-9, and special characters @, #, and &. I can only guess that you are building a dictionary of potential passwords.
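To make that interpretation concrete, here is a minimal Python sketch of the rules as I read them. Python is not part of the batch solution; it only serves as a readable model. The function name `keep_line` is hypothetical, and the case-insensitive comparison is an assumption that mirrors the `/i` switch used with FINDSTR.

```python
import string

ALPHA = set(string.ascii_lowercase)   # a-z (input is lowercased first)
DIGITS = set(string.digits)           # 0-9
SYMBOLS = set("@#&")                  # the only special characters accepted


def keep_line(line: str) -> bool:
    """Return True if the line survives all four rules (hypothetical helper)."""
    text = line.rstrip("\r\n").lower()
    allowed = ALPHA | DIGITS | SYMBOLS
    # Rule 1: any character outside the accepted set rejects the line.
    if any(ch not in allowed for ch in text):
        return False
    # Rule 2: two identical adjacent characters ("aa", "11", "@@") reject it.
    if any(a == b for a, b in zip(text, text[1:])):
        return False
    # Rule 3: more than 5 digits anywhere in the line rejects it.
    if sum(ch in DIGITS for ch in text) > 5:
        return False
    # Rule 4: more than 2 accepted symbols anywhere in the line rejects it.
    if sum(ch in SYMBOLS for ch in text) > 2:
        return False
    return True


if __name__ == "__main__":
    for sample in ["abc123@x", "a@bcd#38s#", "aab", "a1b2c3d4e5f6", "abc!"]:
        print(sample, keep_line(sample))
```

Only the first sample is kept; the others each break exactly one of the rules.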

I find this problem easier if you build environment variables that represent various classes of characters, as well as various logical expressions, and then use the variables within your search string.

I recommend you write your modified files to a new folder.

@echo off
setlocal

set "alpha=abcdefghijklmnopqrstuvwxyz"
set "num=0123456789"
set "sym=@#&"

set "dups=aa bb cc dd ee ff gg hh ii jj kk ll mm nn oo pp qq rr ss tt uu vv ww xx yy zz 00 11 22 33 44 55 66 77 88 99 @@ ## &&"
set "bad=[^%alpha%%num%%sym%]"
set "num6=[%num%][^%num%]*[%num%][^%num%]*[%num%][^%num%]*[%num%][^%num%]*[%num%][^%num%]*[%num%]"
set "sym3=[%sym%][^%sym%]*[%sym%][^%sym%]*[%sym%]"

set "source=c:\your\source\folder"
set "destination=c:\your\destination\folder"

for %%F in ("%source%\*.txt") do findstr /riv "%dups% %bad% %num6% %sym3%" "%%F" >"%destination%\%%~nxF"
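To see what the search string expands to, the following Python sketch rebuilds it from the same character classes and applies it the way I understand `findstr /riv` to work: every space-separated token is its own regular expression, and a line is dropped as soon as any of them matches. This is a model under those assumptions, not a replacement for the batch script. `filter_lines` is a hypothetical name, and Python's `re` module is assumed to behave like FINDSTR for these simple patterns.

```python
import re

alpha = "abcdefghijklmnopqrstuvwxyz"
num = "0123456789"
sym = "@#&"

# Same expressions as the batch variables, built programmatically.
dups = " ".join(c + c for c in alpha + num + sym)       # aa bb ... 99 @@ ## &&
bad = f"[^{alpha}{num}{sym}]"                            # any character not accepted
num6 = f"[^{num}]*".join([f"[{num}]"] * 6)               # six digits anywhere
sym3 = f"[^{sym}]*".join([f"[{sym}]"] * 3)               # three symbols anywhere

# Assumption: each space-separated token is an independent regex, and the
# /i switch corresponds to re.IGNORECASE.
search = f"{dups} {bad} {num6} {sym3}"
patterns = [re.compile(token, re.IGNORECASE) for token in search.split(" ")]


def filter_lines(lines):
    """Keep only lines that match none of the patterns (like findstr /v)."""
    kept = []
    for line in lines:
        text = line.rstrip("\r\n")  # line endings must not count as bad characters
        if not any(p.search(text) for p in patterns):
            kept.append(text)
    return kept


if __name__ == "__main__":
    samples = ["abc123@x", "a@bcd#38s#", "hello", "a1b2c3d4e5f6", "bad!char"]
    print(filter_lines(samples))
```

Of the five samples, only `abc123@x` survives: the others contain three symbols, a doubled letter, six digits, and a disallowed character respectively.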

Edit in response to Magoo's comment

The solution must be modified a bit if you are running on Windows XP, as that has a regular expression length limit of 127 bytes, and the %num6% expression exceeds that limit.
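The lengths are easy to check by expanding the variables. The Python sketch below rebuilds the three longer expressions from the same character classes and compares each one with the 127-byte XP limit; the constant name `XP_LIMIT` is only illustrative.

```python
alpha = "abcdefghijklmnopqrstuvwxyz"
num = "0123456789"
sym = "@#&"

# Expanded forms of the batch variables %bad%, %num6% and %sym3%.
expressions = {
    "bad": f"[^{alpha}{num}{sym}]",
    "num6": f"[^{num}]*".join([f"[{num}]"] * 6),
    "sym3": f"[^{sym}]*".join([f"[{sym}]"] * 3),
}

XP_LIMIT = 127  # per-regex limit on Windows XP, as stated above

lengths = {name: len(expr) for name, expr in expressions.items()}
too_long_for_xp = [name for name, n in lengths.items() if n > XP_LIMIT]

print(lengths)          # {'bad': 42, 'num6': 142, 'sym3': 29}
print(too_long_for_xp)  # ['num6']
```

Only `num6` is over the limit, so it is the only expression that needs a shorter form.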

The solution should work on XP if you change num6 to

set "num6=[%num%].*[%num%].*[%num%].*[%num%].*[%num%].*[%num%]"

That search logically gives the same result, but it is significantly less efficient because it may require excessive backtracking during the matching process.
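As a quick cross-check of the "same result" claim, this Python sketch compares the two forms on every string of up to 8 characters drawn from one digit and one letter. It assumes Python's `re` treats these simple patterns the way FINDSTR does, and it says nothing about FINDSTR's actual performance.

```python
import itertools
import re

num = "0123456789"

# Original form: digits separated by runs of non-digits.
num6_original = f"[^{num}]*".join([f"[{num}]"] * 6)
# Shorter XP form: digits separated by ".*".
num6_xp = ".*".join([f"[{num}]"] * 6)

original_re = re.compile(num6_original)
xp_re = re.compile(num6_xp)

# Exhaustively compare both patterns on all strings over {"1", "a"}
# with length 0 through 8 (511 strings in total).
mismatches = []
for length in range(9):
    for chars in itertools.product("1a", repeat=length):
        line = "".join(chars)
        if bool(original_re.search(line)) != bool(xp_re.search(line)):
            mismatches.append(line)

print(len(num6_xp), mismatches)  # 82 []
```

Both patterns match exactly the strings that contain at least six digits, and the shorter form is 82 characters long, well under 127.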

Magoo
dbenham
  • Sorry- this didn't work for me, generating an out-of-memory error as (I believe) the search string would be too long. I'd recommend using `set "alpha=a-z" set "num=0-9"` to reduce the length or putting the required strings into a file, then using the `/g:` option. – Magoo Dec 17 '17 at 21:34
  • @Magoo - I assume you tested on XP, as that is the only situation that I can think of that should give the problem you report. Each regex is limited to 254 bytes on Vista and above, but the search string is treated as 4 different regular expressions, so there should be no problem there. But on XP the limit is 127 characters, and the `%num6%` search has length 142. You cannot use `[a-z]` or `[0-9]`, as [that does not give the expected result](https://stackoverflow.com/a/8767815/10120530). – dbenham Dec 18 '17 at 03:06
  • The `num6` search can be changed to `[%num%].*[%num%].*[%num%].*[%num%].*[%num%].*[%num%]`, which is well within the XP limit of 127. It should give the correct result, but it is significantly less efficient due to excessive backtracking. – dbenham Dec 18 '17 at 03:08
  • @Magoo - I forgot to mention that the original solution works fine for me on Windows 10. – dbenham Dec 18 '17 at 03:26
  • Hmm - well, I'm using Win10, but I've tried it again and there was no out-of-memory problem generated. No idea why my original test failed. Only thing I had to change was the variable names for the directories (which I'll edit...) – Magoo Dec 18 '17 at 07:13