
I'm trying to use findstr in place of grep on a bare-bones vanilla Windows box (which is sadly a requirement). I have some relatively large files (1 GB+), and I would like to extract the lines that don't contain MX, MXnn, BR, or BRnn delimited by tabs. If I were writing a 'real' regex, then

\t(MX|BR)(..)?\t

would cover it. I don't mind doing it in two stages, but I can't for the life of me seem to include the delimiter tabs.

So far I have:

findstr /V MX source.txt >> temp.txt
findstr /V BR temp.txt >> dest.txt

which due to the nature of the data does an ok-ish job, but I would really rather use something like:

findstr /R /V "\t(MX|BR)(..)?\t" source.txt >> dest.txt

I've tried double slashes, escape sequences, etc., but I seem to be running around in circles.

I'm loath to resort to VBScript if I can help it.

Any ideas, given the limitations of vanilla Windows?


EDIT

I've looked into generating an exclusion file for the /G option, but generating it might become problematic once the users cotton on to the possibilities; a regex would just be a lot easier.

Dycey

2 Answers


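As far as I can see, findstr has no syntax for specifying a horizontal tab directly. Its regex support is pretty basic: there is no \s, \t, \d and the like :-), and no grouping with ( ), no alternation with | and no ? either. However, you can use an input file to specify your search patterns, and inside this file you can use tabs literally. Because grouping and alternation are not available, the example from your original post "\t(MX|BR)(..)?\t" has to be split into one pattern per line, which findstr combines with OR:

<TAB>MX<TAB>
<TAB>MX..<TAB>
<TAB>BR<TAB>
<TAB>BR..<TAB>

Here each <TAB> stands for one real tab character typed and saved in the file, not the literal text. Then you would use findstr with something like the following (add /V, as in your question, to keep only the lines that do not match):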

findstr /R /G:patternFileWithTabs.txt sourceFile.txt
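If typing the tabs by hand is awkward, the pattern file can also be written by a batch file. This is only a sketch under a few assumptions: it has to run as a batch file (it uses %~dp0 and %~nx0), patterns.txt is a made-up name, and the tab is obtained through the 0xHH hex notation that forfiles /? documents (0x09 for tab). findstr has no alternation or ?, so there is one pattern per line:

```batch
@echo off
setlocal

rem Capture one real tab character in TAB. forfiles turns 0x09 into a tab;
rem the file mask is this script itself, so the echo runs exactly once.
for /f "delims=" %%T in ('forfiles /p "%~dp0." /m "%~nx0" /c "cmd /c echo(0x09"') do set "TAB=%%T"

rem Write one regex per line; findstr ORs the lines of a /G file.
rem patterns.txt is a hypothetical file name.
> patterns.txt echo(%TAB%MX%TAB%
>>patterns.txt echo(%TAB%MX..%TAB%
>>patterns.txt echo(%TAB%BR%TAB%
>>patterns.txt echo(%TAB%BR..%TAB%

rem /V keeps only the lines that match none of the patterns.
findstr /V /R /G:patterns.txt source.txt > dest.txt
```

The redirection is placed in front of each echo so that no stray space ends up after the trailing tab.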

Also, you can get the job done most of the time by specifying an exclusion pattern instead. If you exclude all alphanumeric characters, common separators and other whitespace characters, the only thing likely to be left is a tab. For example, I was searching for a sequence that in a standard regex would be:

"\t\tUnknown\t\t\t\t0\t"

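In my use case I could grep it with findstr like this (the ^ negates each character class, and /C: keeps the space inside the class from being read as a separator between search strings):

findstr /R /C:"[^ a-z0-9][^ a-z0-9]Unknown[^ a-z0-9]*0[^ a-z0-9]" logfile.txt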

Of course, it depends on the actual data you have. In theory the pattern would also match other strings, but those don't occur in my source file, so it works. Most of the time you don't need a 100% bulletproof pattern.

lidqy

A possible solution from the command line or in a batch file is using:

%SystemRoot%\System32\findstr.exe /V /R /C:"\<BR[0-9]*\>" /C:"\<MX[0-9]*\>" "source.txt"

Because of /V, findstr outputs only the lines of source.txt that do not match. Because of /R, the two search strings \<BR[0-9]*\> and \<MX[0-9]*\> are interpreted as regular expressions, and FINDSTR combines them with a logical OR. Each one matches, case-sensitively, BR or MX followed by 0 or more digits as an entire word, because of \< (beginning of word) and \> (end of word).

This might already be enough to filter source.txt correctly. But it also filters out lines in which BR[0-9]* or MX[0-9]* is surrounded by word-delimiting characters other than horizontal tabs.

It is possible to use in a batch file:

%SystemRoot%\System32\findstr.exe /V /R /C:"[   ]BR[0-9]*[  ]" /C:"[    ]MX[0-9]*[  ]" "source.txt"

ATTENTION: In the batch file there must be exactly 1 horizontal tab character inside each of the 4 pairs of square brackets. Browsers display those 4 tab characters as 1 or more spaces according to the HTML specification.
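A variant that avoids literal tabs in the batch file, so it survives copy and paste from a browser, is to put the tab into an environment variable first. This is a sketch, not tested against your data: it must run as a batch file, and it relies on the 0xHH hex notation described in forfiles /? (0x09 for tab). The variable name TAB is arbitrary:

```batch
@echo off
setlocal

rem Capture one real tab character in TAB. The file mask is this script
rem itself, so forfiles runs the echo exactly once.
for /f "delims=" %%T in ('forfiles /p "%~dp0." /m "%~nx0" /c "cmd /c echo(0x09"') do set "TAB=%%T"

rem Same filter as above, with %TAB% expanding to the tab inside each bracket pair.
%SystemRoot%\System32\findstr.exe /V /R /C:"[%TAB%]BR[0-9]*[%TAB%]" /C:"[%TAB%]MX[0-9]*[%TAB%]" "source.txt"
```

Append > "dest.txt" to the last line to write the result to a file instead of the console.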

Open a command prompt window and run findstr /? for more information about FINDSTR.

And perhaps read also the Stack Overflow article

What are the undocumented features and limitations of the Windows FINDSTR command?

Mofi