
Let's say I have a huge text file which has rows in a format like

Id Name Address

All the records are sorted by Id. If I am searching for an Id, how can I make the search more efficient using findstr, or write something better than findstr?

Lance U. Matthews
sachin saner
  • Agreeing with @BACON's conclusions, I think you have to know [the findstr restrictions](https://stackoverflow.com/questions/8844868/what-are-the-undocumented-features-and-limitations-of-the-windows-findstr-comman); if you can live with those, I'd anchor the search either with /B or ^ and include the field separator. –  Jun 15 '18 at 07:29

1 Answer

As a native application, I would not be surprised if findstr has better search performance than almost anything one could implement in PowerShell code or even a compiled .NET module. The problem with findstr is that it is oblivious to the structure of your data. That is, if you search for the record with ID 123 it will happily return records with ID 1234 or address "123 Main Street" as false positives. You could potentially use the /B or /R switches to combat this, but that still doesn't help in the case where you search for an ID that doesn't exist; findstr only stops searching when it reaches the end of the file.
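
For example, anchoring the search to the beginning of the line and including the field separator cuts down on those false positives (a hedged illustration; the file name `records.txt` and the ID `123` are placeholders, and it assumes the ID is the first space-delimited field):

```
findstr /B /C:"123 " records.txt
```

Here `/B` matches the pattern only at the beginning of a line and `/C:"123 "` searches for the literal string "123 " (ID plus separator), so neither ID 1234 nor an address containing "123" will match. It still scans to the end of the file when the ID is absent, though.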

Your ability to perform an optimized search depends on the specific format of the text file. If lines are fixed-length, meaning you can instantly seek to the $nth line by simply calculating $n * $lineLength, then you could quickly search the file for an ID using a binary search.
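
As a rough sketch of what that could look like in PowerShell (the path, target ID, line length, ASCII encoding, and "Id Name Address" layout below are all assumptions for illustration, not part of the question):

```powershell
# Binary search over fixed-length records: seek straight to record n at
# offset n * $LineLength. All values below are placeholders.
param(
    [string]$Path = 'C:\data\records.txt',
    [long]$TargetId = 123,
    [int]$LineLength = 64        # bytes per line, including the trailing CRLF (assumed)
)

$stream = [System.IO.File]::OpenRead($Path)
try {
    $buffer = New-Object byte[] $LineLength
    $lo = [long]0
    $hi = [long][math]::Floor($stream.Length / $LineLength) - 1
    $found = $null

    while ($lo -le $hi) {
        $mid = [long][math]::Floor(($lo + $hi) / 2)
        $null = $stream.Seek($mid * $LineLength, [System.IO.SeekOrigin]::Begin)
        $null = $stream.Read($buffer, 0, $LineLength)
        $line = [System.Text.Encoding]::ASCII.GetString($buffer).TrimEnd()
        $id = [long]($line -split ' ', 2)[0]

        if     ($id -eq $TargetId) { $found = $line; break }
        elseif ($id -lt $TargetId) { $lo = $mid + 1 }
        else                       { $hi = $mid - 1 }
    }

    if ($found) { $found } else { "Id $TargetId not found." }
}
finally {
    $stream.Dispose()
}
```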

If lines are variable-length, then there's really no simple way to efficiently search the file other than line-by-line. Even if you've read enough of a line to know the ID doesn't match, you still need to read the rest of the line to know where the next line begins. At best, since the lines are sorted by ID you know that if you encounter a line with an ID greater than the one you're searching for you can abort the search immediately because that ID won't be found.
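
The line-by-line scan with that early abort is straightforward; a minimal PowerShell sketch, under the same assumptions (space-separated fields, ID first, file sorted ascending by ID, placeholder path and ID):

```powershell
# Sequential scan: stop as soon as the ID matches, or as soon as we see an ID
# greater than the target (the file is sorted, so it cannot appear later).
$targetId = 123                                   # placeholder
$reader = [System.IO.StreamReader]::new('C:\data\records.txt')
try {
    while ($null -ne ($line = $reader.ReadLine())) {
        $id = [long]($line -split ' ', 2)[0]
        if ($id -eq $targetId) { $line; break }   # found the record
        if ($id -gt $targetId) { break }          # passed it; abort early
    }
}
finally {
    $reader.Dispose()
}
```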

In the past I have been able to employ a binary search on text files with variable-length lines (fixed-size characters would be very helpful, too, if not required). The key is, for each iteration of the search, to calculate your next offset; if it happens to land on the beginning of a line, great; if not, seek backwards until you can identify the character that is the beginning of the line (e.g. the one preceded by a CrLf). Once you've positioned yourself at the start of a line, you can read the ID and determine whether it's a match or in which direction the next iteration of the search needs to look.
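
One way to sketch that in PowerShell (again assuming ASCII text, CRLF or LF line endings, the ID as the first space-delimited field, and placeholder path and ID; this is an illustration of the idea, not necessarily the fastest possible implementation):

```powershell
# Binary search over byte offsets in a file with variable-length lines.
# $lo is always kept at the start of a line; $hi is an exclusive byte bound.
param(
    [string]$Path = 'C:\data\records.txt',
    [long]$TargetId = 123
)

$stream = [System.IO.File]::OpenRead($Path)
try {
    # Reads the line starting at byte offset $start and returns its text plus
    # the offset just past its terminating newline.
    function Read-LineAt([long]$start) {
        $null = $stream.Seek($start, [System.IO.SeekOrigin]::Begin)
        $bytes = New-Object System.Collections.Generic.List[byte]
        while (($b = $stream.ReadByte()) -ne -1) {
            if ($b -eq 10) { break }                  # LF ends the line
            if ($b -ne 13) { $bytes.Add([byte]$b) }   # skip CR
        }
        [pscustomobject]@{
            Text    = [System.Text.Encoding]::ASCII.GetString($bytes.ToArray())
            NextPos = $stream.Position
        }
    }

    $lo = [long]0
    $hi = $stream.Length
    $result = $null

    while ($lo -lt $hi) {
        $mid = [long][math]::Floor(($lo + $hi) / 2)

        # Walk backwards from $mid until the previous byte is LF (or we hit $lo),
        # which identifies the start of the line containing $mid.
        $lineStart = $mid
        while ($lineStart -gt $lo) {
            $null = $stream.Seek($lineStart - 1, [System.IO.SeekOrigin]::Begin)
            if ($stream.ReadByte() -eq 10) { break }
            $lineStart--
        }

        $record = Read-LineAt $lineStart
        $id = [long]($record.Text -split ' ', 2)[0]

        if     ($id -eq $TargetId) { $result = $record.Text; break }
        elseif ($id -lt $TargetId) { $lo = $record.NextPos }   # next line start
        else                       { $hi = $lineStart }
    }

    if ($result) { $result } else { "Id $TargetId not found." }
}
finally {
    $stream.Dispose()
}
```

The backwards walk costs a handful of extra byte reads per iteration, but the overall number of iterations stays logarithmic in the file size, which is where the win over a full scan comes from.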

It's definitely not a quick and simple solution (to write), but, depending on how huge "huge" is, it could yield significant results when searching your file. Although, at that point, it might be better to invest your development time in changing to a more search-friendly way of storing your data, if at all possible.

Lance U. Matthews
  • Side note: `findstr "\<123\>"` allows for matching 123 without matching 1234, and regex matching (`/r`) is the default mode. – Ansgar Wiechers Jun 15 '18 at 08:14
  • Thanks for the detailed reply; binary search looks like the obvious solution to this. I tried some sample coding, but reading a 400 MB file to get the number of rows takes 4 seconds, while findstr will find the ID in the last row in about 3/4 of a second. I will experiment today to find out the best way to get the total number of rows in the file so that I can implement binary search. Also, thanks for mentioning the troubles of manipulating the file pointer; I will keep that in consideration. – sachin saner Jun 15 '18 at 15:48