Not sure if this is stack or code review, as I'm open to completely different approaches to the problem and, though I've started with PowerShell, am not wedded to a particular language or style.
I'm currently working with a web server on which we aren't authorised to access the back end.
It returns a list of generated certificates based on a left-justified filter, e.g. if you type 100 in the search box and click submit it will search for all certificates beginning with 100*, or the range 10000000 - 10099999
All of our certs are eight digit numbers giving a sample space of 00000000-99999999. I'm attempting to find which certificates, in this sample space, actually exist, given the certificate names must be unique.
The major caveat is that the server will only return the first 100 results, if your query returns more than that many results due to there being more than 100 extant certificates in that range, the extras are discarded.
My first approach was to just use wget (technically PowerShell's Invoke-WebRequest) and iterate through the range of queries 000000 to 999999 (100 at a time), which was working & I was on track for a mid-September finish.
Unfortunately there are people that want this data sooner, so I've had to write a recursive function that (with my default input) queries a ten million-certs large sample space at once and searches a progressively smaller space until < 99 certs are returned for each subspace, then moving onto the next ten million.
The data isn't evenly distributed or very predictable, 'most' (~90%?) certs cluster around 10000000-19999999 and 30000000-39999999 but I need them all.
Here's the function I'm currently using, it seems to be working (results are being written to file, and faster than before), but it is still ongoing. Are there any:
- Glaring errors with the function
- Better choices of inputs (for better efficiency)
- Completely different approaches that would be better
The variable '$certsession' is established outside this snippet and represents the web server session (login information, cookies etc.)
function RecurseCerts ($min,$max,$step,$level) {
for ($certSpace = $min; $certSpace -le $max; $certSpace += $step) {
$levelMultiplier = "0" * $level
#Assuming a level of 3, these ToString arguments would turn a '5' into 005, a '50' into 050, and so on. Three or more digit numbers are unchanged.
$query = ($certSpace).ToString($levelMultiplier)
$resultsArray = New-Object System.Collections.ArrayList
"Query is $query"
#Get webpage, split content by newline, search for lines with a certificate common name and add them to the results array
Invoke-WebRequest -uri "https://webserver.com/app?service=direct%2F1%2FSearchPage%2F%24Form&sp=S0&Form0=%24TextField%2C%24Submit&%24TextField=$query&%24Submit=Search" -websession $certsession | %{$_.content -split "`n" | %{if ($_ -match "cn=(.*?),ou") {$resultsArray = $resultsArray + $matches[1]}}}
#If we got more than 98 results for our query, make the search more specific, until we don't get more than 98 (else condition).
if ($resultsArray.count -gt 98) {"Recursing at $certSpace"; $subLevel = $level + 1; $subSpace = $certSpace * 10; RecurseCerts -min $subSpace -max ($subSpace + 9) -step 1 -level $subLevel}
#This is the most specific 0-98 for this range, write it out to the file
else {"Completed range $certspace"; $resultsArray | out-file c:\temp\certlist.txt -encoding utf8 -append}
}
}
#Level 3 means include rightmost 3 digits eg. search 101 for range 10100000 - 10199999
#Level 4 would be the subspace 1010-1019 (so a search for 1015 returns 10150000 - 10159999)
RecurseCerts -min 0 -max 9 -step 1 -level 1
Since I've added 'language agnostic', feel free to ask for any needed PowerShell clarifications. I could also attempt to re-write it in pseudo-code if desired.
I think the fact that ranges are already iterated should prevent duplication when it is done with a subspace and jumps back to the higher level (re-capturing things it already captured at a lower level should be prevented), but I'd be lying if I said I fully understood the program flow here.
If it turns out there is duplication I can just filter the text file for duplicates. However, I'd still be interested in approaches that eliminate this problem if it exists.
*I've updated the code to display an indicator of progress to the console, and based on suggestions also changed the array type used to arraylist. The server is pretty fragile so I've avoided multi-threading for now, but it would normally be a useful feature of tasks like this - here's a summary of some ways to do this in PowerShell.
Here's an example of the behaviour currently. Notably the entire ten-million range 00000000 - 09999999 had less than 98 certificates and was thus processed without needing a recursion.