
I'm not sure whether this belongs on Stack Overflow or Code Review, as I'm open to completely different approaches to the problem and, though I've started with PowerShell, I'm not wedded to a particular language or style.

I'm currently working with a web server whose back end we aren't authorised to access.

It returns a list of generated certificates based on a left-justified (prefix) filter: e.g. if you type 100 in the search box and click Submit, it will search for all certificates beginning with 100, i.e. the range 10000000 - 10099999.

All of our certs are eight-digit numbers, giving a sample space of 00000000-99999999. I'm attempting to find which certificates in this sample space actually exist, given that certificate names must be unique.

The major caveat is that the server will only return the first 100 results; if your query matches more than 100 extant certificates in that range, the extras are discarded.

My first approach was to just use wget (technically PowerShell's Invoke-WebRequest) and iterate through the range of queries 000000 to 999999 (100 certificates at a time), which was working, and I was on track for a mid-September finish.

Unfortunately, there are people who want this data sooner, so I've had to write a recursive function that (with my default input) queries a sample space of ten million certs at once, then searches progressively smaller subspaces until fewer than 99 certs are returned for each subspace, before moving on to the next ten million.

The data isn't evenly distributed or very predictable: 'most' (~90%?) of the certs cluster around 10000000-19999999 and 30000000-39999999, but I need them all.

Here's the function I'm currently using. It seems to be working (results are being written to file, and faster than before), but it is still running. Are there any:

  1. Glaring errors with the function
  2. Better choices of inputs (for better efficiency)
  3. Completely different approaches that would be better

The variable '$certsession' is established outside this snippet and represents the web server session (login information, cookies etc.)

function RecurseCerts ($min, $max, $step, $level) {
    for ($certSpace = $min; $certSpace -le $max; $certSpace += $step) {
        #Assuming a level of 3, this format string turns a '5' into '005' and a '50' into '050'; numbers of three or more digits are unchanged.
        $levelMultiplier = "0" * $level
        $query = $certSpace.ToString($levelMultiplier)
        $resultsArray = New-Object System.Collections.ArrayList
        "Query is $query"
        #Get the web page, split its content by newline, and collect every line containing a certificate common name
        Invoke-WebRequest -Uri "https://webserver.com/app?service=direct%2F1%2FSearchPage%2F%24Form&sp=S0&Form0=%24TextField%2C%24Submit&%24TextField=$query&%24Submit=Search" -WebSession $certsession |
            ForEach-Object { $_.Content -split "`n" } |
            ForEach-Object { if ($_ -match "cn=(.*?),ou") { [void]$resultsArray.Add($matches[1]) } }
        #If we got more than 98 results for our query, the server may have truncated the list, so make the search more specific until we don't (else condition).
        if ($resultsArray.Count -gt 98) {
            "Recursing at $certSpace"
            $subLevel = $level + 1
            $subSpace = $certSpace * 10
            RecurseCerts -min $subSpace -max ($subSpace + 9) -step 1 -level $subLevel
        }
        #This is the most specific 0-98 result set for this range; write it out to the file
        else {
            "Completed range $certSpace"
            $resultsArray | Out-File c:\temp\certlist.txt -Encoding utf8 -Append
        }
    }
}

#Level 3 means the query is three digits, e.g. a search for 101 covers the range 10100000 - 10199999
#Level 4 would be the subspace 1010-1019 (so a search for 1015 returns 10150000 - 10159999)
RecurseCerts -min 0 -max 9 -step 1 -level 1

Since I've added the 'language agnostic' tag, feel free to ask for any needed PowerShell clarifications. I could also attempt to rewrite it in pseudocode if desired.

I think the fact that the ranges are iterated in order should prevent duplication when the function finishes with a subspace and jumps back to the higher level (re-capturing things it already captured at a lower level shouldn't happen), but I'd be lying if I said I fully understood the program flow here.

If it turns out there is duplication, I can just filter the text file for duplicates. However, I'd still be interested in approaches that eliminate the problem if it exists.
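For reference, a minimal de-duplication pass over the output file (assuming the path used above; the '-unique' output filename is hypothetical) would be something like:

#Sort the collected names and drop duplicates into a second file
Get-Content c:\temp\certlist.txt | Sort-Object -Unique | Set-Content c:\temp\certlist-unique.txt -Encoding utf8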

I've updated the code to display a progress indicator on the console and, based on suggestions, changed the array type to ArrayList. The server is pretty fragile, so I've avoided multi-threading for now, but it would normally be a useful feature for tasks like this - here's a summary of some ways to do this in PowerShell.

Here's an example of the current behaviour. Notably, the entire ten-million range 00000000 - 09999999 contained fewer than 98 certificates and was thus processed without needing a recursion.

[Screenshot: RecurseCerts behaviour]

  • I've updated the array type to ArrayList based on your suggestion and noticed an immediate performance improvement. My old version actually completed in 3-4 hours yesterday with a total of 20K certificates. The server's on the LAN, but old and fragile - I can track memory usage from the web page, and it is coping with requests currently, but I don't want to chance increased load. A response page with 100 entries is 43 KB (the size is dynamic, based on the results returned). Is there a way to avoid parsing the DOM? Perhaps unix wget (I have gnuwin32)? Re ruling out ranges: as per the image at the end, it does. – Bruno Sep 09 '16 at 01:45
  • Also, if you want to make the notes on ArrayList, multithreading and DOM construction into an answer, I'd be happy to accept it. These will help in the general case, where potentially most ranges must be recursed; my success yesterday was mostly luck that the certificates were fairly clustered and a lot of ranges could be eliminated with no or minimal recursion. – Bruno Sep 09 '16 at 02:01
  • That was surprisingly quick; ok I've moved my comments into an answer, fleshed them out a bit and added a bit about avoiding the DOM parsing. – TessellatingHeckler Sep 10 '16 at 05:44

1 Answer


Moving my comments to an answer:

  • First suggestion: become authorised to access the back end.

  • The big opportunity for a performance increase is threading/splitting the work over multiple clients. Since it's just a big space of numbers, you could easily:

    • have two PowerShell processes running, searching 00000000-49999999 in one and 50000000-99999999 in the other (or as many processes as you want).
    • have two computers doing it, if you have others to access
    • use PowerShell multiprocessing (threading, jobs, workflows) although those are more complex to use.
    • Since the slowness is mostly going to be server/network bound, it's probably not worth the more difficult techniques, but a script that takes start/end numbers, run twice, would be quite easy (see the sketch below).
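As a rough sketch of the jobs approach, assuming the login and RecurseCerts call are factored into a hypothetical certscan.ps1 that takes start and end digits for the top-level range:

#Hypothetical: certscan.ps1 is assumed to establish $certsession and then
#call RecurseCerts -min $args[0] -max $args[1] -step 1 -level 1.
#Each job should write to its own output file to avoid clashes.
$jobs = @(
    Start-Job -FilePath .\certscan.ps1 -ArgumentList 0, 4   # 00000000-49999999
    Start-Job -FilePath .\certscan.ps1 -ArgumentList 5, 9   # 50000000-99999999
)
$jobs | Wait-Job | Receive-Job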
  • The code $resultsArray = $resultsArray + $matches[1] is very slow; arrays are immutable (fixed size), so this forces PowerShell to make a new array and copy the old contents into it every time. In a loop adding many thousands of things, that carries a lot of overhead. Use $a = [System.Collections.ArrayList]@() and $a.Add($thing) instead; for example:
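A minimal before/after comparison:

#Slow: += on an array allocates a new array and copies every element each time
$a = @()
1..10000 | ForEach-Object { $a += $_ }

#Fast: ArrayList grows in place; [void] discards the index that Add() returns
$a = [System.Collections.ArrayList]@()
1..10000 | ForEach-Object { [void]$a.Add($_) }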

  • How fast can the server respond (is it on the LAN or over the Internet)? If it's over a WAN connection, latency puts a limit on how fast you can go; and if searching a big database takes the server a while per page, that puts a bigger limit on what you can speed up from the client side.

  • How big is the response page? Invoke-WebRequest parses the HTML into a full DOM, which is very slow, and since you're not using the DOM you don't need that. You can use [System.Net.WebClient] to download the content as a string:

e.g.

$web = New-Object System.Net.WebClient
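#Note: unlike Invoke-WebRequest, WebClient won't reuse $certsession; any login
#cookies would need to be set manually, e.g. $web.Headers.Add('Cookie', '...')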
$web.DownloadString($url)
  • In terms of design, how many certificates are you expecting in the 100M search space? 10K? 50M? Your recursive function risks searching, pulling, and discarding the same certificates over and over while trying to get below 100. Depending on the distribution, I'd be tempted to look for and block out the biggest chunks with 0 certificates; if you can rule out a range of 1M in one request, that's enormously useful. Searching 1M, finding too many certs, searching 500K, too many, [...] searching 10K, still too many, seems wasteful and slow.
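For illustration, a sketch of that idea as an iterative worklist, with a hypothetical Get-CertCount helper standing in for the query-and-parse step from the question:

#Hypothetical: Get-CertCount $prefix runs the search for $prefix and returns the
#number of matches; results would be written out in the 'complete' branch.
$work = [System.Collections.Queue](0..9)            # one-digit prefixes
while ($work.Count -gt 0) {
    $prefix = $work.Dequeue()
    $count = Get-CertCount $prefix
    if ($count -eq 0) { continue }                  # rules out the whole range in one request
    if ($count -gt 98) {                            # possibly truncated: enqueue ten narrower prefixes
        0..9 | ForEach-Object { $work.Enqueue("$prefix$_") }
    }
    else { "prefix $prefix complete ($count certs)" }
}

A queue like this also sidesteps the duplication question from the post, since each prefix is visited exactly once.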
– TessellatingHeckler