If I am given two arbitrary numbers which are the start and end (inclusive) of a continuous sequence of natural numbers, what's an efficient way in R to finding the smallest set of substrings which "contain" every number in that sequence from start to end?
Edit: To clarify what's happening, the following link has two examples chosen to represent different cases and to be easy to eyeball. Our actual working input involves much larger number sequences with a lot more digits, which is why our current approach relies on recursion.
The first example has a starting point of 500 and an ending point of 699, meaning the set being operated on is all numeric strings from 500 to 699 inclusive. The solution is "5" because it contains every string whose characters begin with "5". IE from 500 through 599. Likewise for "6".
The second example is more complicated. The given start and end points are 533 to 555. This means the first part of the solution is "533-539". It can't simply be "53" because that would include "530, 531, 532" which are not included in the original range given. So 533 to 539 must be enumerated fully. However the next part of the solution is one digit shorter, it's only "54" because every permutation of "54X" from "540" to "549" is included. Then the final part is counting out "550-555" because again not every number that can start with the substring "55" is part of the given range.
So you can see how this plays out visually as well here's our current code and some sample data. You can paste this right into something like PHPtester.net and see how each pair of "start" and "end" is turned into a set of substrings.
I've been trying to convert this to R and replace his loops with vectorized alternatives or functions like map
as much as possible but I'm still basically following his original solution:
Divide the entire Start to End sequence into groups of 10 with each group's "name" being the parent string (IE "55"=c(550:559)).
Check how long each group is and if it's less than 10 items long export it to the output list and delete it, otherwise if it's 10 items long delete it and replace it with the one-digit-shorter substring.
Repeat the process recursively until you no longer get any groups 10 items long.
I realised this reminded me of what I vaguely remember from set theory in undergrad. Is there a string or set analysis package that already handles this specific kind of problem? Or a better method for implementing this solution? Right now the best R implementation I can think of is leaning heavily on purrr and dplyr to group and nest/unnest things as necessary, but instinct tells me that's probably going to scale badly when I start throwing tens of thousands of start-end pairs at it.
I'm also open to giving python a shot if it could offer a much better solution, although I'm still brand new at it and far more familiar with R.