0

If I am given two arbitrary numbers which are the start and end (inclusive) of a continuous sequence of natural numbers, what's an efficient way in R to finding the smallest set of substrings which "contain" every number in that sequence from start to end?

Edit: To clarify what's happening, the following link has two examples chosen to represent different cases and to be easy to eyeball. Our actual working input involves much larger number sequences with a lot more digits, which is why our current approach relies on recursion.

The first example has a starting point of 500 and an ending point of 699, meaning the set being operated on is all numeric strings from 500 to 699 inclusive. The solution is "5" because it contains every string whose characters begin with "5". IE from 500 through 599. Likewise for "6".

The second example is more complicated. The given start and end points are 533 to 555. This means the first part of the solution is "533-539". It can't simply be "53" because that would include "530, 531, 532" which are not included in the original range given. So 533 to 539 must be enumerated fully. However the next part of the solution is one digit shorter, it's only "54" because every permutation of "54X" from "540" to "549" is included. Then the final part is counting out "550-555" because again not every number that can start with the substring "55" is part of the given range.

So you can see how this plays out visually as well here's our current code and some sample data. You can paste this right into something like PHPtester.net and see how each pair of "start" and "end" is turned into a set of substrings.

I've been trying to convert this to R and replace his loops with vectorized alternatives or functions like map as much as possible but I'm still basically following his original solution:

  1. Divide the entire Start to End sequence into groups of 10 with each group's "name" being the parent string (IE "55"=c(550:559)).

  2. Check how long each group is and if it's less than 10 items long export it to the output list and delete it, otherwise if it's 10 items long delete it and replace it with the one-digit-shorter substring.

  3. Repeat the process recursively until you no longer get any groups 10 items long.

I realised this reminded me of what I vaguely remember from set theory in undergrad. Is there a string or set analysis package that already handles this specific kind of problem? Or a better method for implementing this solution? Right now the best R implementation I can think of is leaning heavily on purrr and dplyr to group and nest/unnest things as necessary, but instinct tells me that's probably going to scale badly when I start throwing tens of thousands of start-end pairs at it.

I'm also open to giving python a shot if it could offer a much better solution, although I'm still brand new at it and far more familiar with R.

D3SL
  • 117
  • 8
  • Pierre, respectfully, if you had read even the first four lines you would have seen the bright blue link titled "here's our current solution and some sample data". You can copy and paste that right into something like phptester.net and have a fully working example complete with labelled input and output. The question was whether that methodology is viable and efficient in R or there's better. – D3SL Apr 16 '20 at 07:07
  • I have read your question. I even read your reddit post. Your question is still unclear. Why does 500:699 in your PHP example only give you two results (5___example 1 and 6___example 1). This range surely has a lot of subsets of 10 which should give 50___example 1, 51___example 1, etc. I'm not an expert in PHP, but I have the feeling this is trivial in R. Please read this and include the R code you have tried and the expected result: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Pierre Lapointe Apr 16 '20 at 11:20
  • Ok so the issue isn't that you don't have sample data and desired result, it's that my written explanation in the first paragraph is unclear. That I can edit in. I'm very unsure of how to describe this in formal terms, so the explanation is fairly long winded. Please let me know if there are any parts I can elaborate on further. Also I think you missed where I specifically state that the current recursive approach keeps only subsets of *less* than 10 items, since anything with a full count of 10 can be defined by one digit less (IE "55" defines everything from "550" to "559") – D3SL Apr 16 '20 at 13:19

2 Answers2

2

Here's one way to do it. Basically, starting with your initial sequence, you look for full sequences of hundreds, remove them from the sequence. Then look for full sequences of tens, remove them from the sequence and combine the rest.

x <- 533:555
result <- NULL

#full hundreds
my.list <- split(x,floor(x/100)*100)
full_hundreds <- which(lengths(my.list)==100)
if(length(full_hundreds)>0){
  result <- c(result,substring(names(full_hundreds), 1,1))
  x <- as.vector(unlist(my.list[-full_hundreds]))
}

#full tens
if(length(x)>0){
  my.list <- split(x,floor(x/10)*10)
full_tens <- which(lengths(my.list)==10)
if(length(full_tens)>0){
  result <- c(result,substring(names(full_tens), 1,2))
  x <- as.vector(unlist(my.list[-full_tens]))
  }
}

result <- c(result,x)
# [1] "54"  "533" "534" "535" "536" "537" "538" "539" "550" "551" "552" "553" "554" "555"

With:

x <- 500:699
#[1] "5" "6"
Pierre Lapointe
  • 16,017
  • 2
  • 43
  • 56
  • So pretty much the same general approach, divide Start:End into groups of 10+remainder and only save whatever comes up short. If I wanted to make this applicable to any arbitrary length string inputs do you think there's a better approach than recursing the tens place and then reconstructing a new list one string character shorter from the end? – D3SL Apr 16 '20 at 13:33
  • My approach is not recursive. I check for "full hundreds" first. Then, on the rest, I check for "full tens" and then add the rest. This works for any sequence between 100 and 999. If your sequence has numbers above 1000, add a "full thousands" at the beggining. It's very fast. – Pierre Lapointe Apr 16 '20 at 13:54
  • Yes, I was asking if you would turn to recursion if you had to do this because you needed to work with strings of arbitrary length (up to a set maximum) or whether you'd just add on "Full XXX" checks up to the maximum digit length and potentially some basic IF logic to decide where to start? – D3SL Apr 16 '20 at 15:12
  • You don't need recursion. First, use i = (floor(log10(x[2])) + 1) on your upper bound x[2] to get the max number of digits (i). Then, replace the two sections Pierre wrote with a single loop of the form "for (j in seq(i,0,-1)) {}". In the body of the loop you'd replace 100 and 10 with 10**j, and would replace the last substr argument with (i - j) + 1, and so forth. – Patrick B. Apr 22 '20 at 19:45
1

This is what Pierre LaPointe's answer would look like in the general case (note you do 10 ** (j-1), not 10**j as I mentioned in the comment).

x <- 780:913
result <- NULL
ndigits <- as.integer(log10(max(x))) + 1
for (j in seq(ndigits, 1, -1)) {
    ej <- 10 ** (j - 1)
    my.list <- split(x, floor(x / ej) * ej)
    full_0s <- which(lengths(my.list) == ej)
    if (length(full_0s) > 0){
        result <- c(result, substring(names(full_0s), 1, 1 + (ndigits - j)))
        x <- as.vector(unlist(my.list[-full_0s]))
    }
}

result <- c(result, x)

Returns:

> sort(result)
[1] "78"  "79"  "8"   "90"  "910" "911" "912" "913"
Patrick B.
  • 474
  • 5
  • 9
  • So you grab the "shortest" answers first. The method I worked out was unnesting every start:end sequence vertically into a column, making another column of floor(dt$x/10), and then [using data.table's fast group/index](https://pastebin.com/JNw8SdZ9) to recursively count the "10s" place per group and shorten anything with a full 10 by 1 digit. I need to get yours working with long ints and then benchmark our two approaches to see which is faster and more memory efficient. – D3SL Apr 23 '20 at 10:25