Computing the size of UID possibilities

Question

Per DICOM specification, a UID is defined by: 9.1 UID Encoding Rules. In other words the following are valid DICOM UIDs:

"1.2.3.4.5"
"1.3.6.1.4.35045.103501438824148998807202626810206788999"
"1.2.826.0.1.3680043.2.1143.5028470438645158236649541857909059554"

while the following are illegal DICOM UIDs:

".1.2.3.4.5"
"1..2.3.4.5"
"1.2.3.4.5."
"1.2.3.4.05"
"12345"
"1.2.826.0.1.3680043.2.1143.50284704386451582366495418579090595540"

Therefore I know that the string is at most 64 bytes, and should match the following regex [0-9\.]+. However this regex is really a superset, since there are a lot less than (10+1)^64 (=4457915684525902395869512133369841539490161434991526715513934826241L) possibilities.

How would one computes precisely the number of possibilities to respect the DICOM UID rules ?

Reading the org root / suffix rule clearly indicates that I need at least one dot ('.'). In which case the combination is at least 3 bytes (char) in the form: [0-9].[0-9]. In which case there are 10x10=100 possibilities for UID of length 3.

Looking at the first answer, there seems to be something unclear about:

The first digit of each component shall not be zero unless the component is a single digit.

What this means is that:

"0.0" is valid
"00.0" or "1.01" are not valid

Thus I would say a proper expression would be:

(([1-9][0-9]*)|0)(\.([1-9][0-9]*|0))+

Using a simple C code, I could find:

f(0) = 0
f(1) = 0
f(2) = 0
f(3) = 100
f(4) = 1800
f(5) = 27100
f(6) = 369000
f(7) = 4753000
f(8) = 59049000

The validation of the Root UID part is outside the scope of this question. A second validation step could take care of rejecting some OID that cannot possibly be registered (some people mention restriction on first and second arc for example). For simplicity we'll accept all possible (valid) Root UID.

score 10 · Accepted Answer · edited Dec 01 '16 at 22:30

While my other answer takes good care of this specific application, here is a more generic approach. It takes care of situations where you have a different regular expression describing the language in question. It also allows for considerably longer string lengths, since it only requires O(log n) arithmetic operations to compute the number of combinations for strings of length up to n. In this case the number of strings grows so quickly that the cost of these arithmetic operations will grow dramatically, but that may not be the case for other, otherwise similar situations.

Build a finite state automaton

Start with a regular expression description of your language in question. Translate that regular expression into a finite state automaton. In your case the regular expression can be given as

(([1-9][0-9]*)|0)(\.([1-9][0-9]*|0))+

The automaton could look like this:

Eliminate ε-transitions

This automaton usually contains ε-transitions (i.e. state transitions which do not correspond to any input character). Remove those, so that one transition corresponds to one character of input. Then add an ε-transition to the accepting state(s). If the accepting states have other outgoing transitions, don't add ε-loops to them, but instead add an ε-transition to an accepting state with no outgoing edges and then add the loop to that. This can be seen as padding the input with ε at its end, without allowing ε in the middle. Taken together, this transformation ensures that performing exactly n state transitions corresponds to processing an input of n characters or less. The modified automaton might look like this:

Note that both the construction of the first automaton from the regular expression and the elimination of ε-transitions can be performed automatically (and perhaps even in a single step. The resulting automata might be more complicated than what I constructed here manually, but the principle is the same.

Ensuring unique paths

You don't have to make the automaton deterministic in the sense that for every combination of source state and input character there is only one target state. That's not the case in my manually constructed one either. But you have to make sure that every complete input has only one possible path to the accepting state, since you'll essentially be counting paths. Making the automaton deterministic would ensure this weaker property, too, so go for that unless you can ensure unique paths without this. In my example the length of each component clearly dictates which path to use, so I didn't make it deterministic. But I've included an example with a deterministic approach at the end of this post.

Build transition matrix

Next, write down the transition matrix. Associate the rows and columns with your states (in order a, b, c, d, e, f in my example). For each arrow in your automaton, write the number of characters included in the label of that arrow in the column associated with the source state and the row associated with the target state of that arrow.

⎛ 0  0  0  0  0  0⎞
⎜ 9 10  0  0  0  0⎟
⎜10 10  0 10 10  0⎟
⎜ 0  0  1  0  0  0⎟
⎜ 0  0  0  9 10  0⎟
⎝ 0  0  0 10 10  1⎠

Read result off that matrix

Now applying this matrix with a column vector once has the following meaning: if the number of possible ways to arrive in a given state is encoded in the input vector, the output vector gives you the number of ways one transition later. Take the 64th power of that matrix, concentrate on the first column (since ste start situation is encoded as (1,0,0,0,0,0), meaning only one way to end up in the start state) and sum up all the entries that correspond to accepting states (only the last one in this case). The bottom left element of the 64th power of this matrix is

1474472506836676237371358967075549167865631190000000000000000000000

which confirms my other answer.

Compute matrix powers efficiently

In order to actually compute the 64th power of that matrix, the easiest approach would be repeated squaring: after squaring the matrix 6 times you have an exponent of 2⁶ = 64. If in some other scenario your exponent (i.e. maximal string length) is not a power of two, you can still perform exponentiation by squaring by multiplying the relevant squares according to the bit pattern of the exponent. This is what makes this approach take O(log n) arithmetic operations to compute the result for string length n, assuming a fixed number of states and therefore fixed cost for each matrix squaring.

Example with deterministic automaton

If you were to make my automaton deterministic using the usual powerset construction, you'd end up with

and sorting the states as a, bc, c, d, cf, cef, f one would get the transition matrix

⎛ 0  0  0  0  0  0  0⎞
⎜ 9 10  0  0  0  0  0⎟
⎜ 1  0  0  0  0  0  0⎟
⎜ 0  1  1  0  1  1  0⎟
⎜ 0  0  0  1  0  0  0⎟
⎜ 0  0  0  9  0 10  0⎟
⎝ 0  0  0  0  1  1  1⎠

and could sum the last three elements of the first column of its 64th power to obtain the same result as above.

I see a problem with that: If the automaton is non-deterministic, a word may have more than one path to the accepting state. If I understood your technique correctly, such a word would be counted with multiplicity, which we clearly don't want. — Hermann Döppes, Sep 13 '16 at 22:42
Ok, I found a proof that this problem doesn't occur in this particular instance (as expected) but this should still be included in the answer. — Hermann Döppes, Sep 13 '16 at 22:52
@HermannDöppes: You are right. I thought about this at some point while away from my computer, then forgot to edit it in. Thanks for reminding me! — MvG, Sep 14 '16 at 05:43
Oh, and your regex has a typo: The `\.` needs to be moved into the parentheses. — Hermann Döppes, Sep 14 '16 at 05:46
@HermannDöppes: That regex mistake is in the question already, but I should have noticed it while copying. I've now included two more sections to my post to discuss this determinism thing. And I must say I'm surprised your edit got rejected; personally I would have accepted it. I only noticed it after formulating the same idea in my own words, though. — MvG, Sep 14 '16 at 07:21
I didn't notice the error in the question, just that you got it right in your other answer. (= — Hermann Döppes, Sep 14 '16 at 08:00
Possibly interesting: log₂1474472506836676237371358967075549167865631190000000000000000000000 = 219.807453185 bits (27.475931648 bytes). On a machine with 8-bit addressable bytes, the **smallest possible representation would fit in a 28-byte buffer**. This is considerably less than the 64 bytes required for the ISO/IEC 646-based representation mandated by the DICOM UID encoding rules. Can be even less considering Silverhalide's remark that some components have a restricted range in practice. — tne, Dec 01 '16 at 22:43
@tne: True, although the encoding might be a bit complicated. [Packed BCD](https://en.wikipedia.org/wiki/Binary-coded_decimal#Packed_BCD) in [nybbles](https://en.wikipedia.org/wiki/Nibble) already halves the encoded length to 32 bytes, using an eleventh digit for dot and perhaps a twelvth for padding if you don't want to re-use dots for this. That's not much longer than 28 bytes, and way easier to encode. — MvG, Dec 02 '16 at 07:05
@tne: Actually 11⁶⁴ only has 221.4 bits or 27.7 bytes of information. So simply considering the whole UID as a base-11 number with digits `0123456789.` and representing that number in binary instead will lead to a very efficient encoding. The encoder and decoder would be pretty short if one could build upon a big integer library, but doing the big int math manually would be more tedious. — MvG, Dec 02 '16 at 07:36

MvG · Answer 2 · 2016-09-12T07:40:10.993

Single component

Start by looking for ways to form a single component. The corresponding regular expression for a single component is

0|[1-9][0-9]*

so it is either zero or a non-zero digit followed by arbitrary many zero digits. (I had missed the possible sole zero case at first, but the comment by malat made me aware of this.) If the total length of such a component is to be n, and you write h(n) to denote the number of ways to form such a component of length exactly n, then you can compute that h(n) as

h(n) = if n = 1 then 10 else 9 * 10^(n - 1)

where the n = 1 case allows for all possible digits, and the other cases ensure a non-zero first digit.

One or more components

Subsection 9.1 only writes that a UID is a bunch of dot-separated number components, as outlined above. So in regular expressions that would be

(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*

Suppose f(n) is the number of ways to write a UID of length n. Then you have

f(n) = h(n) + sum h(i) * f(n-i-1) for i from 1 to n-2

The first term describes the case of a single component, while the sum takes care of the case where it consists of more than one component. In that case you have a first component of length i, then a dot which accounts for the -1 in the formula, and then the remaining digits form one or more components which is expressed via the recursive use of f.

Two or more components

As the comment by cneller indicates, the part of section 9 before subsection 9.1 indicates that there has to be at least two components. So the proper regular expression would be more like

(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))+

with a + at the end indicating that we want at least one repetition of the parenthesized expression. Deriving an expression for this simply means leaving out the one-component-only case in the definition of f:

g(n) = sum h(i) * f(n-i-1) for i from 1 to n-2

If you sum all the g(n) for n from 3 (the minimal possible UID length) through 64 you get the number of possible UIDs as

1474472506836676237371358967075549167865631190000000000000000000000

or approximately 1.5e66. Which is considerably less than the 4.5e66 you get from your computation, in terms of absolute difference, although it's definitely on the same order of magnitude. By the way, your estimate doesn't explicitely mention UIDs shorter than 64, but you can always consider padding them with dots in your setup. I did the computation using a few lines of Python code:

f = [0]
g = [0]
h = [0, 10] + [9 * (10**(n-1)) for n in range(2, 65)]
s = 0
for n in range(1, 65):
    x = 0
    if n >= 3:
        for i in range(1, n - 1):
            x += h[i] * f[n-i-1]
    g.append(x)
    f.append(x + h[n])
    s += x
print(h)
print(f)
print(g)
print(s)

I think this is good if you use it for the org-root and suffix components of the UID individually (UID must be .). By itself, f(1) and f(2) are not valid full UID values; f(3) should be 81 (a 3 digit UID value should be i.j where i and j are in [1-9]). So, likely the number is something like sum(f(n)*sum(f(m))) for n from 1 to 62 and m from 1 to (63-n). [Really there are only 63 flexible values since the . is required to separate the org-root and the suffix.] — cneller, Sep 09 '16 at 13:42
@cneller: Thanks for pointing this out, it seems I hadn't read enough of that document. I adapted my answer to accomodate this. I disagree with your “Really there are only 63 flexible values…”, because although you apparently have to have a dot, its position isn't fixed. The final result is still larger than the `11^63` upper bound for 63 positions would allow. — MvG, Sep 09 '16 at 14:01
I believe there is an error in your computation since the first 8 values I computed manually do not match your python script. — malat, Sep 10 '16 at 19:06
@malat: I incorrectly excluded components which are zero. Will adapt my answer when I find the time. — MvG, Sep 11 '16 at 09:50
Note that the initial component is limited to 0, 1, or 2 (which if I recall correctly, is ISO-issued, ITU-issued, and jointly issued). Further, I believe the second component is restricted to two digits. This is done so that the first two components can be encoded in a single byte. This should reduce your problem space somewhat. — esilver, Sep 13 '16 at 05:52
@Silverhalide: For [the Example OID 2.999](http://oid-info.com/get/2.999) the second component definitely exceeds two digits, but otherwise what you say [appears to be true](http://www.oid-info.com/cgi-bin/display). In my understanding it is however only a statement about the current state of the registry. I believe it would be legal to register longer roots in the future. Seeing how an OID isn't meant to be parsed, packing a certain portion of it into a single byte feels like a goal of minor value at best, but I'm no expert. — MvG, Sep 13 '16 at 07:44
A quick look at the ITU documentation indicates that the second level arcs under the 0 and 1 arcs are restricted to 0 through 40. The second level under the 2 arc isn't restricted. You are correct that OIDs aren't meant to be parsed, but the encoding is for efficient storage. There is a defined binary encoding of OIDs. — esilver, Sep 13 '16 at 08:22