
I want to find an algorithm to count the number of distinct subarrays of an array.

For example, in the case of A = [1,2,1,2], the number of distinct subarrays is 7:

{ [1] , [2] , [1,2] , [2,1] , [1,2,1] , [2,1,2], [1,2,1,2]}  

and in the case of B = [1,1,1], the number of distinct subarrays is 3:

{ [1] , [1,1] , [1,1,1] }

A sub-array is a contiguous subsequence, or slice, of an array. Distinct means different contents; for example:

[1] from A[0:1] and [1] from A[2:3] are not distinct.

and similarly:

B[0:1], B[1:2], B[2:3] are not distinct.

Lee Taylor
Mod
  • You can check here http://stackoverflow.com/questions/2710713/algorithm-to-generate-all-possible-permutations-of-a-list – Ozan Deniz Jul 07 '13 at 15:15
  • @user93353: It's not math. It's an algorithmic problem – Fallen Jul 07 '13 at 15:20
  • Your example is wrong. There are 8 subarrays. You forgot `[]`, which is a subarray of every array. Otherwise you have to define `sub-array` as a *non-empty* contiguous sequence... – Bakuriu Jul 07 '13 at 19:37

6 Answers


Construct a suffix tree for this array, then add together the lengths of all edges in the tree.

The time needed to construct the suffix tree is O(n) with a proper algorithm (Ukkonen's or McCreight's). The time needed to traverse the tree and add the lengths together is also O(n).
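To make the counting argument concrete, here is a sketch using an *uncompressed* suffix trie instead of a compressed suffix tree: every trie node below the root corresponds to exactly one distinct subarray, so counting nodes gives the answer. Note this illustration is O(n^2), not the O(n) Ukkonen/McCreight construction the answer refers to:

```python
def count_distinct_subarrays(a):
    # Naive (uncompressed) suffix trie: insert every suffix; each newly
    # created node is a distinct subarray (a prefix of some suffix).
    # O(n^2) time/space -- for illustration only; a compressed suffix
    # tree built with Ukkonen's algorithm achieves O(n).
    root = {}
    count = 0
    for i in range(len(a)):
        node = root
        for x in a[i:]:
            if x not in node:
                node[x] = {}
                count += 1  # new node => new distinct subarray
            node = node[x]
    return count

print(count_distinct_subarrays([1, 2, 1, 2]))  # 7
print(count_distinct_subarrays([1, 1, 1]))     # 3
```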

Evgeny Kluev
  • How do you implement a suffix tree for an integer array and what would be the time complexity of your method ? – Mod Jul 07 '13 at 16:45
  • @Mod: As normal suffix tree with large alphabet size. Each node may be implemented as a map (key=number from the array, value=link to descendant node + "substring"). – Evgeny Kluev Jul 07 '13 at 16:52
  • 1
    Could you please provide a clear implementation or reference and the complexity? – Mod Jul 07 '13 at 17:02
  • You can create a structure with the same results as a suffix tree which is easier to implement (but probably less efficient) using a sorted list of suffixes and cancelling out adjacent prefixes. I found an implementation in python which solves the problem; although, it uses strings instead of lists: http://mmhs.ca/ccc/2003/S4Substringscl.txt – Ryan Jul 07 '13 at 17:07
  • @Mod: Implementation would be a little bit lengthy. I'm afraid I cannot describe it here. As for references, get any string processing book or read this pdf: ["Suffix Trees and Suffix Arrays" by Srinivas Aluru](http://www.eecs.ucf.edu/~shzhang/Combio09/suffix.pdf). – Evgeny Kluev Jul 07 '13 at 17:17
  • I did not understand the meaning of length of edge. – Mod Jul 07 '13 at 17:27
  • @Mod: Simple suffix trees contain long sequences of nodes, one node for each character (or number from your array). They cannot be constructed efficiently. Ukkonen's or McCreight's algorithm construct suffix trees in "compressed" form, when only tree branches are stored as nodes but all "linear" paths are stored as single edges. Number of nodes in "not compressed" tree corresponding to an edge determine its length. See referred pdf for better explanation. – Evgeny Kluev Jul 07 '13 at 17:36
  • Here's a way to understand why the sum of all edge lengths is the number you want. Consider all the suffixes of the array: each one corresponds to some path from the root of the suffix tree to a leaf. For a given suffix of length k, which will have root-to-leaf path length k on the tree, each of the k possible prefixes of it corresponds to a subarray (substring) -- so if we somehow knew that all these prefixes-of-suffixes were distinct, we could just add up the path length of every suffix. But two suffixes *can* share a prefix... – j_random_hacker Jul 07 '13 at 19:38
  • ... (e.g. `bcdbce` and `bce`, both suffixes of `abcdbce`, have the prefix `bc` in common), and we want to count such substrings only once. To ensure we do this, instead take each suffix (in no particular order) in turn and colour in each edge from root to leaf for that suffix, *but only add the lengths of edges that have not already been coloured in to the total*. (So e.g. when we come to process the suffix `bce` we find that the edge from the root labelled `bc` has already been coloured in by our earlier processing of `bcdbce`, so we avoid adding 2 the second time.)... – j_random_hacker Jul 07 '13 at 19:39
  • ... After processing all suffixes this way, we find that every edge in the tree has been added to the total exactly once -- i.e. the total we have is just the sum of the edge lengths in the tree. – j_random_hacker Jul 07 '13 at 19:39
  • I don't see how a suffix tree is sufficient. in the example given (A = [1, 2, 1, 2]), [2, 1] does not appear anywhere in the suffix tree. – njzk2 Jul 08 '13 at 08:11
  • @njzk2: since [2 1 2] is a suffix, path "root-2-1-2" should be present in suffix tree. [2 1] is prefix of this path, so it should also be present in suffix tree. With one little complication: in "compressed" form of suffix tree "2" is a node, but "1" is part of an edge. – Evgeny Kluev Jul 08 '13 at 08:25
  • @EvgenyKluev : OK, thanks for the clarification. In this case, I still don't understand how the node 1 can be considered identical between the 1- prefix and the 2-1-2 suffix ? – njzk2 Jul 08 '13 at 08:45
  • @njzk2: sorry, I didn't understand this last question. – Evgeny Kluev Jul 08 '13 at 08:50
  • part of the initial question is to identify unique substrings. [1] is one of those. it is necessary to see at some point that the first and last occurrences of [1] should be considered as only 1 substring. I don't get in the suffix tree how this is done. I have drawn a suffix tree for the example given, and I don't get how this computes into the expected list of substrings. – njzk2 Jul 08 '13 at 09:02
  • @njzk2: [1] is counted only once, as a node nearest to the root. While it is more efficient to use "compact" form of suffix tree, you'd better "uncompress" it to understand how it works. To get all unique substrings, just trace paths **from root** to every node. You get all the substrings because you get all prefixes of all suffixes. And they are unique because suffix tree does not store duplicates (by design). So you just need to count the nodes (or add together edge lengths in "compact" form). – Evgeny Kluev Jul 08 '13 at 09:16

You could trivially build a set of the subarrays and count them, but I'm not certain it is the most efficient way, as it is O(n^2).

In Python that would be something like:

subs = [tuple(A[i:j]) for i in range(0, len(A)) for j in range(i + 1, len(A) + 1)]

uniqSubs = set(subs)

which gives you:

set([(1, 2), (1, 2, 1), (1,), (1, 2, 1, 2), (2,), (2, 1), (2, 1, 2)])

The double loop in the comprehension makes the O(n^2) complexity clear.

Edit

Apparently there is some discussion about the complexity. Creation of subs is O(n^2), as there are O(n^2) items.

Creating a set from a list is O(m) where m is the size of the list, m being n^2 in this case, as adding to a set is amortized O(1).

The overall is therefore O(n^2).

njzk2
  • Thank you , njxk2 but I would like a better complexity but still +1. Oops still can't upvote. – Mod Jul 07 '13 at 15:33
  • 2
    I don't get how is it O(N^2). You make a set of subsequences which is O(n^2) and comparing each subsequence with another. It is then becomes O(N^4). – Shashwat Kumar Jul 07 '13 at 15:37
  • I would say it would be O(n^2 log n) since inserting an element takes O(log n) in a set. – Mod Jul 07 '13 at 15:40
  • 1
    @Mod The comparison here is not O(1) it take O(n) time to check if two list are identical. That make the algorithm O(n^3 log(n)) – banarun Jul 07 '13 at 15:42
  • erf that's quite a better solution than mine^^ +1 – skoll Jul 07 '13 at 16:51
  • The problems is not the number of equal comparisons(since `set` use hashing only few sequences are effectively compared; most comparisons are avoided), but the time taken to compute the hashes, which is `O(n)`. This should lead to a `O(n^3)` average complexity for this solution. – Bakuriu Jul 07 '13 at 19:44
  • i maintain the o(n^2) complexity. the creation of the `subs` array is trivially o(n^2), and adding an element to a set is o(1), which makes the set creation also o(n^2). both operation are sequential, not nested, so the overall average case complexity is still o(n^2). – njzk2 Jul 08 '13 at 08:04
  • "the creation of the subs array is trivially o(n^2)" this is obviously false. the operation `A[i:j]` in the definiton of `subs` takes `O(j - i)`, not `O(1)`. indeed, the total memory usage of `subs` is `O(n^3)` so there is no way the time taken to create it can be faster – xuanji Mar 01 '22 at 14:16

Edit: I thought about how to reduce the number of iterations/comparisons and found a way to do it: if you retrieve a sub-array of size n, then each sub-array of size smaller than n will already have been added.

Here is the code updated.

    List<Integer> A = new ArrayList<Integer>();
    A.add(1);
    A.add(2);
    A.add(1);
    A.add(2);

    System.out.println("global list to study: " + A);

    //global list
    List<List<Integer>> listOfUniqueList = new ArrayList<List<Integer>>();      

    // iterate on 1st position in list, start at 0
    for (int initialPos=0; initialPos<A.size(); initialPos++) {

        // iterate on list size, starting with the full list and then decreasing
        for (int currentListSize=A.size()-initialPos; currentListSize>0; currentListSize--) {

            //initialize current list.
            List<Integer> currentList = new ArrayList<Integer>();

            // iterate on each (corresponding) int of global list
            for ( int i = 0; i<currentListSize; i++) {
                currentList.add(A.get(initialPos+i));
            }

            // ensure uniqueness
            if (!listOfUniqueList.contains(currentList)){
                listOfUniqueList.add(currentList);                      
            } else {
                continue;
            }
        }
    }

System.out.println("list retrieved: " + listOfUniqueList);
System.out.println("size of list retrieved: " + listOfUniqueList.size());

global list to study: [1, 2, 1, 2]

list retrieved: [[1, 2, 1, 2], [1, 2, 1], [1, 2], [1], [2, 1, 2], [2, 1], [2]]

size of list retrieved: 7

With a list containing the same pattern many times, the number of iterations and comparisons will be quite low. For your example [1, 2, 1, 2], the line if (!listOfUniqueList.contains(currentList)){ is executed 10 times. It only rises to 36 for the input [1, 2, 1, 2, 1, 2, 1, 2], which contains 15 different sub-arrays.

skoll
  • in order to help with optimization, I should point out that this algorithm took 8436 iterations for an array of 36 elements – skoll Jul 07 '13 at 16:56
  • the problem here is the complexity of List.contains, which could be replaced by a HashSet (contains would become O(1) instead of O(n)). – njzk2 Jul 08 '13 at 08:03

Right my first answer was a bit of a blonde moment.

I guess the answer would be to generate them all and then remove duplicates. Or, if you are using a language like Java with a set object, make all the sub-arrays and add them to a set. Sets only contain one instance of each element and automatically remove duplicates, so you can just get the size of the set at the end. (Note that in Java a set of int[] would not deduplicate by contents, since arrays use identity-based equals/hashCode; use List<Integer> elements instead.)

user1646196
  • The OP wants the number of _distinct_ sub _arrays_, not sub _sets_. (which BTW has an upper bound of (N-1)*N/2, IICC) – wildplasser Jul 07 '13 at 15:16
  • subarray != subset, as your answer suggests. subset is a group of items from the initial collection (set or array). subarray is a subgroup that preserve the order and the continuity. – njzk2 Jul 07 '13 at 15:24
  • My bad I misunderstood the question – user1646196 Jul 07 '13 at 15:26

I can think of two ways...

The first is to compute some sort of hash of each sub-array and add it to a set. If, on adding, a hash matches that of an existing array, do a verbose element-by-element comparison... and log it, so that you know your hash algorithm isn't good enough...

The second is to use some sort of probable match and then drill down from there... if the number of elements is the same and the totals of the elements added together are the same, then check verbosely.
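The second idea can be sketched as follows. The choice of (length, sum) as the "probable match" signature is my own illustration, not from the answer; only sub-arrays sharing a signature are compared element by element:

```python
def count_distinct(a):
    # Bucket sub-arrays by a cheap signature (length, sum); a full
    # element-by-element comparison is only done within a bucket.
    buckets = {}
    count = 0
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = a[i:j]
            key = (j - i, sum(sub))      # probable-match signature
            group = buckets.setdefault(key, [])
            if sub not in group:         # verbose check on collision
                group.append(sub)
                count += 1
    return count

print(count_distinct([1, 2, 1, 2]))  # 7
```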

Grady Player

Create an array of pairs, where each pair stores the value of an element of the array and its index.

pair[i] = (A[i],i);

Sort the pairs in increasing order of A[i], breaking ties by decreasing order of i.

Consider the example A = [1,3,6,3,6,3,1,3];
the pair array after sorting will be pair = [(1,6),(1,0),(3,7),(3,5),(3,3),(3,1),(6,4),(6,2)]

pair[0] has the element at index 6. From index 6 we can have two sub-arrays, [1] and [1,3]. So ANS = 2.
Now take each consecutive pair one by one.
Taking pair[0] and pair[1]:
pair[1] has index 0. We can have 8 sub-arrays beginning at index 0, but two of them, [1] and [1,3], are already counted. To avoid counting them twice, compare the longest common prefix of the sub-arrays beginning at pair[0] and pair[1]. The longest common prefix of the suffixes beginning at indices 0 and 6 has length 2, i.e. [1,3].
So the new distinct sub-arrays are [1,3,6] through [1,3,6,3,6,3,1,3], i.e. 6 sub-arrays, and the new value of ANS is 2+6 = 8.

So for pair[i] and pair[i+1]:
ANS = ANS + (number of sub-arrays beginning at pair[i+1]) - (length of their longest common prefix).

The sorting part takes O(n log n).
Iterating over each consecutive pair is O(n), and for each iteration finding the longest common prefix takes O(n), making the whole iteration part O(n^2). It's the best I could get.

You can see that we don't really need the pair for this: the first value of the pair, the element's value, was not strictly required. I used it for better understanding; you can always skip it.
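For reference, the same idea in its standard suffix-array form (sort the full suffixes, then subtract the longest common prefixes of adjacent suffixes in sorted order) can be sketched in Python. This is a quadratic illustration using explicit suffix lists, not an O(n log n) suffix-array construction:

```python
def count_distinct_via_suffix_sort(a):
    # Distinct subarrays = total length of all suffixes
    #                    - sum of LCPs of adjacent sorted suffixes.
    n = len(a)
    suffixes = sorted(a[i:] for i in range(n))

    def lcp(x, y):
        # Length of the longest common prefix of two lists.
        k = 0
        while k < len(x) and k < len(y) and x[k] == y[k]:
            k += 1
        return k

    total = sum(len(s) for s in suffixes)
    total -= sum(lcp(suffixes[i], suffixes[i + 1]) for i in range(n - 1))
    return total

print(count_distinct_via_suffix_sort([1, 2, 1, 2]))  # 7
print(count_distinct_via_suffix_sort([1, 1, 1]))     # 3
```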

Shashwat Kumar