1

This is a problem about substrings that I created. I am wondering how to implement an O(nlog(n)) solution to this problem because the naive approach is pretty easy. Here is how it goes. You have a string S. S has many substrings. In some substrings, the first character and last character are there more than once. Find how many substrings where the first and last character are there more than once.

Input: "ABCDCBE"
Expected output: 2
Explanation: "BCDCB" and "CDC" are two such substrings

That test case explanation only has "BCDCB" and "CDC" where first and last char are same.

There can be another case aside from the sample case with "ABABCAC" being the substring where the first character "A" appears 3 times and the last character "C" appears twice. "AAAABB" is also another substring.

"AAAAB" does not satisfy.

What I have learned that is O(nlog(n)) that might or might not contribute to solution is Binary Indexed Trees. Binary Indexed Trees can somehow be used to solve this. There is also sorting and binary search, but first I want to focus especially on Binary Indexed Trees.

I am looking for a space complexity of O(n log(n)) or better.

Also Characters are in UTF-16

halcyon44
  • 13
  • 4
  • The question is not very clear to me: in the beginning, you talk about finding only "how many substrings where the first and last character are there more than once", and then you talk about an "equally valid" case where characters in the middle of the substring matters. It seems inconsistent. Please clarify this point. – Jérôme Richard Apr 04 '21 at 12:52
  • Gladly. I mean that the sample cases only have substrings where the first and last are equal to each other but I wanted to clarify a case where the first character repeated more than once but was not at the end and the last character repeated more than once but was not at the beginning. This case is not for a sample but a new case for clarafication. – halcyon44 Apr 04 '21 at 14:46
  • Ok. Thank you for this answer. Can you update the question to specify what output you expect (and why) for this case "ABABCAC". I guess the output should be 5 here. What about for "AAAABB" too? Is it 7? (my implicit question is: is it ok to count substrings appearing twice or more) – Jérôme Richard Apr 04 '21 at 15:01
  • AAAABB also satisfies the condition – halcyon44 Apr 04 '21 at 15:30
  • It is okay to count the same substring twice so AAAABB has AAAABB AAABB AABB BB AAAA AAA AAA AA AA AA – halcyon44 Apr 04 '21 at 15:44
  • From what I understand your question is, if it's `n` of the same letter, then the output will be `O(n^2)` like a completely connected graph? – Neil Apr 04 '21 at 16:35
  • Yes it would be like n(n-1)/2. You don't have to simulate it though. There are brute force methods but also O(nlog(n)) method which I am seeking. – halcyon44 Apr 04 '21 at 17:05
  • 1
    @Neil note that the output is a number of filtered substring and not the list of all the possible solutions. – Jérôme Richard Apr 04 '21 at 17:22

1 Answers1

1

The gist of my solution is as follows:

Iterate over the input array, and, for each position, compute the amount of 'valid' substrings that end on that position. The sum of these values is the total amount of valid substrings. We achieve this by counting the amount of valid starts to a substring, that come before the current position, using a Binary Indexed Tree.

Now for the full detail:

As we iterate over the array we think of the current element as the end of a substring, and we say that the positions that are a valid start are those such that its value appears again between it, and the position we are currently iterating over. (i.e. if the value at the start of a substring appears at least twice in it)

For example:

current index              V
data  = [1, 2, 3, 4, 1, 4, 3, 2]
valid = [1, 0, 1, 1, 0, 0, 0, 0]
         0  1  2  3  4  5  6  7

The first 1 (at index 0) is a valid start, because there is another 1 (at index 4) after it, but before the current index (index 6).

Now, counting the amount of valid starts that come before the current index gives us something pretty close to what we wanted, except that we may grab some substrings that don't have two appearances of the last value of the substring (i.e. the one we are currently iterating over)

For example:

current index              V
data  = [1, 2, 3, 4, 1, 4, 3, 2]
valid = [1, 0, 1, 1, 0, 0, 0, 0]
         0  1  2  3  4  5  6  7
                  ^--------^

Here, the 4 is marked as a valid start (because there is another 4 that comes after it), but the corresponding substring does not have two 3s.

To fix this, we shall only consider valid starts up to the previous appearance of the current value. (this means that the substring will contain both the current value, and its previous appearance, thus, the last element will be in the substring at least twice)

The pseudocode goes as follows:

fn solve(arr) {
  answer := 0
  for i from 1 to length(arr) {
    previous_index := find_previous(arr, i)

    if there is a previous_index {
      arr[previous_index].is_valid_start = true
      answer += count_valid_starts_up_to_and_including(arr, previous_index)
    }
  }
  return answer
}

To implement these operations efficiently, we use a hash table for looking up the previous position of a value, and a Binary Indexed Tree (BIT) to keep track of and count the valid positions.

Thus, a more fleshed out pseudocode would look like

fn solve(arr) {
  n := length(arr)

  prev := hash_table{}
  bit  := bit_indexed_tree{length = n}

  answer := 0
  for i from 1 to length(arr) {
    value := arr[i]
    previous_index := prev[value]

    if there is a previous_index {
      bit.update(previous_index, 1)
      answer += bit.query(previous_index)
    }

    prev[value] = i
  }
  return answer
}

Finally, since a pseudocode is not always enough, here is an implementation in C++, where the control flow is a bit munged, to ensure efficient usage of std::unordered_map (C++'s built-in hash table)

class Bit { 
    std::vector<int> m_data;
public:
    // initialize BIT of size `n` with all 0s
    Bit(int n);

    // add `value` to index `i`
    void update(int i, int value);

    // sum from index 0 to index `i` (inclusive)
    int query(int i);
};

long long solve (std::vector<int> const& arr) {
    int const n = arr.size();

    std::unordered_map<int, int> prev_index;
    Bit bit(n);

    long long answer = 0;
    int i = 0;
    for (int value : arr) {

        auto insert_result = prev_index.insert({value, i});
        if (!insert_result.second) { // there is a previous index
            int j = insert_result.first->second;

            bit.update(j, 1);
            answer += bit.query(j);

            insert_result.first->second = i;
        }

        ++i;
    }

    return answer;
}

EDIT: For transparency, here is the Fenwick tree implementation i used to test this code

struct Bit {
    std::vector<int> m_data;
    Bit(int n) : m_data(n+2, 0) { }
    int query(int i) {
        int res = 0;
        for(++i; i > 0; i -= i&-i) res += m_data[i];
        return res;
    }
    void update(int i, int x) {
        for(++i; i < m_data.size(); i += i&-i) m_data[i] += x;
    }
};