Randomly selecting specific subsequence from string

Question

Given a string containing a number of characters interspersed with dashes, for example string s = "A--TG-DF----GR--";, I wish to randomly select a block of dashes (could be of size 1, 2, …, max number of consecutive dashes in string), and copy them over to another part of the string chosen at random.

For example, A--TG-DF(---)-GR-- gets moved to A--T(---)G-DF-GR-- while another iteration may give A--TG-DF----GR(--) gets moved to A--TG-(--)DF----GR.

I'm generating random indices of the string through int i = rand() % (int) s.length();. Insertion happens through s.insert(rand() % (int) s.length(), substr);, where substr is the block of dashes.

My main problem lies in finding randomly a continuous group of dashes. I thought of using s.find("-"), but that'd only return the first instance of a single dash, and not a random position of a collection of dashes.

You could pre-process the string to collect a `std::vector` of iterators that point to the beginnings of the dash-sequences in the string. Then uniformly select an element from that vector. — 5gon12eder, Feb 08 '15 at 18:16

score 4 · Accepted Answer · edited Mar 20 '17 at 10:29

I know this problem is likely steeped in XY problems, but I found it a nice challenge none-the-less, so I thought about implementing it with the Boost Interval Container library.

The beauty of the library is that you can forget about a lot of details, while you don't sacrifice a lot of performance.

I've taken the liberty to generalize it, so that it is capable of moving multiple blocks of dashes (uniform randomly selected) simultaneously.

The solution runs Live On Coliru and generates 1,000,000 random transpositions of the given sample with randomly varied numbers of moved blocks (1..3) in about 2673 ms (1156 ms on my machine):

Generator gen(test_case);

std::string result;
std::map<std::string, size_t> histo;

for(int i = 0; i < 1000000; ++i) {
    auto const mobility = gen.gen_relocations(1 + gen.random(3)); // move n blocks of dashes

    result.clear();
    gen.apply_relocations(mobility, std::back_inserter(result));

    histo[result]++;
}

Note: the benchmark times include the time taken to build the histogram of unique results generated

Let's do a code walk-through here to explain things:

I tried to use "readable" types:

namespace icl = boost::icl;

using Position = size_t;
using Map      = icl::interval_set<Position>;
using Region   = Map::value_type;

E.g. the function that builds the Map of dashes is simply:

template <typename It> Map region_map(It f, It l) {
    Map dashes;

    for (Position pos = 0; f!=l; ++f, ++pos)
        if ('-' == *f)
            dashes += pos;

    return dashes;
}

Note how I didn't particularly optimize this. I let the interval_set combine adjacent dashes. We might use hinted insertion, or a parser that add consecutive dashes as a block. I opted for KISS here.

Later on, we generate relocations, which map a Region onto a non-moving Position in the rest of the text.
```
using Relocs   = boost::container::flat_multimap<Position, Region>;
```
By using the flat multimap, the caller has the entries already sorted by ascending insertion point. Because we use a reserve()-ed up-front flat multimap, we avoid the overhead of a node based implementation of map here.

We start by picking the dash-blocks to be moved:

Map pick_dashes(int n) {
    Map map;
    if (!_dashes.empty())
        for (int i = 0; i < n; ++i)
            map += *std::next(_dashes.begin(), _select(_rng));
    return map;
}

The random distribution have been dimensioned at construction, e.g.:

  _dashes(region_map(_input.begin(), _input.end())),
  _rng(std::random_device {}()),
  _select (0, _dashes.iterative_size() - 1),
  _randpos(0, _input.size() - 1),

Next, we assign insertion-positions to each. The trick is to assign positions inside non-mobile (inert) regions of the source.

this includes other blocks of dashes that stay in their place
there is the degenerate case where everything is a block of dashes, we detected this in the constructor:
```
  _is_degenerate(cardinality(_dashes) == _input.size())
```

So the code reads as follows:

Relocs gen_relocations(int n=1) {
    Map const moving = pick_dashes(n);

    Relocs relocs;
    relocs.reserve(moving.iterative_size());

    if (_is_degenerate)
    {
        // degenerate case (everything is a dash); no non-moving positions
        // exist, just pick 0
        for(auto& region : moving)
            relocs.emplace(0, region);
    } else {
        auto inertia = [&] {
            Position inert_point;
            while (contains(moving, inert_point = _randpos(_rng)))
                ; // discard insertion points that are moving
            return inert_point;
        };

        for(auto& region : moving)
            relocs.emplace(inertia(), region);
    }

    return relocs;
}

Now all we need to do is apply the relocations.

The generic implementation of this is pretty straightforward. Again, it's not particularly optimized in order to keep it simple (KISS):

template <typename F>
    void do_apply_relocations(Relocs const& mobility, F const& apply) const {
        icl::split_interval_set<Position> steps {{0, _input.size()}};

        for (auto& m : mobility) {
            steps += m.first; // force a split on insertion point
            steps -= m.second; // remove the source of the move
            //std::cout << m.second << " moving to #" << m.first << ": " << steps << "\n";
        }

        auto next_mover = mobility.begin();

        for(auto& step : steps) {
            while (next_mover!=mobility.end() && contains(step, next_mover->first))
                apply((next_mover++)->second, true);

            apply(step, false);
        }
    }

Note The trick here is that we "abuse" the split_interval_set combining strategy to break the processing into sub-ranges that "stop" at the randomly generated insertion points: these artificial regions are the "steps" in our generation loop.

The apply function there is what we implement to get what we want. In our case we want a string like A--TG-DFGR(----)-- so we write an overload that appends to a container (e.g. std::string) using any output iterator:

template <typename Out>
    Out apply_relocations(Relocs const& mobility, Out out) const {
        if (_is_degenerate)
            return std::copy(_input.begin(), _input.end(), out);

        auto to_iter = [this](Position pos) { return _input.begin() + pos; };

        do_apply_relocations(mobility, [&](Region const& r, bool relocated) {
            if (relocated) *out++ = '(';
            out = std::copy(
                to_iter(first(r)),
                to_iter(last(r)+1),
                out
            );
            if (relocated) *out++ = ')';
        });

        return out;
    }

Note The "complicated" part here are mapping the Position to input iterators (to_iter) and the code to optionally add () around moved blocks.

And with that, we have seen all the code.

Full Listing

#include <boost/container/flat_map.hpp>
#include <boost/icl/interval_set.hpp>
#include <boost/icl/split_interval_set.hpp>
#include <boost/icl/separate_interval_set.hpp>
#include <boost/lexical_cast.hpp>
#include <boost/range/algorithm.hpp>
#include <iomanip>
#include <iostream>
#include <random>
#include <map>
#include <chrono>

namespace icl = boost::icl;

using Position = size_t;
using Map      = icl::interval_set<Position>;
using Region   = Map::value_type;
using Relocs   = boost::container::flat_multimap<Position, Region>;

struct Generator {
    Generator(std::string const& input) 
        : _input(input),
          _dashes(region_map(_input.begin(), _input.end())),
          _rng(std::random_device {}()),
          _select (0, _dashes.iterative_size() - 1),
          _randpos(0, _input.size() - 1),
          _is_degenerate(cardinality(_dashes) == _input.size())
    {
    }

    unsigned random(unsigned below) {
        return _rng() % below; // q&d, only here to make the tests deterministic for a fixed seed
    }

    Map full() const { 
        return Map { { 0, _input.size() } };
    }

    Relocs gen_relocations(int n=1) {
        Map const moving = pick_dashes(n);

        Relocs relocs;
        relocs.reserve(moving.iterative_size());

        if (_is_degenerate)
        {
            // degenerate case (everything is a dash); no non-moving positions
            // exist, just pick 0
            for(auto& region : moving)
                relocs.emplace(0, region);
        } else {
            auto inertia = [&] {
                Position inert_point;
                while (contains(moving, inert_point = _randpos(_rng)))
                    ; // discard insertion points that are moving
                return inert_point;
            };

            for(auto& region : moving)
                relocs.emplace(inertia(), region);
        }

        return relocs;
    }

    template <typename Out>
        Out apply_relocations(Relocs const& mobility, Out out) const {
            if (_is_degenerate)
                return std::copy(_input.begin(), _input.end(), out);

            auto to_iter = [this](Position pos) { return _input.begin() + pos; };

            do_apply_relocations(mobility, [&](Region const& r, bool relocated) {
                if (relocated) *out++ = '(';
                out = std::copy(
                    to_iter(first(r)),
                    to_iter(last(r)+1),
                    out
                );
                if (relocated) *out++ = ')';
            });

            return out;
        }

    template <typename F>
        void do_apply_relocations(Relocs const& mobility, F const& apply) const {
            icl::split_interval_set<Position> steps {{0, _input.size()}};

            for (auto& m : mobility) {
                steps += m.first; // force a split on insertion point
                steps -= m.second; // remove the source of the move
                //std::cout << m.second << " moving to #" << m.first << ": " << steps << "\n";
            }

            auto next_mover = mobility.begin();

            for(auto& step : steps) {
                while (next_mover!=mobility.end() && contains(step, next_mover->first))
                    apply((next_mover++)->second, true);

                apply(step, false);
            }
        }

  private:
    std::string                             _input;
    Map                                     _dashes;
    std::mt19937                            _rng;
    std::uniform_int_distribution<Position> _select;
    std::uniform_int_distribution<Position> _randpos;
    bool                                    _is_degenerate;

    Map pick_dashes(int n) {
        Map map;
        if (!_dashes.empty())
            for (int i = 0; i < n; ++i)
                map += *std::next(_dashes.begin(), _select(_rng));
        return map;
    }

    template <typename It> Map region_map(It f, It l) {
        Map dashes;

        for (Position pos = 0; f!=l; ++f, ++pos)
            if ('-' == *f)
                dashes += pos;

        return dashes;
    }
};

int main() {

    for (std::string test_case : {
            "----",
            "A--TG-DF----GR--",
            "",
            "ATGDFGR",
        })
    {
        auto start = std::chrono::high_resolution_clock::now();
        Generator gen(test_case);

        std::string result;
        std::map<std::string, size_t> histo;

        for(int i = 0; i < 1000000; ++i) {
            auto const mobility = gen.gen_relocations(1 + gen.random(3)); // move n blocks of dashes

            result.clear();
            gen.apply_relocations(mobility, std::back_inserter(result));

            histo[result]++;
        }
        std::cout << histo.size() << " unique results for '" << test_case << "'"
                  << " in " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now()-start).count() << "ms\n";

        std::multimap<size_t, std::string, std::greater<size_t> > ranked;
        for (auto& entry : histo)
            ranked.emplace(entry.second, entry.first);

        int topN = 10;
        for (auto& rank : ranked)
        {
            std::cout << std::setw(8) << std::right << rank.first << ": " << rank.second << "\n";
            if (0 == --topN)
                break;
        }
    }
}

Prints e.g.

1 unique results for '----' in 186ms
 1000000: ----
3430 unique results for 'A--TG-DF----GR--' in 1156ms
    9251: A(----)--TG-DFGR--
    9226: (----)A--TG-DFGR--
    9191: A--T(----)G-DFGR--
    9150: A--TG-DFGR-(----)-
    9132: A--(----)TG-DFGR--
    9128: A--TG(----)-DFGR--
    9109: A--TG-D(----)FGR--
    9098: A--TG-DFG(----)R--
    9079: A--TG-DFGR(----)--
    9047: A-(----)-TG-DFGR--
1 unique results for '' in 25ms
 1000000: 
1 unique results for 'ATGDFGR' in 77ms
 1000000: ATGDFGR

5gon12eder · Answer 2 · 2015-02-09T03:03:55.063

You can pre-process the string to get a list of iterators that point ot the beginnings of consecutive dashes in the string and then uniformly pick a random element from that list.

I will use the following standard library headers in this example (which is complete and working if you concatenate the following code blocks):

#include <cstddef>
#include <iostream>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>

First, we define a function that finds us said list of iterators. To do so, we make use of std::string::find_first_of and std::string::find_first_not_of to find the index of the first character in and of the first character after the next sequence. Both functions work with indices rather than with iterators, so we have to add them to cbegin(). The function works with any character, not just dashes.

std::vector<std::string::const_iterator>
find_conscutive_sequences(const std::string& text, const char c)
{
  std::vector<std::string::const_iterator> positions {};
  std::size_t idx = 0UL;
  while (idx != std::string::npos && idx < text.length())
    {
      const auto first = text.find_first_of(c, idx);
      if (first == std::string::npos)
        break;
      positions.push_back(text.cbegin() + first);
      idx = text.find_first_not_of(c, first);
    }
  return positions;
}

Next, we define a function that uses the result of the above function to return an iterator to the beginning of a randomly selected sequence of dashes.

We pass in the random engine as a parameter so it can be seeded once and used over and over again. The random standard library introduced in C++11 is so great that it should be preferred whenever possible over the legacy rand function.

If given an empty vector of positions, we have to fail because there is no sequence we could possibly select.

std::string::const_iterator
get_random_consecutive_sequence(const std::vector<std::string::const_iterator>& positions,
                                std::default_random_engine& prng)
{
  if (positions.empty())
    throw std::invalid_argument {"string does not contain any sequence"};
  std::uniform_int_distribution<std::size_t> rnddist {0UL, positions.size() - 1UL};
  const auto idx = rnddist(prng);
  return positions.at(idx);
}

Finally, I define this little helper function to mark the selected sequence. Your code would do the copy / move / shift here.

std::string
mark_sequence(const std::string& text,
              const std::string::const_iterator start)
{
  const auto c = *start;
  const std::size_t first = start - text.cbegin();
  std::size_t last = text.find_first_not_of(c, first);
  if (last == std::string::npos)
    last = text.length();
  std::string marked {};
  marked.reserve(text.length() + 2UL);
  marked += text.substr(0UL, first);
  marked += '(';
  marked += text.substr(first, last - first);
  marked += ')';
  marked += text.substr(last, text.length() - last);
  return marked;
}

It can be used like this.

int
main()
{
  const std::string example {"--A--B-CD----E-F---"};
  std::random_device rnddev {};
  std::default_random_engine rndengine {rnddev()};
  const auto positions = find_conscutive_sequences(example, '-');
  for (int i = 0; i < 10; ++i)
    {
      const auto pos = get_random_consecutive_sequence(positions, rndengine);
      std::cout << mark_sequence(example, pos) << std::endl;
    }
}

Possible output:

--A--B-CD(----)E-F---
--A--B(-)CD----E-F---
--A(--)B-CD----E-F---
--A(--)B-CD----E-F---
--A--B-CD(----)E-F---
--A--B-CD----E-F(---)
--A--B-CD----E-F(---)
(--)A--B-CD----E-F---
--A--B(-)CD----E-F---
(--)A--B-CD----E-F---

In my answer, the code does more (it actually moves _n_ blocks of dashes, and marks them with parens). When I implement [a `mark_sequence()`](http://paste.ubuntu.com/10135138/), benchmarking _just_ the selection of a single block of dashes my code generates identical results in roughly **half the time** (compare [my code](http://coliru.stacked-crooked.com/a/9f3e1e7ecb245e4a) and [your code](http://coliru.stacked-crooked.com/a/4c2cb6a2e851988a) live on Coliru). Kudos for staying with standard library though. +1 — sehe, Feb 09 '15 at 02:11
@sehe Nice job. As mentioned in my (now deleted) last paragraph, the vector could be shared (now implemented) across multiple iterations. If I do this in the benchmark you have set up, your version is still faster but the difference is considerably smaller (mine takes about 30 % longer on my laptop, probably spending most of its time in `mark_sequence`). When I wrote up my answer, performance wasn't a major concern beyond avoiding premature pessimizations (as usual when I answer to beginner's questions) but thanks to your answer, I have learned about *yet another* useful Boost library. — 5gon12eder, Feb 09 '15 at 03:17
Cheers :) For me, that was exactly the reason to use ICL: learning to use this Boost library. I had no optimization in mind. I looked at the problem and I realized things could get really confusing with more "mobile" blocks at a time, getting the insertion points properly distributed. That's the real advantage of using the library: it raises the abstraction level. The other approach I thought of was transposing a list of stringrefs but it would not be as memory efficient. I expect my solution to scale to huge memory-mapped input, whereas the stringrefs approach would likely break down. — sehe, Feb 09 '15 at 07:25

nullptr · Answer 3 · 2015-02-08T18:24:09.040

0

string::find() has optional second parameter: a position to start the search from. So, something like s.find("-", rand() % L) may do the trick for you, where L is (position of the last dash + 1).

edited Feb 08 '15 at 18:24

answered Feb 08 '15 at 18:18

nullptr

11,008
1
23
18

Leandro Caniglia · Answer 4 · 2015-02-08T19:02:33.210

As I understand the problem all dash blocks should have the same probability of being selected. Therefore we must first find the positions where all these blocks start and then pick one of those positions at Random.

If I'm allowed to use Smalltalk for pseudo code, then I would first find the indexes where every dash block starts:

dashPositionsOn: aString
    | indexes i n |
    indexes := OrderedCollection new.
    i := 1.
    n := aString size.
    [i <= n] whileTrue: [| char |
        char := aString at: i.
        char = $-
            ifTrue: [
                indexes add: i.
                [
                    i := i + 1.
                    i <= n and: [
                        char := aString at: i.
                        char = $-]] whileTrue]
            ifFalse: [i := i + 1]].
    ^indexes

Now we can pick one of these indexes at random: indexes atRandom.

Please note that there are (much) better ways to implement this algorithm in Smalltalk (as well as in other languages).

Randomly selecting specific subsequence from string

4 Answers4

Full Listing

Linked