0

How can one write a helper method that for a given string like foo[0][1][2][3] splits it into the array's name and collection (e.g. vector) of indices? In the example above it should produce foo and 0, 1, 2, 3 respectively.

The format of a string is always like name[index_0][index_1]....[index_n]. The number of indices (n) is not known in advance. All should be numbers. For simplicity, no spaces are allowed in a string. The name of an array (name) can be arbitrary. The helper function throws in case the string is not according to the specified format.

Performance is not an issue here. I'm looking for the most elegant/short solution.

UPDATE

Well, regular expressions were suggested in the very first comment. I was new to the area and went through the hassle of getting it done in C++. Please, feel free to simplify it. In the meantime, two non-regular expressions based solutions were proposed by @MartinYork and @Frodyne. At a first glance, regular expressions didn't bring anything fascinating here. The solution does not seem to be much shorter or much more elegant in my mind.

#include <stdexcept>
#include <iostream>
#include <string>
#include <regex>
#include <tuple>

std::tuple<std::string, std::vector<int>> helper(std::string str) {
  // used to validate that the incoming string is in format
  // array[0][1][2]
  const std::regex rx_validate{
      "([[:alnum:]]+)((?:\\[[[:digit:]]+\\])+)$"};

  std::match_results<std::string::const_iterator> match_results;
  std::regex_search(str, match_results, rx_validate);

  // regex_search array[0][1][2] gives
  // match_results[0]: array[0][1][2]
  // match_results[1]: array
  // match_results[2]: [0][1][2]
  if (match_results.size() == 3) {
    std::vector<int> indices;

    // used to extract indices, it is guaranteed that
    // numbers are between brackets, no extra checks
    // needed
    const std::regex rx_index{"[0-9]+"};
    const std::string match{match_results[2]};
    auto it = std::sregex_iterator(match.begin(), match.end(), rx_index);
    for (; it != std::sregex_iterator(); ++it)
      indices.push_back(std::stoi((*it).str()));

    return std::make_tuple(match_results[1], indices);
  } else {
    throw std::invalid_argument("Invalid format (" + str + ")");
  }
}

int main() {
  const std::string str{"a[0][1][2][3][4][5]"};
  const auto tuple = helper(str);

  std::cout << "Name: " << std::get<0>(tuple) << std::endl;
  for (int index: std::get<1>(tuple))
    std::cout << index << std::endl;
}

UPDATE2

@rici suggested a modification of an algorithm that used regular expressions. It's a bit shorter and more concise.

I was really interested in comparing those algorithms in terms of performance.

Not going to advocate the numbers :-) Everybody should decide on their own.

The below program compiled as g++ -std=c++11 -Ofast and running on i7-8550U gives:

Regex measurements...
min/max/avg 955/154859/1072.88
Stream measurements...
min/max/avg 722/41252/800.402
#include <iostream>
#include <cstdlib>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>
#include <regex>
#include <tuple>

#include <time.h>

inline uint64_t Timestamp() {
  timespec time_now;
  clock_gettime(CLOCK_REALTIME, &time_now);
  return static_cast<uint64_t>(time_now.tv_sec) * 1000000000ULL + time_now.tv_nsec;
}

std::tuple<std::string, std::vector<int>> helper_stream(std::string const& info)
{
    std::stringstream is(info);
    std::string         name;
    std::vector<int>    index;

    if (std::getline(is, name, '[')) {
        is.putback('[');
        name.erase(std::remove(std::begin(name), std::end(name), ' '), std::end(name));

        int   value;
        char  b1;
        char  b2;
        while(is >> b1 >> value >> b2 && b1 == '[' && b2 == ']') {
            index.push_back(value);
        }
    }
    return std::make_tuple(name, index);
}

std::tuple<std::string, std::vector<int>> helper_regex(std::string str) {
    static const std::regex strip_prefix{"^[[:alpha:]][[:alnum:]]*"};
    static const std::regex index{"\\[([[:digit:]]+)\\]|."};
    std::match_results<std::string::const_iterator> match;
    if (std::regex_search(str, match, strip_prefix)) {
        auto e = match[0].second;
        std::vector<int> indices;
        for (auto it = std::sregex_iterator(e, str.end(), index), lim = std::sregex_iterator(); it != lim; ++it) {
            if ((*it)[1].matched)
                indices.push_back(std::stoi((*it)[1]));
            else throw std::invalid_argument("Invalid format");
        }
        return std::make_tuple(std::string(str.cbegin(), e), indices);
    }
    else
        throw std::invalid_argument("Invalid format (" + str + ")");
}

std::string make_str(int n) {
  std::string str{"array"};

  for (int i = 0; i < n; ++i) {
    str += "[";
    str += std::to_string(std::rand());
    str += "]";
  }

  return str;
}

template <typename F>
void measurements(F f) {
  constexpr int kNumRounds = 1000000;
  constexpr int kLength = 3;

  std::vector<uint64_t> time_diffs(kNumRounds);

  for (int i = 0; i < kNumRounds; ++i) {
    const std::string str{make_str(kLength)};

    const auto before = Timestamp();
    f(str);
    const auto after = Timestamp();

    time_diffs[i] = after - before;
  }

  uint64_t min{std::numeric_limits<uint64_t>::max()};
  uint64_t max{std::numeric_limits<uint64_t>::min()};
  uint64_t sum{0};

  for (int i = 0; i < kNumRounds; ++i) {
    const auto time_diff = time_diffs[i];

    if (time_diff < min)
      min = time_diff;

    if (time_diff > max)
      max = time_diff;

    sum += time_diff;
  }

  std::cout << "min/max/avg " << min << "/" << max << "/" << static_cast<double>(sum) / kNumRounds << std::endl;
}

int main() {
  std::cout << "Regex measurements..." << std::endl;
  measurements(helper_regex);

  std::cout << "Stream measurements..." << std::endl;
  measurements(helper_stream);

  return 0;
}
TruLa
  • 1,031
  • 11
  • 21
  • 5
    Use a regular expression. – Vlad from Moscow Jul 24 '19 at 13:30
  • 3
    @VladfromMoscow: [Now they have two problems!](https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems):-) – AndyG Jul 24 '19 at 13:38
  • 1
    Could you define the format of the string more exactly. Will it always have `foo`? will it always have four index's? Will the indexes always be integers? Are spaces allowed/required in any parts of the string? What are the types and formats of the output? – Martin York Jul 24 '19 at 13:45
  • @MartinYork, thank you for your comment. I'll update the original post. – TruLa Jul 24 '19 at 13:48
  • answers in this [SO thread](https://stackoverflow.com/questions/289347/using-strtok-with-a-stdstring) might give you some idea. – dorKKnight Jul 24 '19 at 14:07
  • Well, regular expressions were suggested in the very first comment. I was new to the area and went through the hassle of getting it done in C++. I'll update the original post and include my solution. Please, feel free to simplify it. In the meantime, two non-regular expressions based solutions were proposed by @MartinYork and @ Frodyne. At a first glance, regular expressions didn't bring anything fascinating here. The solution does not seem to be much shorter or much more elegant in my mind. – TruLa Jul 24 '19 at 15:40
  • 1
    @TruLa: I think the regex solution can be somewhat more elegant or at least shorter. For example, http://coliru.stacked-crooked.com/a/7c6aedb1ecec2e36. Note that it is simple to adapt the regex solution to, for example, accept whitespace inside the expression. – rici Jul 24 '19 at 17:26

2 Answers2

2

This is one of the few times I would advocate falling back to the C parsing functions. Though it could be done by regular expression this seems a bit heavy weight for something so trivial.

I would use the C function sscanf()

std::tuple<std::string, std::vector<int>> ck1(std::string const& info)
{

    int                 functionStartSize = 0;
    int                 functionNameSize = 0;
    char                check = 'X';
    std::vector<int>   index;

    if (std::sscanf(info.data(), " %n%*[^\[]%n%c", &functionStartSize, &functionNameSize, &check) == 1 && check == '[') {

        // Format String: " %n%*[^\[]%n%c"
        // ' ':        Ignore all leading space.
        // %n:         Save number of characters of space we dropped.
        // %*[^\[]:    Lets split this up
        //             %*      scan but don't save to a variable.
        //             [..]    Only the letters we find inside the brackets.
        //             ^\]     Everything except ]
        // %n:         Save the number of characters we have used to here.
        // %c:         A character This should now be a '['
        // We have correctly found the beginning and end of the name.

        int size;
        int value;
        int offset = functionNameSize;
        while(std::sscanf(info.data() + offset, "[%d%c%n", &value, &check, &size) == 2 && check == ']') {
            // We have found another index
            index.push_back(value);
            offset += size;
        }
    }
    return std::make_tuple(info.substr(functionStartSize, (functionNameSize-functionStartSize), index);
}

When I was first writing the above code I assumed that %n would count just like any other parameter. Unfortunately it does not count towards the return value. This made the check for each index slightly more obscure and thus I am not thinking it is better to use the stream one below.

The streams don't do that bad:
A full copy of the string into the string stream. But for small strings not a big issue.

std::tuple<std::string, std::vector<int>> ck2(std::string const& info)
{
    std::stringstream is(info);
    std::string         name;
    std::vector<int>    index;

    if (std::getline(is, name, '[')) {
        is.putback('[');
        name.erase(std::remove(std::begin(name), std::end(name), ' '), std::end(name));

        int   value;
        char  b1;
        char  b2;
        while(is >> b1 >> value >> b2 && b1 == '[' && b2 == ']') {
            index.push_back(value);
        }
    }
    return std::make_tuple(name, index);
}
Martin York
  • 257,169
  • 86
  • 333
  • 562
  • Thank you. I updated my original post. I agree with you. I cannot see benefits in using regular expressions to solve this problem. – TruLa Jul 24 '19 at 15:46
  • That's `std::sscanf()`, assuming you're including the C++ header ``. The compatibility header ``, isn't recommended for new code. – Toby Speight Jul 24 '19 at 16:33
  • @TobySpeight Yes. Is there a flag to check that? – Martin York Jul 24 '19 at 20:25
  • There's [no way](//stackoverflow.com/q/44694574) to catch that error using GCC. The only [useful suggestion](//stackoverflow.com/q/44694574#comment79116631_44694574) was to build with Oracle's Studio compiler on Solaris. (Sadly, I no longer have any Solaris boxes, and I never had that compiler...) – Toby Speight Jul 25 '19 at 10:51
0

My answer is fairly similar to the one by Martin York, but I used stl instead.

#include <iostream>
#include <vector>
#include <string>
#include <tuple>

std::tuple<std::string, std::vector<int>> getNameIndices(std::string s)
{
    std::vector<int> indices;

    // The name must end at the first '['
    size_t pos = s.find("[");
    // If we can't find that, then it isn't a valid string - return empty
    if (pos == std::string::npos)
        return std::make_tuple("", indices);

    // Get the name and remove it from the string
    std::string name = s.substr(0, pos);
    s.erase(0, pos + 1);

    size_t begin = 0;
    // Keep looping as long as we can find the start of a new index
    while ((pos = s.find("]")) != std::string::npos)
    {
        // Begin is the position of the '[', pos is the ']': Get the text between them
        std::string tmp = s.substr(begin, pos - begin);
        indices.push_back(stoi(tmp));
        // Remove the characters that were matched, and update 'begin'
        s.erase(0, pos + 1);
        begin = s.find("[") + 1;
    }

    // Return the name and indices in a vector
    return std::make_tuple(name, indices);
}

void main()
{
    std::string s = "foo[500][12][2][13]";

    auto b = getNameIndices(s);

    std::cout << "Name: " << std::get<0>(b) << std::endl;
    for (int i : std::get<1>(b))
    {
        std::cout << "\t" << i << std::endl;
    }
}
Frodyne
  • 3,547
  • 6
  • 16
  • If you are going to use the STL you should be using `std::stringstream` and `operator >>` and `std::getline()` to extract the parts. – Martin York Jul 24 '19 at 16:02
  • @MartinYork Having seen your STL solution, I agree: That way looks much cleaner and better. – Frodyne Jul 25 '19 at 07:05