0

I have to process badly mismanaged text with creative indentation. I want to remove the empty (or whitespace) lines at the beginning and end of my text without touching anything else; meaning that if the first or last actual lines respectively begin or end with whitespace, these will stay.

For example, this:

<lines, empty or with whitespaces ...>
<text, maybe preceded by whitespace>
<lines with or without text...>
<text, maybe followed by whitespace>
<lines, empty or with whitespaces ...>

turns into

<text, maybe preceded by whitespace>
<lines with or without text...>
<text, maybe followed by whitespace>

preserving the spaces at the beginning and the end of the actual text lines (the text might also be entirely whitespace).

A regex replacing (\A\s*(\r\n|\Z)|\r\n\s*\Z) with the empty string does exactly what I want, but regex is kind of overkill, and I fear it might cost me some time when processing texts with a lot of lines but not much to trim.

On the other hand, an explicit algorithm is easy to make (just read until a non-whitespace/the end while remembering the last line feed, then truncate, and do the same backwards) but it feels like I'm missing something obvious.

How can I do this?

Moige
  • Regular expressions are cheap to develop, fast (enough) to run, and well understood. Why bother creating a bespoke algorithm? – Botje Aug 04 '21 at 09:24
  • Start here https://stackoverflow.com/questions/216823/whats-the-best-way-to-trim-stdstring – Jan Gabriel Aug 04 '21 at 09:41
  • @LjisaMoige, the example they give uses std::isspace, which includes `\n` and `\r`, whereas an empty line would be e.g. consecutive line feeds `\n\n`. See https://en.cppreference.com/w/cpp/string/byte/isspace for what std::isspace covers. Thus, replace std::isspace with a check for `\n` or `\r`. If this does not solve your problem please provide a [MRE](https://stackoverflow.com/help/minimal-reproducible-example) or a test string. – Jan Gabriel Aug 04 '21 at 09:48
  • Consecutive line feeds are not the solution either, as the white lines most often contain spaces or tabs. This is the entire reason I made this ask. – Moige Aug 04 '21 at 09:50
  • I see, can you post an example text? – Jan Gabriel Aug 04 '21 at 09:52
  • Don't hesitate: provide an example of input and the corresponding output. Your problem definition is not accurate enough to unambiguously clarify the problem. – Costantino Grana Aug 04 '21 at 09:52
  • Stackoverflow is unfortunately very uncooperative with the display of whitespace in the ask. – Moige Aug 04 '21 at 09:56
  • I checked the result of your RegEx and it removes the newline on the last line. But coming to your question, you have a solution (RegEx). Use it. Measure performance. Too slow? Go for the "hand made" solution. Measure performance. Is it faster? – Costantino Grana Aug 04 '21 at 10:03
  • @CostantinoGrana If I leave a line feed at the end, it means adding an empty line. That is the opposite of my goal. I have a working solution, I seek the best solution, if it exists. – Moige Aug 04 '21 at 10:06
  • Moreover, is it for files? Then check if keeping the whole file in memory is really necessary. IO can also be impacting here. – Costantino Grana Aug 04 '21 at 10:06
  • @CostantinoGrana It is data from an indexed data file, to be eventually displayed in a form. – Moige Aug 04 '21 at 10:07
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/235617/discussion-between-costantino-grana-and-ljisa-moige). – Costantino Grana Aug 04 '21 at 10:11
  • It depends on your processing. Skipping header lines is trivial (just use `find_first_not_of` with what you consider to be whitespace (`" \t\r\"` is common but you might want to add `"\v\f"`)). For trailing lines, just keep them in a temporary vector until you find a non-blank line... – Serge Ballesta Aug 04 '21 at 10:17
  • 1
    @LjisaMoige You can use code tags (\`) or `
    ` in your question's body to preserve whitespace formatting and show us some exact examples, if necessary.
    – TylerH Aug 04 '21 at 14:14

2 Answers

1

As you can see from this discussion, trimming whitespace requires a lot of work in C++. This should definitely be included in the standard library.

Anyway, I've checked how to do it as simply as possible, but nothing comes near the compactness of RegEx. For speed, it's a different story.

Below are three versions of a program that does the required task: one with regex, one with std functions, and one with just a couple of indexes. The last one could be made faster still, because the copy can be avoided altogether, but I left it in for a fair comparison:

#include <string>
#include <sstream>
#include <chrono>
#include <iostream>
#include <regex>
#include <exception>

struct perf {
    std::chrono::steady_clock::time_point start_;
    perf() : start_(std::chrono::steady_clock::now()) {}
    double elapsed() const {
        auto stop = std::chrono::steady_clock::now();
        std::chrono::duration<double> elapsed_seconds = stop - start_;
        return elapsed_seconds.count();
    }
};

std::string Generate(size_t line_len, size_t empty, size_t nonempty) {
    std::string es(line_len, ' ');
    es += '\n';
    for (size_t i = 0; i < empty; ++i) {
        es += es;
    }

    std::string nes(line_len - 1, ' ');
    nes += "a\n";
    for (size_t i = 0; i < nonempty; ++i) {
        nes += nes;
    }

    return es + nes + es;
}


int main()
{
    std::string test;
    //test = "  \n\t\n  \n  \tTEST\n\tTEST\n\t\t\n  TEST\t\n   \t\n \n  ";
    std::cout << "Generating...";
    std::cout.flush();
    test = Generate(1000, 8, 10);
    std::cout << " done." << std::endl;

    std::cout << "Test 1...";
    std::cout.flush();
    perf p1;
    std::string out1;
    std::regex re(R"(^\s*\n|\n\s*$)");
    try {
        out1 = std::regex_replace(test, re, "");
    }
    catch (std::exception& e) {
        std::cout << e.what() << std::endl;
    }
    std::cout << " done. Elapsed time: " << p1.elapsed() << "s" << std::endl;

    std::cout << "Test 2...";
    std::cout.flush();
    perf p2;
    std::stringstream is(test);
    std::string line;
    while (std::getline(is, line) && line.find_first_not_of(" \t\n\v\f\r") == std::string::npos);
    std::string out2 = line;
    size_t end = out2.size();
    while (std::getline(is, line)) {
        out2 += '\n';
        out2 += line;
        if (line.find_first_not_of(" \t\n\v\f\r") != std::string::npos) {
            end = out2.size();
        }
    }
    out2.resize(end);
    std::cout << " done. Elapsed time: " << p2.elapsed() << "s" << std::endl;

    if (out1 == out2) {
        std::cout << "out1 == out2\n";
    }
    else {
        std::cout << "out1 != out2\n";
    }

    std::cout << "Test 3...";
    std::cout.flush();
    perf p3;
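    // Lookup table indexed by unsigned char: 0 for whitespace
    // (\t, \n, \v, \f, \r and space), 1 for every other character.
    static bool whitespace_table[] = {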
        1,1,1,1,1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    };
    size_t sfl = 0; // Start of first line
    for (size_t i = 0, end = test.size(); i < end; ++i) {
        if (test[i] == '\n') {
            sfl = i + 1;
        }
        else if (whitespace_table[(unsigned char)test[i]]) {
            break;
        }
    }
    size_t ell = test.size(); // End of last line
    for (size_t i = test.size(); i-- > 0;) {
        if (test[i] == '\n') {
            ell = i;
        }
        else if (whitespace_table[(unsigned char)test[i]]) {
            break;
        }
    }
    std::string out3 = test.substr(sfl, ell - sfl);
    std::cout << " done. Elapsed time: " << p3.elapsed() << "s" << std::endl;

    if (out1 == out3) {
        std::cout << "out1 == out3\n";
    }
    else {
        std::cout << "out1 != out3\n";
    }

    return 0;
}

Running it on C++ Shell you get these timings:

Generating... done.
Test 1... done. Elapsed time: 4.2288s
Test 2... done. Elapsed time: 0.0077323s
out1 == out2
Test 3... done. Elapsed time: 0.000695783s
out1 == out3

If performance is important, it's better to really test it with the real files.

As a side note, this regex doesn't work on MSVC, because I couldn't find a way to stop ^ and $ from matching the start and end of each line, that is, to disable the multiline mode of operation. If you run it there, it throws an exception saying regex_error(error_complexity): The complexity of an attempted match against a regular expression exceeded a pre-set level. I think I'll ask how to cope with this!
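One possible way around the anchor problem in the side note above is to avoid ^ and $ entirely. The sketch below is an assumption-laden illustration, not a tested MSVC fix: it anchors the leading match with `std::regex_constants::match_continuous`, and uses the negative lookahead `(?![\s\S])` ("no character follows") as a stand-in for a true end-of-input anchor. The function name `trim_blank_lines_regex` is hypothetical, and the sketch assumes `'\n'` line endings.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: removes whitespace-only lines at both ends of `text`
// without relying on ^ or $.
std::string trim_blank_lines_regex(const std::string& text) {
    // Leading part: whitespace up to the last newline before real text,
    // or the whole input if it is whitespace only.
    static const std::regex lead(R"(\s*(?:\n|(?![\s\S])))");
    // Trailing part: a newline followed only by whitespace until the end of input.
    static const std::regex tail(R"(\n\s*(?![\s\S]))");

    std::smatch m;
    std::size_t begin = 0;
    // match_continuous forces the match to start at the first character.
    if (std::regex_search(text, m, lead, std::regex_constants::match_continuous)) {
        begin = static_cast<std::size_t>(m.length(0));
    }

    std::string rest = text.substr(begin);
    if (std::regex_search(rest, m, tail)) {
        rest.erase(static_cast<std::size_t>(m.position(0)));
    }
    return rest;
}
```

This is not tuned for speed: the trailing search still tries every newline in the text, so the timing concerns discussed in this answer apply to it as well.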

Costantino Grana
  • For the third version, it would be better to start by doing the backwards check, then trimming the end, then doing the forward check to avoid reading twice a large whitespace text. For the regex I use `\A` and `\Z` to avoid it sticking to the edges of individual lines (see the ask, it also has to handle a completely white input). – Moige Aug 04 '21 at 18:35
  • Are `\A` and `\Z` available in C++ regex? I couldn't find anything about them. – Costantino Grana Aug 04 '21 at 19:30
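The first comment above suggests scanning backwards first, so that a text which is entirely whitespace is read only once, and the answer notes that the copy can be avoided. A minimal sketch combining both ideas with `std::string_view` (C++17) follows. The names `is_pad` and `trim_blank_lines` are invented for this example, and it assumes `'\n'` line endings and ASCII whitespace only.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: whitespace that may pad a line, excluding '\n'.
constexpr bool is_pad(char c) {
    return c == ' ' || c == '\t' || c == '\r' || c == '\v' || c == '\f';
}

// Hypothetical helper: returns a view into `text` without the whitespace-only
// lines at its beginning and end. No characters are copied.
std::string_view trim_blank_lines(std::string_view text) {
    // Backward pass: walk over trailing whitespace and remember the leftmost
    // newline seen, which is the one ending the last non-blank line.
    std::size_t end = text.size();
    std::size_t i = text.size();
    while (i > 0 && (text[i - 1] == '\n' || is_pad(text[i - 1]))) {
        --i;
        if (text[i] == '\n') end = i;
    }
    if (i == 0) return {};  // nothing but whitespace: one pass was enough
    text = text.substr(0, end);

    // Forward pass: remember the position just after the last newline that
    // precedes the first non-whitespace character.
    std::size_t begin = 0;
    for (std::size_t j = 0; j < text.size(); ++j) {
        if (text[j] == '\n') begin = j + 1;
        else if (!is_pad(text[j])) break;
    }
    return text.substr(begin);
}
```

The returned view is only valid while the original string is alive and unmodified; constructing a `std::string` from it gives the same copy as `substr` in Test 3.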
0

If whitespace in front of the first line or after the last non-whitespace-only line can be removed then this answer https://stackoverflow.com/a/217605/14258355 will suffice.

However, due to this constraint and if you do not want to use regex, I would propose to convert the string into lines and then build the string back up again from the first to the last non-whitespace-only line.

Here is a working example: https://godbolt.org/z/rozxj6saj

Convert the string to lines:

#include <string>
#include <vector>

std::vector<std::string> StringToLines(const std::string &s) {
  // Create vector with lines (not using input stream to keep line break
  // characters)
  std::vector<std::string> result;
  std::string line;

  for (auto c : s) {
    line.push_back(c);

    // Check for line break
    if (c == '\n' || c == '\r') {
      result.push_back(line);
      line.clear();
    }
  }

  // add last bit
  result.push_back(line);

  return result;
}

Build the string from the first to the last non-whitespace-only line:

#include <algorithm>
#include <cctype>
#include <iterator>

bool IsNonWhiteSpaceString(const std::string &s) {
  return s.end() != std::find_if(s.begin(), s.end(), [](unsigned char uc) {
           return !std::isspace(uc);
         });
}

std::string TrimVectorEmptyEndsIntoString(const std::vector<std::string> &v) {
  std::string result;

  // Find first non-whitespace line
  auto it_begin = std::find_if(v.begin(), v.end(), [](const std::string &s) {
    return IsNonWhiteSpaceString(s);
  });

  // Find last non-whitespace line
  auto it_end = std::find_if(v.rbegin(), v.rend(), [](const std::string &s) {
    return IsNonWhiteSpaceString(s);
  });

  // Build the string
  for (auto it = it_begin; it != it_end.base(); std::advance(it, 1)) {
    result.append(*it);
  }

  return result;
}

Usage example:

  // Create a test string
  std::string test_string(
      "  \n\t\n  \n   TEST\n\tTEST\n\t\tTEST\n  TEST\t\n   \t");

  // Output result
  std::cout << TrimVectorEmptyEndsIntoString(StringToLines(test_string));

Output showing whitespace:

[image: output showing whitespace]

Jan Gabriel
  • But also the spaces and `\t` at the beginning of the first line and the one at the end of the last one. – Costantino Grana Aug 04 '21 at 10:15
  • I understand. I'll change it. – Jan Gabriel Aug 04 '21 at 10:19
  • It appears a lot more expensive than character-level parsing, if only because the entire text is read, and the bordering white spaces are read a second time after the first parsing. It also doesn't handle a completely whitespace input, which might happen. – Moige Aug 04 '21 at 12:31
  • As it stands I'm already set on using the regex solution, I also have a method that parses my text at the character level, but it is ugly. I'm not really asking for a complex code solution. I expected for there to be a built-in method to achieve what I wanted. An answer saying it doesn't exist would suffice. – Moige Aug 04 '21 at 12:39
  • @LjisaMoige, well there you have it. Regex for the win, and a short `std` answer to your question doesn't exist (as far as I'm aware) – Jan Gabriel Aug 04 '21 at 12:50
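The comments above point out that this answer does not handle input that is entirely whitespace. One way to cover that case is an early return when no line contains text. The following is only a sketch under that assumption: `HasNonWhitespace` and `JoinTrimmedLines` are names invented here, and the input is expected to be lines that still carry their line break, as produced by `StringToLines`.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: true if the string contains at least one non-whitespace character.
bool HasNonWhitespace(const std::string& s) {
    return std::any_of(s.begin(), s.end(),
                       [](unsigned char uc) { return !std::isspace(uc); });
}

// Hypothetical helper: joins the lines from the first to the last one that
// contains text, and returns an empty string if no line does.
std::string JoinTrimmedLines(const std::vector<std::string>& v) {
    auto first = std::find_if(v.begin(), v.end(), HasNonWhitespace);
    if (first == v.end()) {
        return {};  // whitespace-only input: nothing to keep
    }
    // base() of the reverse iterator points one past the last line with text.
    auto last = std::find_if(v.rbegin(), v.rend(), HasNonWhitespace).base();

    std::string result;
    for (auto it = first; it != last; ++it) {
        result += *it;
    }
    return result;
}
```

Because each line keeps its own line break, the result still ends with the newline of the last line that contains text, as in the answer's original version.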