4

I have loaded a text file content to a std::string. And I want to remove the whitespaces from the loaded string.

  1. Which below method need to be used for better performance?
  2. Which below method can be a best practice.?
  3. Else any better way is there to achieve this?

Method 1:
Scan the string using string.find() in a loop statement and remove whitespace using string.erase();

Method 2:
Scan the string using string.find() in a loop statement and copy the non whitespace characters to a new string() variable using string.append();

Method 3:
Scan the string using string.find() in a loop statement and copy the non whitespace characters to a new string(size_t n, char c) variable using string.replace();

Method 4:
Allocate a char* (using malloc[size of the source string])
Scan the string using string.find() in a loop statement and copy the non whitespace characters to the char* variable using strcpy and then strcat();
finally copy the char* to a new string
free char*

Jeet
  • 1,006
  • 1
  • 14
  • 25
  • 1
    can you change the way the file is loaded? – m.s. Aug 12 '15 at 08:11
  • possible duplicate of [How to remove all the occurrences of a char in c++ string](http://stackoverflow.com/questions/20326356/how-to-remove-all-the-occurrences-of-a-char-in-c-string) – Holt Aug 12 '15 at 08:13
  • 1
    You'd better first establish that there is a performance problem before you are concerned with performance. – dtech Aug 12 '15 at 08:13
  • @m.s. Thanks. Its a best way and i can go with the Method 4 malloc(file_size) when i read a file from disk. But the API i am using reading a remote file and returning a string type. Do i need to edit the question? – Jeet Aug 12 '15 at 08:27
  • 1
    just add this information in your question because it restricts some solutions – m.s. Aug 12 '15 at 08:28

6 Answers6

12

Edit: upgraded to locale-aware trait. Thanks user657267 !

Standards algorithms are neat !

s.erase(std::remove_if(
    begin(s), end(s),
    [l = std::locale{}](auto ch) { return std::isspace(ch, l); }
), end(s));

Live on Coliru

That was for in-place modification, if you need to keep the original string however :

std::string s2;
s2.reserve(s.size());
std::remove_copy_if(
    begin(s), end(s),
    std::back_inserter(s2),
    [l = std::locale{}](auto ch) { return std::isspace(ch, l); }
);

Live on Coliru

Quentin
  • 62,093
  • 7
  • 131
  • 191
  • I think it's pretty self documenting code, but performance can be an issue. AFAIK remove_if works using shifts of data to remove gaps. – Elohim Meth Aug 12 '15 at 08:52
  • @ElohimMeth depends on whether the modification must be made in-place. – Quentin Aug 12 '15 at 08:54
  • @user657267 I'm not sure of what I should pass as the `locale` argument (I don't have a lot of text manipulation stuff behind me) – Quentin Aug 12 '15 at 08:55
  • 1
    For not-in-place cases, we have `remove_copy_if`. – T.C. Aug 12 '15 at 08:56
  • @Quentin generally you just pass in a default-constructed `std::locale`, which will be the current global C++ locale, I posted an example below. – user657267 Aug 12 '15 at 08:56
  • 1
    @ddriver standard algorithm are certainly a good first try though. The implementers are supposed to know their stuff. – Quentin Aug 12 '15 at 09:04
  • 1
    @ddriver I don't see why it would, but then I'm no library implementer ;) – Quentin Aug 12 '15 at 09:05
  • 2
    @Quentin Suppose you are right. Using separately allocated string, requires one copy of each non whitespace character. Correctly written remove_if, requires only one move of each non whitespace character. – Elohim Meth Aug 12 '15 at 09:44
3

For readability I would go with the boost string algorithm library

#include <boost/algorithm/string.hpp>
std::string str1="Hello  Dolly,   Hello World!"
boost::erase_all(str1, " ");

EDIT: Example for all whitespace characters:

std::string str1="Hello  Dolly,\n   Hello\t World!\t";
boost::find_format_all(str1, boost::token_finder(::isspace), boost::const_formatter(""));

EDIT2: I ran some benchmarks to see how this method compares with Quentin's answer. I ran 100 samples using both the locale aware and non locale aware version of isspace. The average times in micro seconds were:

| method                            | avg. time (μs) |
|-----------------------------------|----------------|
| boost::find_format_all w/locale   | 2.02429        |
| boost::find_format_all w/o locale | 0.578105588    |
| std::remove_if w/locale           | 1.197737742    |
| std::remove_if w/o locale         | 0.190661227    |
sjdowling
  • 2,994
  • 2
  • 21
  • 31
  • The other way around… not all whitespace is blanks ("spacebars"). In particular, newlines and tabs are also whitespace. – Arne Vogel Aug 12 '15 at 09:27
  • @ArneVogel @ddriver I've added a more generic example which will replace all whitespace characters as defined by `::isspace`. Obviously this can be replaced by a locale aware version. – sjdowling Aug 12 '15 at 10:28
3

IMHO, best performance you can get with method 2, but before appending you need to call std::string::reserve method to set capacity of new string to the size of initial string. This needed to prevent unnecessary reallocations, when appending.

Elohim Meth
  • 1,777
  • 9
  • 13
2

Method 5: Cover all edge cases (including the current locale) with the library, and make adjustments if and only if profiling shows it to be an issue.

#include <algorithm>
#include <functional>
#include <iostream>
#include <locale>
#include <string>

template<typename CharT, typename Traits, typename Allocator>
std::basic_string<CharT, Traits, Allocator>
strip_whitespace(std::basic_string<CharT, Traits, Allocator> str)
{
  str.erase(
    remove_if(str.begin(), str.end(), bind(
      std::isspace<CharT>, std::placeholders::_1, std::locale{}
    )),
    str.end()
  );
  return str;
}

int main()
{
  std::string str{"A string with \n whitespace"};
  std::cout << str << '\n' << strip_whitespace(str) << '\n';
}
user657267
  • 20,568
  • 5
  • 58
  • 77
2

It is not clear what exactly do you mean by "remove white spaces" - does it intend to make text unreadable and break source code, or do you mean "any excess whitespace"?

Since there was one answer that suggested a third party library, I will go in with a method from Qt, which will trim out only "excess whitespace":

QString s("Test\n new line,   multiple spacebars \t tab!");

qDebug() << s;
qDebug() << s.simplified();

Output:

"Test
 new line,   multiple spacebars      tab!"

"Test new line, multiple spacebars tab!"

Most of the other answers focus on removing all spacebars, but strictly speaking, there are other characters which classify as whitespace such as tab or new line.

Looking at the code for the simplified() method, I'd say it is fairly efficient:

QString QString::simplified() const
{
    if (d->size == 0)
        return *this;
    QString result(d->size, Qt::Uninitialized);
    const QChar *from = (const QChar*) d->data;
    const QChar *fromend = (const QChar*) from+d->size;
    int outc=0;
    QChar *to   = (QChar*) result.d->data;
    for (;;) {
        while (from!=fromend && from->isSpace())
            from++;
        while (from!=fromend && !from->isSpace())
            to[outc++] = *from++;
        if (from!=fromend)
            to[outc++] = QLatin1Char(' ');
        else
            break;
    }
    if (outc > 0 && to[outc-1] == QLatin1Char(' '))
        outc--;
    result.truncate(outc);
    return result;
}

The new string is preallocated without initialization, so both any initialization and any reallocations are avoided, then similarly to method 2 (keep in mind appending without reserving space will result in many and slow reallocations), only the useful data is copied to the result string, and in the end, it is trimmed shorter to remove any wasted memory.

You could follow this logic to implement it efficiently for std::string

Taken directly from the Qt source code.

dtech
  • 47,916
  • 17
  • 112
  • 190
1

For performance it would be better to read the string one platform word at a time (8 bytes on 64-bit platforms), then extract each character from the register read, test it for whitespace and if it is not a whitespace, add it to the platform word wide register to be written next. When the register to be written is full (as many characters are stored as can fit into the register), write it to the output buffer allocated in advance. Character-by-character scan is 8 times slower for single-byte character strings.

Finally, the tail of the string may contain less characters than would occupy a full platform word. A counter is needed then to know how many characters are in the last platform word processed.

Serge Rogatch
  • 13,865
  • 7
  • 86
  • 158
  • 2
    this is a microoptimization. You should rely on the compiler to do it. You should resort to this *if ever) only after profiling. – bolov Aug 12 '15 at 08:31
  • @bolov, I profiled a case of copying byte by byte. It's indeed much slower than copying of 64-bit values at a time. Assuming the question deems this string processing as a bottleneck, the optimization makes sense. And 8 times improvement in the bottleneck is not a "micro". – Serge Rogatch Aug 12 '15 at 08:43
  • I am not saying that it is not good or faster. Just that thinking about register size and company is a microoptimization, regardless of what speedup you get. Just saying that you shouldn't jump to this when you start coding. You should know and understand this, but consider it only after profiling. I am not saying it's a bad answer or suggestion, just wanted to add a clarification. – bolov Aug 12 '15 at 09:03
  • I'm not in the position to test this. I am surprised that a simple for isn't optimized.... 8 times speed up seems suspicious..... – bolov Aug 12 '15 at 09:06
  • ...although you might be right. There are dependencies between s[i] and s[i+1], so the compiler might have to keep the access sequential.... – bolov Aug 12 '15 at 09:08
  • @bolov, why not to implement this algorithm and see yourself? – Serge Rogatch Aug 12 '15 at 10:06
  • told you, can't right now. No access to PC. We'll try sometime. – bolov Aug 12 '15 at 10:53