33

I am writing a piece of software that requires me to handle data I get from a webpage with libcurl. When I get the data, for some reason it has extra line breaks in it. I need a way to allow only letters, numbers, and spaces, and to remove everything else, including line breaks. Is there an easy way to do this? Thanks.

Austin Witherspoon

12 Answers

52

Write a function that takes a char and returns true if you want to remove that character or false if you want to keep it:

bool my_predicate(char c);

Then use the std::remove_if algorithm to remove the unwanted characters from the string:

std::string s = "my data";
s.erase(std::remove_if(s.begin(), s.end(), my_predicate), s.end());

Depending on your requirements, you may be able to use one of the Standard Library predicates, like std::isalnum, instead of writing your own predicate (you said you needed to match alphanumeric characters and spaces, so perhaps this doesn't exactly fit what you need).

If you want to use the Standard Library std::isalnum function, you will need a cast to disambiguate between the std::isalnum function in the C Standard Library header <cctype> (which is the one you want to use) and the std::isalnum in the C++ Standard Library header <locale> (which is not the one you want to use, unless you want to perform locale-specific string processing):

s.erase(std::remove_if(s.begin(), s.end(), (int(*)(int))std::isalnum), s.end());

This works equally well with any of the sequence containers (including std::string, std::vector and std::deque). This idiom is commonly referred to as the "erase/remove" idiom. The std::remove_if algorithm will also work with ordinary arrays. The std::remove_if makes only a single pass over the sequence, so it has linear time complexity.
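Note that std::remove_if removes the characters for which the predicate returns true, so passing std::isalnum directly removes the alphanumeric characters rather than keeping them. Here is a minimal sketch of a predicate for the question as asked (keep letters, digits and spaces, drop everything else, line breaks included). The names is_unwanted and sanitize are hypothetical, chosen only for illustration:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical predicate: returns true for characters that should be removed.
bool is_unwanted(char c) {
    // Convert to unsigned char first: passing a negative char value
    // to std::isalnum is undefined behavior.
    unsigned char uc = static_cast<unsigned char>(c);
    return !(std::isalnum(uc) || uc == ' ');
}

// Hypothetical helper: applies the erase/remove idiom with that predicate.
void sanitize(std::string& s) {
    s.erase(std::remove_if(s.begin(), s.end(), is_unwanted), s.end());
}
```

With this sketch, calling sanitize on "my data,\n line 2!" should leave "my data line 2".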

James McNellis
  • 3
    @James: It is removing alpha numeric characters instead of special characters. am i doing something wrong ? – bjskishore123 Jul 04 '13 at 11:22
  • 2
    It will remove alphanumeric characters and not special characters because `(int(*)(int))std::isalnum` will return `true` whenever an alphanumeric character is encountered and that character will be erased from the string. – Sumit Gera Dec 28 '13 at 22:35
  • 4
    `(int(*)(int))std::isalnum` will keep only the special characters, instead use `std::not1(std::ptr_fun( (int(*)(int))std::isalnum ))` to invert its logic – Megarushing Jan 19 '17 at 21:37
  • 1
    As said this will remove the alphanumeric chars, needs to be inverted – Dado Aug 05 '17 at 11:53
16

Previous uses of std::isalnum won't compile with std::ptr_fun without passing the unary argument it requires, so this solution with a lambda function should encapsulate the correct answer:

s.erase(std::remove_if(s.begin(), s.end(),
[]( auto const& c ) -> bool { return !std::isalnum(static_cast<unsigned char>(c)); } ), s.end());
Dado
  • Why do you need to include the &c in the auto, why not just c? – Podo May 18 '19 at 17:42
  • Yes, you can use whatever signature you want: a value, a value with std::move, perfect forwarding, etc. I think auto const& is the safer bet when you don't know the real type, as you are guaranteed no extra expensive copies, although in some cases a value/move is even more performant, and in some cases even a simple value for intrinsic types. – Dado May 18 '19 at 18:42
5

Just extending James McNellis's code a little. His code deletes the alnum characters instead of the non-alnum ones.

To delete the non-alnum characters from a string (alnum = alphabetic or numeric):

  • Declare a function (isalnum returns 0 if the character passed is not alnum)

    bool isNotAlnum(char c) {
        return isalnum(static_cast<unsigned char>(c)) == 0;
    }

  • And then write this

    s.erase(remove_if(s.begin(), s.end(), isNotAlnum), s.end());


After that, your string contains only alnum characters.

Ali Eren Çelik
5

You could always loop through and just erase every character that is neither alphanumeric nor a space, if you're using string.

#include <cctype>

size_t i = 0;
size_t len = str.length();
while(i < len){
    // erase only when the character is neither alphanumeric nor a space
    if (!isalnum((unsigned char)str[i]) && str[i] != ' '){
        str.erase(i,1);
        len--;
    }else
        i++;
}

Someone better with the Standard Lib can probably do this without a loop.

If you're using just a char buffer, you can loop through and if a character is not alphanumeric, shift all the characters after it backwards one (to overwrite the offending character):

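#include <cctype>
#include <cstring>

size_t buflen = something;
for (size_t i = 0; i < buflen; ) {
    unsigned char c = buf[i];
    if (!isalnum(c) && c != ' ')
        // memmove, not memcpy: the ranges overlap; stay at i to check the shifted-in character
        memmove(&buf[i], &buf[i + 1], --buflen - i);
    else
        ++i;
}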
zlji
Seth Carnegie
  • 1
    Eliminating the loop would involve the [erase-remove idiom](http://en.wikipedia.org/wiki/Erase-remove_idiom) – Ismail Badawi Jun 12 '11 at 03:12
  • In your second case, if you maintain source and destination pointers, you can avoid doing a memcpy of the remaining buffer every time a character needs to be removed. i.e. char *d = buf; for (char *s = buf; *s; ++s) { if (isalnum(*s) || *s == ' ') *d++ = *s; } *d = 0; – Ferruccio Jun 12 '11 at 11:52
3

Benchmarking the different methods.

If you are looking for a benchmark I made one.

(115830 cycles) 115.8ms -> using stringstream
( 40434 cycles)  40.4ms -> s.erase(std::remove_if(s.begin(), s.end(), [](char c) { return !isalnum(c); }), s.end());
( 40389 cycles)  40.4ms -> s.erase(std::remove_if(s.begin(), s.end(), [](char c) { return ispunct(c); }), s.end());
( 42386 cycles)  42.4ms -> s.erase(remove_if(s.begin(), s.end(), not1(ptr_fun( (int(*)(int))isalnum ))), s.end());
( 42969 cycles)  43.0ms -> s.erase(remove_if(s.begin(), s.end(), []( auto const& c ) -> bool { return !isalnum(c); } ), s.end());
( 44829 cycles)  44.8ms -> alnum_from_libc(s) see below
( 24505 cycles)  24.5ms -> Puzzled? My method, see below
(  9717 cycles)   9.7ms -> using mask and bitwise operators

Original length: 8286208, current len with alnum only: 5822471

  • Stringstream gives terrible results (but we all know that)
  • The different answers already given all have about the same runtime
  • Doing it the C way with my method consistently gives a better runtime (24.5 ms against roughly 40 ms), so it is definitely worth considering, and on top of that it is compatible with the C language.
  • My bitwise method (also C compatible) is more than four times as fast as the remove_if versions (9.7 ms against roughly 40 ms).

NB the selected answer had to be modified as it was keeping only the special characters

NB2: The test file is an 8092 KB text file (8286208 bytes) drawn from an alphabet of 52 letters and 22 special characters (see the source below), randomly and evenly distributed.


Benchmark source code

#include <ctime>

#include <iostream>
#include <sstream>
#include <string>
#include <algorithm>

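#include <locale>
#include <cctype> // isalnum, ispunct

#include <fstream> // read file
#include <streambuf>

#include <sys/stat.h> // check if file exist
#include <cstring>
#include <cstdio>     // printf
#include <cstdlib>    // rand
#include <functional> // not1, ptr_fun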

using namespace std;

bool exist(const char *name)
{
  struct stat   buffer;
  return !stat(name, &buffer);
}

constexpr int SIZE = 8092 * 1024;

void keep_alnum(string &s) {
    stringstream ss;
    for (size_t i = 0; i < s.size(); i++) // bound by the actual length, not SIZE
        if (isalnum(s[i]))
            ss << s[i];
    s = ss.str();
}

/* my method, best runtime */
void old_school(char *s) {
    int n = 0;
    for (int i = 0; i < SIZE; i++) {
        unsigned char c = s[i] - 0x30; // '0'
        if (c < 10 || (c -= 0x11) < 26 || (c -= 0x20) < 26) // 0x30 + 0x11 = 'A' + 0x20 = 'a'
            s[n++] = s[i];
    }
    s[n] = '\0';
}

void alnum_from_libc(char *s) {
    int n = 0;
    for (int i = 0; i < SIZE; i++) {
        if (isalnum(s[i]))
            s[n++] = s[i];
    }
    s[n] = '\0';
}

#define benchmark(x) printf("\033[30m(%6.0lf cycles) \033[32m%5.1lfms\n\033[0m", x, x / (CLOCKS_PER_SEC / 1000))

int main(int ac, char **av) {
    if (ac < 2) {
        cout << "usage: ./a.out \"{your string}\" or ./a.out FILE \"{your file}\"\n";
        return 1;
    }
    string s;
    s.reserve(SIZE+1);
    string s1;
    s1.reserve(SIZE+1);
    static char s4[SIZE + 1]; // static storage: a buffer of about 8 MB can overflow a typical default stack
    if (ac == 3) { 
        if (!exist(av[2])) {
            for (size_t i = 0; i < SIZE; i++)
              s4[i] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnoporstuvwxyz!@#$%^&*()__+:\"<>?,./'"[rand() % 74];
          s4[SIZE] = '\0';
          ofstream ofs(av[2]);
          if (ofs)
            ofs << s4;
        }
        ifstream ifs(av[2]);
        if (ifs)
          // read the whole file; assign() grows s as needed, whereas copying
          // into s.begin() would write past the end of an empty string
          s.assign((istreambuf_iterator<char>(ifs)), istreambuf_iterator<char>());
        else
          cout << "error\n";
    }
    else
        s = av[1];

    double elapsedTime;
    clock_t start;
    bool is_different = false;

    s1 = s;
    start = clock();
    keep_alnum(s1);
    elapsedTime = (clock() - start);
    benchmark(elapsedTime);
    string tmp = s1;

    s1 = s;
    start = clock();
    s1.erase(std::remove_if(s1.begin(), s1.end(), [](char c) { return !isalnum(c); }), s1.end());
    elapsedTime = (clock() - start);
    benchmark(elapsedTime);
    is_different |= !!strcmp(tmp.c_str(), s1.c_str());

    s1 = s;
    start = clock();
    s1.erase(std::remove_if(s1.begin(), s1.end(), [](char c) { return ispunct(c); }), s1.end());
    elapsedTime = (clock() - start);
    benchmark(elapsedTime);
    is_different |= !!strcmp(tmp.c_str(), s1.c_str());

    s1 = s;
    start = clock();
    s1.erase(remove_if(s1.begin(), s1.end(), not1(ptr_fun( (int(*)(int))isalnum ))), s1.end());
    elapsedTime = (clock() - start);
    benchmark(elapsedTime);
    is_different |= !!strcmp(tmp.c_str(), s1.c_str());

    s1 = s;
    start = clock();
    s1.erase(remove_if(s1.begin(), s1.end(), []( auto const& c ) -> bool { return !isalnum(c); } ), s1.end());
    elapsedTime = (clock() - start);
    benchmark(elapsedTime);
    is_different |= !!strcmp(tmp.c_str(), s1.c_str());

    memcpy(s4, s.c_str(), SIZE);
    start = clock();
    alnum_from_libc(s4);
    elapsedTime = (clock() - start);
    benchmark(elapsedTime);
    is_different |= !!strcmp(tmp.c_str(), s4);

    memcpy(s4, s.c_str(), SIZE);
    start = clock();
    old_school(s4);
    elapsedTime = (clock() - start);
    benchmark(elapsedTime);
    is_different |= !!strcmp(tmp.c_str(), s4);

    cout << "Original length: " << s.size() << ", current len with alnum only: " << strlen(s4) << endl;
    // make sure that strings are equivalent
    printf("\033[3%cm%s\n", ('3' + !is_different), !is_different ? "OK" : "KO");

    return 0;
}

My solution

For the bitwise method, you can check it directly on my GitHub. Basically, I avoid branching instructions (if) thanks to the mask. I avoid posting bitwise operations under the C++ tag, as I get a lot of hate for it.
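The bitwise code itself is not reproduced in this answer. As an illustration only, and not the author's implementation, here is a sketch of one way to filter without an if in the loop body, assuming ASCII input. The names is_ascii_alnum and strip_branchless are hypothetical. The comparisons usually compile to flag-setting instructions rather than jumps, but that is up to the compiler:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: 1 for ASCII [0-9A-Za-z], 0 otherwise.
// Unsigned wrap-around turns each range check into a single comparison.
static inline unsigned is_ascii_alnum(unsigned char ch) {
    unsigned c = ch;
    unsigned digit  = (c - '0') < 10u;            // '0'..'9'
    unsigned letter = ((c | 0x20u) - 'a') < 26u;  // setting bit 0x20 folds 'A'..'Z' onto 'a'..'z'
    return digit | letter;
}

// Hypothetical helper: always writes the character, but only advances
// the write index when the character is kept.
void strip_branchless(std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        s[n] = s[i];
        n += is_ascii_alnum(static_cast<unsigned char>(s[i]));
    }
    s.resize(n);
}
```

Whether this beats the versions above depends on the compiler and the data, so it is worth measuring before adopting it.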

For the C-style one, I iterate over the string with two indices: i walks through the string and n marks where the next kept character goes. Each character is tested in turn for being a digit, an uppercase letter, or a lowercase letter.

Add this function:

void strip_special_chars(char *s) {
    size_t n = 0;
    for (size_t i = 0; s[i] != '\0'; i++) { // stop at the terminator instead of a fixed SIZE
        unsigned char c = s[i] - 0x30; // '0'
        if (c < 10 || (c -= 0x11) < 26 || (c -= 0x20) < 26) // 0x30 + 0x11 = 'A'; 'A' + 0x20 = 'a'
            s[n++] = s[i];
    }
    s[n] = '\0';
}

and use it as:

std::vector<char> s1(s.c_str(), s.c_str() + s.size() + 1); // needs <vector>; copies the '\0' too
strip_special_chars(s1.data());
Antonin GAVREL
2
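#include <algorithm>
#include <cctype>
#include <functional>
#include <iostream>
#include <string>

std::string s = "Hello World!";
// Note: std::ptr_fun was removed in C++17 and std::not1 in C++20.
s.erase(std::remove_if(s.begin(), s.end(),
    std::not1(std::ptr_fun((int(*)(int))std::isalnum))), s.end());
std::cout << s << std::endl;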

Results in:

"HelloWorld"

You use isalnum to determine whether or not each character is alphanumeric, then use ptr_fun to pass the function to not1, which negates the returned value, leaving you with only the alphanumeric characters you want.

Lucas Walter
TankorSmash
2

The remove_copy_if standard algorithm would be very appropriate for your case.
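As a minimal sketch of that suggestion: std::remove_copy_if copies every character for which the predicate is false into a new sequence and leaves the input untouched. The helper name filtered_copy is hypothetical, and the predicate assumes the question's rule (keep letters, digits and spaces):

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <string>

// Hypothetical helper: returns a filtered copy and does not modify the input.
std::string filtered_copy(const std::string& in) {
    std::string out;
    out.reserve(in.size());  // the result can never be longer than the input
    std::remove_copy_if(in.begin(), in.end(), std::back_inserter(out),
        // true means "do not copy"; unsigned char keeps std::isalnum well defined
        [](unsigned char c) { return !(std::isalnum(c) || c == ' '); });
    return out;
}
```

This fits when the original data is still needed afterwards. When it is not, the in-place erase/remove idiom avoids the second buffer.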

Eugen Constantin Dinca
1

You can use the erase-remove idiom this way:

// Removes all punctuation       
s.erase( std::remove_if(s.begin(), s.end(), &ispunct), s.end());
akrita
1

The code below should work for a given string s. It uses the <algorithm> and <cctype> headers.

std::string s("He!!llo  Wo,@rld! 12 453");
s.erase(std::remove_if(s.begin(), s.end(), [](char c) { return !std::isalnum(c); }), s.end());
1

The mentioned solution

s.erase( std::remove_if(s.begin(), s.end(), &std::ispunct), s.end());

is very nice, but unfortunately doesn't work with characters like 'Ñ' in Visual Studio (debug mode), because of this line:

_ASSERTE((unsigned)(c + 1) <= 256)

in isctype.c

So, I would recommend something like this:

inline int my_ispunct( int ch )
{
    return std::ispunct(static_cast<unsigned char>(ch));
}
...
s.erase( std::remove_if(s.begin(), s.end(), &my_ispunct), s.end());
Andres Hurtis
0

The following works for me.

str.erase(std::remove_if(str.begin(), str.end(), &ispunct), str.end());
str.erase(std::remove_if(str.begin(), str.end(), &isspace), str.end());
Pabitra Dash
-1
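// Despite its name, this drops every character that is not a letter.
void remove_spaces(string data) // taken by value: the caller's string is left untouched
{
    size_t i = 0;
    while (i < data.length())
    {
        if (isalpha((unsigned char)data[i]))
            i++;                // keep letters
        else
            data.erase(i, 1);   // drop everything else, digits and spaces included
    }
    cout << data;
}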