3

Using the C++ Standard Template Library function regex_replace(), how do I remove non-numeric characters from a std::string and return a std::string?

This question is not a duplicate of question 747735: that question asks how to use TR1/regex, whereas I'm asking how to use standard STL regex, and the answer given there is merely a set of very complex documentation links. The C++ regex documentation is, in my opinion, extremely hard to understand, so even a question that pointed to the standard C++ regex_replace documentation wouldn't be very useful to new coders.

Volomike

3 Answers

10
// assume #include <regex> and <string>
std::string sInput = R"(AA #-0233 338982-FFB /ADR1 2)";
std::string sOutput = std::regex_replace(sInput, std::regex(R"([\D])"), "");
// sOutput now contains only the digits: 023333898212

Note that R"(...)" is a raw string literal: backslashes inside it are not treated as escape sequences, as they would be in an ordinary C or C++ string literal. This matters for regular expressions because you can write \D directly instead of \\D, which makes your life easier.
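To make the difference concrete, here is a minimal sketch that writes the same pattern both ways. The two function names are hypothetical and exist only for illustration; both functions should behave identically.

```cpp
#include <regex>
#include <string>

// Raw string literal: the backslash reaches the regex engine as written.
std::string strip_non_digits_raw(const std::string& input) {
    return std::regex_replace(input, std::regex(R"(\D)"), "");
}

// Ordinary string literal: the backslash must itself be escaped.
std::string strip_non_digits_escaped(const std::string& input) {
    return std::regex_replace(input, std::regex("\\D"), "");
}
```

The longer the pattern, the more the doubled backslashes of the second form get in the way, which is why the raw form is used throughout this answer.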

Here's a handy list of single-character-class patterns, written as raw string literals, that you can pass to std::regex() in replacement scenarios:

  • R"([^A-Za-z0-9])" or R"([^A-Za-z\d])" = select non-alphabetic and non-numeric
  • R"([A-Za-z0-9])" or R"([A-Za-z\d])" = select alphanumeric
  • R"([0-9])" or R"([\d])" = select numeric
  • R"([^0-9])" or R"([^\d])" or R"([\D])" = select non-numeric
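As a rough sketch of how these patterns behave when the matches are replaced with an empty string, each one can be wrapped in a small function. The helper `remove_matching` and the wrapper names are hypothetical, chosen only for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: deletes every character matched by the given pattern.
std::string remove_matching(const std::string& input, const std::string& pattern) {
    return std::regex_replace(input, std::regex(pattern), "");
}

// One wrapper per pattern in the list (wrapper names are illustrative).
std::string keep_alphanumeric(const std::string& s) {
    return remove_matching(s, R"([^A-Za-z0-9])");  // delete non-alphanumeric
}
std::string drop_alphanumeric(const std::string& s) {
    return remove_matching(s, R"([A-Za-z0-9])");   // delete alphanumeric
}
std::string drop_digits(const std::string& s) {
    return remove_matching(s, R"([0-9])");         // delete numeric
}
std::string keep_digits(const std::string& s) {
    return remove_matching(s, R"([^0-9])");        // delete non-numeric
}
```

For the sample input `AA #-0233 338982-FFB /ADR1 2`, `keep_alphanumeric` yields `AA0233338982FFBADR12` and `keep_digits` yields `023333898212`.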
Volomike
  • 1
    I think you can also use a slightly simplified overload: `sOutput = std::regex_replace(sInput, std::regex("([^0-9])"), "");`. – vsoftco Feb 09 '16 at 02:40
  • 1
    Your `regex_replace` can be simplified to `std::regex_replace(sInput, std::regex(R"([^\d])"), "")` [Demo](http://coliru.stacked-crooked.com/a/cac359fa8918bc87) – Praetorian Feb 09 '16 at 02:49
  • 1
    @Volomike Those are called *raw string literals* and were introduced in C++11. See [this](https://solarianprogrammer.com/2011/10/16/cpp-11-raw-strings-literals-tutorial/) for a brief tutorial. – vsoftco Feb 09 '16 at 03:01
  • Since you're using `\D` now, you don't need the character set; you can just use `R"(\D)"`, aka `"\\D"`. Also, `\w` is very close to alphanumeric but it includes `_`, so `R"([^A-Za-z0-9])"` = `R"([\W_])"` and `R"([A-Za-z0-9])"` = `R"([^\W_])"`. And of course, if you want to include the `_`, you can just use `\w` without the `[]`. – Rick Jun 08 '21 at 12:44
5

Regular expressions are overkill here.

#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>

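// Despite its name, this returns true when ch IS a digit. copy_if keeps the
// characters for which the predicate is true, so only digits are retained.
inline bool not_digit(char ch) {
    return '0' <= ch && ch <= '9';
}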

std::string remove_non_digits(const std::string& input) {
    std::string result;
    std::copy_if(input.begin(), input.end(),
        std::back_inserter(result),
        not_digit);
    return result;
}

int main() {
    std::string input = "1a2b3c";
    std::string result = remove_non_digits(input);
    std::cout << "Original: " << input << '\n';
    std::cout << "Filtered: " << result << '\n';
    return 0;
}
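If modifying the string in place is acceptable, the same filtering can be sketched with the erase-remove idiom and `std::isdigit`, which avoids the `std::back_inserter`. This is a variant, not the code above; the function name `keep_digits_in_place` is hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: erases every character that is not a decimal digit.
void keep_digits_in_place(std::string& s) {
    s.erase(std::remove_if(s.begin(), s.end(),
                           // Take the byte as unsigned char: passing a negative
                           // char value to std::isdigit is undefined behavior.
                           [](unsigned char ch) { return !std::isdigit(ch); }),
            s.end());
}
```

Applied to `"1a2b3c"`, this leaves `"123"` in the string. Like the original, it works byte by byte, so it recognizes only the ASCII digits `0` to `9`.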
Volomike
Pete Becker
  • Have you run a speed test comparison between your routine and the regex_replace? – Volomike Feb 09 '16 at 14:58
  • Also, your routine does the opposite of the question. You're removing digits, not non-digits. I'm also curious if your code will work on UTF-8 strings because you've chosen `char ch`. – Volomike Feb 09 '16 at 15:47
  • @Volomike - fixed, it now removes non-digits. – Pete Becker Feb 09 '16 at 16:12
  • @Volomike - I didn't choose `char ch`, you did. That's what `std::string` traffics in. But, yes, the pattern shown here will work for pretty much any data type in pretty much any container. – Pete Becker Feb 09 '16 at 16:14
  • your karma score and the fact that you're on the C++ STL core team speak volumes. It's just that I read [this article](https://matt.sh/howto-c) about how using `char` is now bad. I'm pretty much a C++ noob in my late 40s (coding since Junior High), but learn super fast. So, if you could shed light on that, I'd appreciate it. My concern is that I need to support languages like Japanese in my work. I'm also interested if I can remove the `back_inserter()` and make the output be the return of `copy_if()`, and if we can get a speed test on this to see which implementation is fastest. – Volomike Feb 09 '16 at 17:41
  • 1
    That article is far over the top, so take it with several grains of salt. "abcd" has type `array of 5 char`, so using `char` to manage it is really the only sensible thing to do. `int` is supposed to be the "natural size" for the architecture, so someone who insists on choosing their own fixed-width types may well end up making sub-optimal choices; for portable code, the compiler writer knows each target system better than you do. And, of course, the problem with using `uint8_t` to mean "byte" is that it won't exist on systems that don't have a native 8-bit hardware type.... – Pete Becker Feb 09 '16 at 19:31
  • 1
    ... there are good reasons for the flexibility that the C integer type system provides; labelling new stuff "modern" begs the question. Newer isn't necessarily better. – Pete Becker Feb 09 '16 at 19:33
  • 1
    @Volomike (regarding the linked article): There's an interesting question what "portable" means. It can either mean to express a problem in sufficiently general terms (`int`) that the program can be built everywhere and will have platform-dependent constraints, or it can mean that you express the precise operational details (`int16_t`) and require each platform to provide those. Each of these concepts has its place, and it seems premature to abandon one in favour of the other. – Kerrek SB Feb 10 '16 at 10:21
  • `not_digit` looks like it should be called `is_digit`. Or better yet, just use `std::isdigit`. – Edward Brey Nov 02 '18 at 14:22
2

The accepted answer is fine for the specifics of the given sample, but it will fail for a number such as "-12.34" (it would result in "1234"). Note that the sample could contain negative numbers.

Then the regex should be:

 (-|\+)?(\d)+(\.(\d)+)?

Explanation: an optional "-" or "+", followed by a digit repeated 1 to n times, optionally followed by a "." and a digit repeated 1 to n times. The dot is escaped as \. so that it matches only a literal "." rather than any character.
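A pattern like this matches the number itself, so it suits extraction with `std::regex_search` rather than removal with `std::regex_replace`. Here is a minimal sketch under that assumption, with the dot escaped so that it matches only a literal "."; the function name `first_number` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first signed decimal number found in the
// input, or an empty string if there is none.
std::string first_number(const std::string& input) {
    // Optional sign, one or more digits, optional fractional part.
    static const std::regex number(R"((-|\+)?\d+(\.\d+)?)");
    std::smatch match;
    if (std::regex_search(input, match, number)) {
        return match.str(0);  // the whole matched text
    }
    return "";
}
```

For the input `"-12.34"` this returns `"-12.34"`, keeping the sign and the decimal point that a plain non-digit filter would strip.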

A bit over-reaching, but I was looking for this and the page showed up first in my search, so I'm adding my answer for future searches.

LoveToCpp