Regex Replacing &#58; to ":" etc

Question

I've got a bunch of strings like:

"Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"

I would like to replace that with

"Hello, here's a test colon:. Here's a test semi-colon;"

And so on for all printable ASCII values.

At present I'm using boost::regex_search to match &#(\d+);, building up a string as I process each match in turn (including appending the substring containing no matches since the last match I found).

Can anyone think of a better way of doing it? I'm open to non-regex methods, but regex seemed a reasonably sensible approach in this case.

Thanks,

Dom

Number in entities is NOT ASCII. It's Unicode codepoint number and it can be outside 0-255 range. — Kornel, Jan 09 '09 at 13:33
... in which case we presumably can leave it untouched. (BTW, printable ASCII range is 32-126) — MSalters, Jan 09 '09 at 13:47
I wouldn't be surprised if there's already a library out there (possibly even as part of boost) to convert XML entities to their utf-8/utf-16 equivalents. — Powerlord, Jan 09 '09 at 15:32
If there is, it's good to know that the formal name of those entities is numeric character reference (NCR). See http://en.wikipedia.org/wiki/Numeric_character_reference – PEZ Jan 09 '09 at 15:50

MSalters · Accepted Answer · 2009-01-09T15:14:37.497

9

The big advantage of using a regex is that it deals with the tricky cases like &#38;#38;. Entity replacement isn't iterative, it's a single step. The regex is also going to be fairly efficient: the two lead characters are fixed, so it will quickly skip anything not starting with &#. Finally, the regex solution is one without a lot of surprises for future maintainers.

I'd say a regex was the right choice.
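
To illustrate the single-step point, here is a small Python sketch (not the asker's C++; the function names are made up for illustration). It contrasts one pass over the input with a loop that keeps rescanning its own output, which is the behaviour to avoid.

```python
import re

NCR = re.compile(r"&#(\d+);")

def decode_once(text):
    """Single pass: every reference in the original input is decoded once."""
    return NCR.sub(lambda m: chr(int(m.group(1))), text)

def decode_until_stable(text):
    """Iterative variant (not what you want): rescans its own output."""
    while True:
        decoded = decode_once(text)
        if decoded == text:
            return decoded
        text = decoded

sample = "&#38;#59;"
print(decode_once(sample))          # &#59;
print(decode_until_stable(sample))  # ;
```

The single pass leaves the freshly produced "&" alone, so the text that follows it is not reinterpreted as a new reference.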

Is it the best regex, though? You know you need two digits, and if you have three digits, the first one will be a 1. Printable ASCII is, after all, &#32;-&#126;. For that reason, you could consider &#1?\d\d;.
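
A quick way to see what a candidate pattern really admits is to enumerate it. The Python sketch below does that for &#1?\d\d; and for a tighter alternation restricted to 32..126 (the helper name accepted_codes is made up for illustration). It suggests the short pattern still needs a range check on the captured number.

```python
import re

# Pattern from the answer: two digits, optionally preceded by a literal 1.
LOOSE = re.compile(r"&#(1?\d\d);")
# Tighter alternation that only admits 32..126.
STRICT = re.compile(r"&#(3[2-9]|[4-9]\d|1(?:[01]\d|2[0-6]));")

def accepted_codes(pattern):
    """Return every n in 0..999 whose reference '&#n;' the pattern matches."""
    return {n for n in range(1000) if pattern.fullmatch(f"&#{n};")}

loose = accepted_codes(LOOSE)
strict = accepted_codes(STRICT)
print(min(loose), max(loose))    # 10 199
print(min(strict), max(strict))  # 32 126
```

So the short pattern also lets through 10..31 and 127..199 (and zero-padded forms such as &#07;), which is harmless only if the replacement step rejects them.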

As for replacing the content, I'd use the basic algorithm described for boost::regex_replace:

For each match // Using regex_iterator<>
    Print the prefix of the match
    Remove the first 2 and last character of the match (&#;)
    lexical_cast the result to int, then truncate to char and append.

Print the suffix of the last match.
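
The steps above can be sketched in Python as follows. This is only an illustration of the loop, not the Boost code: replace_ncrs is a made-up name, and the 32..126 range check is an added assumption (the pseudocode simply truncates to char).

```python
import re

NCR = re.compile(r"&#(1?\d\d);")

def replace_ncrs(text):
    out = []
    last = 0  # end of the previous match
    for m in NCR.finditer(text):
        out.append(text[last:m.start()])  # prefix: text since the last match
        code = int(m.group(1))            # digits without '&#' and ';'
        if 32 <= code <= 126:             # assumption: printable ASCII only
            out.append(chr(code))
        else:
            out.append(m.group(0))        # leave other references untouched
        last = m.end()
    out.append(text[last:])               # suffix after the last match
    return "".join(out)

s = "Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"
print(replace_ncrs(s))
# Hello, here's a test colon:. Here's a test semi-colon;
```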

edited Jan 09 '09 at 15:14

answered Jan 09 '09 at 13:56

MSalters


Good suggestion on the ?\d\d - thanks (and +1). Can you think of a way of doing the replace with a regex as well? – Dominic Rodger Jan 09 '09 at 14:27
Thanks - the algorithm you've given is roughly what I did (though I used &#(1?\d\d); to allow me access to the numeric value without stripping off characters). Running the resulting algorithm 100,000 times over a 320 character string with 20 values to replace takes 10 seconds. Nice! – Dominic Rodger Jan 09 '09 at 16:56
Oh - if speed matters, abuse the fact that you've already validated your input. With match holding the matched text, indexed from 0, the result is then ((match[4]==';') ? (match[2]*10+match[3]) : (100+match[3]*10+match[4])) -'0'*11 – MSalters Jan 14 '09 at 13:54

score 3 · Answer 2 · answered Jan 09 '09 at 14:54


This will probably earn me some down votes, seeing as this is not a C++, Boost or regex response, but here's a SNOBOL solution. This one works for ASCII. I'm working on something for Unicode.

        NUMS = '1234567890'
MAIN    LINE = INPUT                                :F(END)
SWAP    LINE ?  '&#' SPAN(NUMS) . N ';' = CHAR( N ) :S(SWAP)
        OUTPUT = LINE                               :(MAIN)
END

answered Jan 09 '09 at 14:54

bugmagnet


I'm impressed. Can't make sense of that code at all and must now check SNOBOL up. +1 – PEZ Jan 09 '09 at 17:07
This is not correct. It will decode "&#38;#65;" to "A" instead of "&#65;" – Andru Luvisi Jan 09 '09 at 17:46
@Glomek is right, I think the right code may be SWAP LINE ? REM '&#' SPAN(NUMS) . N ';' = CHAR( N ) :S(SWAP) – Jan 09 '09 at 18:01
That doesn't change the input string at all. – Andru Luvisi Jan 09 '09 at 18:31

score 3 · Answer 3 · edited Jan 09 '09 at 21:37


* Repaired SNOBOL4 Solution
* &#38;#38; -> &#38;
     digit = '0123456789'
main line = input                        :f(end)
     result = 
swap line arb . l
+    '&#' span(digit) . n ';' rem . line :f(out)
     result = result l char(n)           :(swap)
out  output = result line                :(main)
end

edited Jan 09 '09 at 21:37

Andru Luvisi


answered Jan 09 '09 at 18:42

"& A" => " A" instead of "& A" – Andru Luvisi Jan 09 '09 at 18:56
I was trying to make it more efficient -- obviously didn't work: try "arb" instead of "break('&')" – Jan 09 '09 at 21:28
I put the change into your post. So far I have not found an input that breaks this version. – Andru Luvisi Jan 09 '09 at 21:39
+1 -- I'm totally impressed that there are still SNOBOL programmers around -- I love that language, but haven't used it in >30 years. – Ken Paul Jan 09 '09 at 21:57

score 2 · Answer 4 · edited Jan 10 '09 at 13:42


I don't know about the regex support in boost, but check if it has a replace() method that supports callbacks or lambdas or some such. That's the usual way to do this with regexes in other languages I'd say.

Here's a Python implementation:

import re

s = "Hello, here's a test colon&#58;. Here's a test semi-colon&#59;"
print(re.sub(r'&#(1?\d\d);', lambda match: chr(int(match.group(1))), s))

Producing:

"Hello, here's a test colon:. Here's a test semi-colon;"

I've looked some at boost now and I see it has a regex_replace function. But C++ really confuses me so I can't figure out if you could use a callback for the replace part. But the string matched by the (\d\d) group should be available in $1 if I read the boost docs correctly. I'd check it out if I were using boost.

edited Jan 10 '09 at 13:42

jfs


answered Jan 09 '09 at 13:31

PEZ


I've added '1?' to the regexp. – jfs Jan 10 '09 at 13:42
The `lambda ..` is wrong. It replaces non-printable ASCII characters e.g., ''. Note: `c` is ASCII-printable iff 31 < ord(c) < 127 (for the sake of html documents). – jfs Jan 10 '09 at 16:36
Yeah, it's mainly meant as an example of the approach. – PEZ Jan 10 '09 at 16:56
To fix the code without touching the lambda you could use a more restrictive regex e.g., r'&#(3[2-9]|[4-9]\d|1(?:[01]\d|2[0-6]));' See http://stackoverflow.com/questions/428013/regex-replacing-58-to-etc#433565 – jfs Jan 12 '09 at 15:54

score 1 · Answer 5 · edited Jan 09 '09 at 23:59


The existing SNOBOL solutions don't handle the multiple-patterns case properly, due to there only being one "&". The following solution ought to work better:

        dd = "0123456789"
        ccp = "#" span(dd) $ n ";" *?(s = s char(n)) fence (*ccp | null)
   rdl  line = input                              :f(done)
   repl line "&" *?(s = ) ccp = s                 :s(repl)
        output = line                             :(rdl)
   done
   end

edited Jan 09 '09 at 23:59

Andru Luvisi


answered Jan 09 '09 at 21:53

the single quote after the # should be a double-quote. sorry. – Jan 09 '09 at 21:55
This converts "&" into "&&" and not "&" as it should. It also converts "A;" to "A" and not "A" as it should, and it doesn't work on codes at the end of the line unless &fullscan is turned on. – Andru Luvisi Jan 10 '09 at 00:26

Mr.Ree · Answer 6 · 2009-01-11T05:24:34.813

1

Ya know, as long as we're off topic here, perl substitution has an 'e' option. As in evaluate expression. E.g.

echo "Hello, here's a test colon&#58;. Here's a test semi-colon&#59;
Further test &#65;. abc.~.def." |
perl -we 'sub translate { my $x=$_[0]; if ( ($x >= 32) && ($x <= 126) )
{ return sprintf("%c",$x); } else { return "&#".$x.";"; } }
while (<>) { s/&#(1?\d\d);/&translate($1)/ge; print; }'

Pretty-printing that:

#!/usr/bin/perl -w

sub translate
{
  my $x=$_[0];

  if ( ($x >= 32) && ($x <= 126) )
  {
    return sprintf( "%c", $x );
  }
  else
  {
    return "&#" . $x . ";" ;
  }
}

while (<>)
{
  s/&#(1?\d\d);/&translate($1)/ge;
  print;
}

Though perl being perl, I'm sure there's a much better way to write that...

Back to C code:

You could also roll your own finite state machine. But that gets messy and troublesome to maintain later on.
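
For comparison, here is roughly what a hand-rolled scanner looks like, sketched in Python for brevity rather than C or C++. The name decode_ncrs, the three-digit cap and the 32..126 check are assumptions of this sketch, not something from the question.

```python
def decode_ncrs(text):
    out = []
    i = 0
    n = len(text)
    while i < n:
        if text.startswith("&#", i):
            j = i + 2
            # Collect at most three decimal digits after '&#'.
            while j < n and j - (i + 2) < 3 and text[j] in "0123456789":
                j += 1
            digits = text[i + 2:j]
            if digits and j < n and text[j] == ";" and 32 <= int(digits) <= 126:
                out.append(chr(int(digits)))
                i = j + 1  # resume after ';' so output is never rescanned
                continue
        out.append(text[i])  # not a reference we decode: copy one character
        i += 1
    return "".join(out)

print(decode_ncrs("colon&#58;. keep &#12; and &#131;"))
# colon:. keep &#12; and &#131;
```

Even this small version carries several index invariants, which is the maintenance cost mentioned above. Unlike the 1?\d\d regex, it also accepts zero-padded forms such as &#058;.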

edited Jan 11 '09 at 05:24

answered Jan 09 '09 at 23:57

Mr.Ree


@mrree: I've posted another perl one-liner http://stackoverflow.com/questions/428013/regex-replacing-58-to-etc#431247 – jfs Jan 10 '09 at 16:37
JF: Umm, thanks. I wondered why the perl code no longer ran. I thought I must have been sleep-coding when I used single-quotes instead of double-quotes in that echo. (It's fixed now.) – Mr.Ree Jan 11 '09 at 05:26
To show examples correctly replace `&#` by `&amp;#` in the markup. btw, single quotes work with `echo` just fine. – jfs Jan 12 '09 at 15:48
"&": I didn't realize SO wasn't translating "&" correctly. I did add a number of '\\' chars to avoid SO misinterpreting other characters. Single/Double quotes have different effects on TCSH/BASH. In this particular case, there are single quotes in the text. echo 'here's a' vs echo "here's a". – Mr.Ree Jan 12 '09 at 22:52

score 1 · Answer 7 · edited May 23 '17 at 12:19


Here's another Perl one-liner (see @mrree's answer):

a test file:

$ cat ent.txt 
Hello, &#12; here's a test colon&#58;. 
Here's a test semi-colon&#59; '&#131;'

the one-liner:

$ perl -pe's~&#(1?\d\d);~
> sub{ return chr($1) if (31 < $1 && $1 < 127); $& }->()~eg' ent.txt

or using more specific regex:

$ perl -pe's~&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);~chr($1)~eg' ent.txt

both one-liners produce the same output:

Hello, &#12; here's a test colon:. 
Here's a test semi-colon; '&#131;'

edited May 23 '17 at 12:19

Community


answered Jan 10 '09 at 16:25

jfs


Very clever! I'm impressed! Once piece of advice: You might want to use '|' rather than '!' as the s// separator character if you plan to run this on the command-line in CSH/TCSH. (! is special even inside single-quotes.) – Mr.Ree Jan 10 '09 at 20:18
@mrree: I've replaced '!' by '~'. – jfs Jan 11 '09 at 00:50

jfs · Answer 8 · 2009-01-21T15:32:36.100

The boost::spirit parser generator framework makes it easy to create a parser that transforms the desired NCRs.

// spirit_ncr2a.cpp
#include <cassert>
#include <cstdio>
#include <iostream>
#include <string>
#include <boost/spirit/include/classic_core.hpp>

int main() {
  using namespace BOOST_SPIRIT_CLASSIC_NS;

  std::string line;
  while (std::getline(std::cin, line)) {
    // Parse outside assert() so the work still happens when NDEBUG is defined.
    const bool ok = parse(line.begin(), line.end(),
         // match "&#(\d+);" where 32 <= $1 <= 126 or any char
         *(("&#" >> limit_d(32u, 126u)[uint_p][&putchar] >> ';')
           | anychar_p[&putchar])).full;
    assert(ok);
    (void)ok; // silence the unused-variable warning in release builds
    putchar('\n');
  }
}

compile:

    $ g++ -I/path/to/boost -o spirit_ncr2a spirit_ncr2a.cpp

run:

    $ echo "Hello, &#12; here's a test colon&#58;." | spirit_ncr2a

output:

    Hello, &#12; here's a test colon:.

score 0 · Answer 9 · answered Jan 09 '09 at 13:53


I did think I was pretty good at regex, but I have never seen lambdas being used in a regex. Please enlighten me!

I'm currently using Python and would have solved it with this one-liner:

''.join([x.isdigit() and chr(int(x)) or x for x in re.split(r'&#(\d+);',THESTRING)])

Does that make any sense?
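
One caveat with the split approach: testing isdigit() on every piece also converts ordinary text that happens to be all digits. Since re.split with a capturing group puts the captured digits at the odd indexes of the result, a sketch like the following (decode_with_split is a made-up name) can use position instead of content. It keeps the \d+ pattern, so it decodes any code point and would raise ValueError for numbers beyond the Unicode range.

```python
import re

def decode_with_split(text):
    parts = re.split(r"&#(\d+);", text)
    # Even indexes hold literal text, odd indexes hold the captured digits.
    return "".join(
        chr(int(part)) if index % 2 else part
        for index, part in enumerate(parts)
    )

print(decode_with_split("&#58;12&#59;"))  # :12;
```

With the isdigit() test, the literal "12" in that input would have been turned into a control character.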

answered Jan 09 '09 at 13:53

UlfR


If you're going for pithiness, (pythyness?), you can take out the [] brackets. FYI. – recursive Jan 09 '09 at 15:26
I added a Python example to my answer in case you're still curious. Instead of the lambda you could call a named/regular function. – PEZ Jan 10 '09 at 00:06
Your answer makes sense, but I think it's clearer to do a re.sub(). – PEZ Jan 10 '09 at 00:07

jfs · Answer 10 · 2009-01-11T21:23:53.817

Here's an NCR scanner created using Flex:

/** ncr2a.y: Replace all NCRs by corresponding printable ASCII characters. */
%%
&#(1([01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]); { /* accept 32..126 */
  /**recursive: unput(atoi(yytext + 2)); skip '&#'; `atoi()` ignores ';' */
  fputc(atoi(yytext + 2), yyout); /* non-recursive version */
}

To make an executable:

$ flex ncr2a.y
$ gcc -o ncr2a lex.yy.c -lfl

Example:

$ echo "Hello, &#12; here's a test colon&#58;. 
> Here's a test semi-colon&#59; '&#131;'
> &#38;#59; <-- may be recursive" \
> | ncr2a

It prints for non-recursive version:

Hello, &#12; here's a test colon:.
Here's a test semi-colon; '&#131;'
&#59; <-- may be recursive

And the recursive one produces:

Hello, &#12; here's a test colon:.
Here's a test semi-colon; '&#131;'
; <-- may be recursive

score 0 · Answer 11 · answered Jan 18 '09 at 17:27

The original problem statement apparently isn't very complete, but if you really only want to trigger on cases which produce characters between 32 and 126, that's a trivial change to the solution I posted earlier. Note that my solution also handles the multiple-patterns case (although this first version wouldn't handle cases where some of the adjacent patterns are in range and others are not).

      dd = "0123456789"
      ccp = "#" span(dd) $ n *lt(n,127) *ge(n,32) ";" *?(s = s char(n))
 +      fence (*ccp | null)
 rdl  line = input                              :f(done)
 repl line "&" *?(s = ) ccp = s                 :s(repl)
      output = line                             :(rdl)
 done
 end

It would not be particularly difficult to handle that case as well (e.g. ;#131;#58; produces ";#131;:"):

      dd = "0123456789"
      ccp = "#" (span(dd) $ n ";") $ enc
 +      *?(s = s (lt(n,127) ge(n,32) char(n), char(10) enc))
 +      fence (*ccp | null)
 rdl  line = input                              :f(done)
 repl line "&" *?(s = ) ccp = s                 :s(repl)
      output = replace(line,char(10),"#")       :(rdl)
 done
 end

jfs · Answer 12 · 2009-01-21T06:01:17.830

Here's a version based on boost::regex_token_iterator. The program replaces decimal NCRs read from stdin by corresponding ASCII characters and prints them to stdout.

#include <cassert>
#include <iostream>
#include <string>
#include <boost/lexical_cast.hpp>
#include <boost/regex.hpp>

int main()
{
  boost::regex re("&#(1(?:[01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]);"); // 32..126
  const int subs[] = {-1, 1}; // non-match & subexpr
  boost::sregex_token_iterator end;
  std::string line;

  while (std::getline(std::cin, line)) {
    boost::sregex_token_iterator tok(line.begin(), line.end(), re, subs);

    for (bool isncr = false; tok != end; ++tok, isncr = !isncr) {
      if (isncr) { // convert NCR e.g., '&#58;' -> ':'
        const int d = boost::lexical_cast<int>(*tok);
        assert(32 <= d && d < 127);
        std::cout << static_cast<char>(d);
      }
      else
        std::cout << *tok; // output as is
    }
    std::cout << '\n';
  }
}
