regex to remove hyphens and spaces

Question

I've got the string:

<u>40 -04-11</u>

How do I remove the spaces and hyphens so it returns 400411?

Currently I've got this:

(<u[^>]*>)(\-\s)(<\/u>)

But I can't figure out why it isn't working. Any insight would be appreciated.

Thanks

[You will bring doom upon us all!](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — cwallenpoole, Sep 06 '11 at 04:32
ah, if I was trying to parse xml or html, I would use a html/xml parser, but it isn't either! It's just a lot of junk in a text file. — itwb, Sep 06 '11 at 04:39
@Eyquem When Pavlov rang a bell, his dogs all thought that it was time for dinner, and so their mouths watered — cwallenpoole, Sep 06 '11 at 11:19

Paul Walls · Accepted Answer · 2011-09-06T05:54:52.187

6

(<u[^>]*>)(\-\s)(<\/u>)

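Your pattern above doesn't tell the regex where to expect digits: (\-\s) matches exactly one hyphen followed by one whitespace character, so nothing else is allowed between the tags.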

(<u[^>]*>)(?:-|\s|(\d+))*(<\/u>)

That should get you started, but not being a python guy, I can't give you the exact replacement syntax. Just be aware that the digits are in a repeating capture group.
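
To illustrate the note about the repeating capture group: in Python's re module, a group that matches several times inside a repetition keeps only its last match. A minimal sketch, using the pattern above on the string from the question (the variable names are only for illustration):

```python
import re

# Pattern from the answer above; group 2 sits inside a repeated (?: ... )*.
pattern = re.compile(r'(<u[^>]*>)(?:-|\s|(\d+))*(<\/u>)')

match = pattern.search('<u>40 -04-11</u>')
groups = match.groups()
print(groups)  # ('<u>', '11', '</u>'): group 2 holds only the last run of digits
```

So the pattern is fine for locating the tag pair, but the digits cannot all be recovered from group 2 in a single match.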

Edit: This is an edit in response to your comment. Like I said, not a python guy, but this will probably do what you need if you hold your tongue just right.

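import re

def repl(matchobj):
    # Keep digit runs (group 1); hyphens and whitespace leave group 1 as None.
    if matchobj.group(1) is None:
        return ''
    return matchobj.group(1)

source = '<u>40 -04-11</u>40 -04-11<u>40 -04-11</u>40 -04-11'
# The lookahead only accepts matches followed by </u> with no other tag in between.
print(re.sub(r'(?:\-|\s|(\d+))(?=[^><]*?<\/u>)', repl, source))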

Results in:

<u>400411</u>40 -04-11<u>400411</u>40 -04-11

If the above offends the Python deities, I promise to sacrifice the next PHP developer I come across. :)

edited Sep 06 '11 at 05:54

answered Sep 06 '11 at 04:29

Paul Walls

I think I can use this for matching, but how do you replace? – itwb Sep 06 '11 at 04:33
@itwb I edited the answer to include an (admittedly rough) Python example. – Paul Walls Sep 06 '11 at 05:45
I don't think they want PHP developers (I know I don't). – tripleee Sep 06 '11 at 06:02

score 3 · Answer 2 · answered Sep 06 '11 at 04:11

You don't really need a regex; you could use:

>>> '<u>40 -04-11</u>'.replace('-','').replace(' ','')
'<u>400411</u>'

wim

This is just one really small part of the puzzle. I've probably got around 200 unsigned int values which are malformed. They're all in the same document; I just need to clear out spaces and hyphens programmatically (only between the <u> tags and </u> tags). – itwb Sep 06 '11 at 04:19
Use an HTML or XML parser, visit each `<u>` node, apply wim's double-replace, then replace the content of the `<u>` node with the patched text. – mu is too short Sep 06 '11 at 04:34
It's not XML; it doesn't have any formatting, even though it seems like it. <u> stands for unsigned int in this situation. – itwb Sep 06 '11 at 04:36

ikegami · Answer 3 · 2011-09-06T04:34:16.727

Using Perl syntax:

s{
   (<u[^>]*>) (.*?) (</u>)
}{
   my ($start, $body, $end) = ($1, $2, $3);
   $body =~ s/[-\s]//g;
   $start . $body . $end       
}xesg;

Or if Python doesn't have an equivalent to /e,

my $out = '';
while (
   $in =~ m{
      \G (.*?) 
      (?: (<u[^>]*>) (.*?) (</u>) | \z )
   }sg
) {
   my ($pre, $start, $body, $end) = ($1, $2, $3, $4);
   $out .= $pre;
   if (defined($start)) {
       $body =~ s/[-\s]//g;
       $out .= $start . $body . $end;
   }
}
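
For what it's worth, Python's re.sub accepts a function as the replacement argument, which plays roughly the same role as /e here. A rough sketch of the first Perl snippet in Python; strip_in_u is a hypothetical helper name, and re.DOTALL stands in for the /s flag:

```python
import re

def strip_in_u(text):
    """Remove hyphens and whitespace between <u ...> and </u> only."""
    def clean(m):
        # Same three pieces as $1, $2, $3 in the Perl version.
        start, body, end = m.group(1), m.group(2), m.group(3)
        return start + re.sub(r'[-\s]', '', body) + end

    # re.sub calls clean() once per <u>...</u> pair; text outside is untouched.
    return re.sub(r'(<u[^>]*>)(.*?)(</u>)', clean, text, flags=re.DOTALL)

result = strip_in_u('<u>40 -04-11</u>40 -04-11')
print(result)  # <u>400411</u>40 -04-11
```

Since the callback only ever sees the matched tag pair, the second (loop-based) variant should not be needed in Python.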

score 1 · Answer 4 · answered Sep 06 '11 at 05:09

I'm admittedly not very good at regexes, but the way I would do this is by:

1. Doing a match on a <u>...</u> pair.
2. Doing a re.sub on the text between the tags, extracted with group().

That looks like this:

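import re

example_str = "<u>   76-6-76s</u> 34243vvfv"
# Grab the text between the first <u ...> and </u> pair.
tmp = re.search(r"(<u[^>]*>)(.*?)(<\/u>)", example_str).group(2)
# Drop every non-digit character (this also removes letters such as the "s").
clean_str = re.sub(r"(\D)", "", tmp)
print(clean_str)  # 76676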

eyquem · Answer 5 · 2011-09-06T07:17:17.727

You should state your problem more clearly. At first I didn't quite understand it.

Having read your comment (only between the <u> and </u> tags), I can now propose:

import re

ss = '87- 453- kol<u>40 -04-11</u> maa78-55 98 12'

print(re.sub(r'(?<=<u>).+?(?=</u>)',
             lambda mat: ''.join(c for c in mat.group() if c not in ' -'),
             ss))

result

87- 453- kol<u>400411</u> maa78-55 98 12
