2

I've got the string:

<u>40 -04-11</u>

How do I remove the spaces and hyphens so it returns 400411?

Currently I've got this:

(<u[^>]*>)(\-\s)(<\/u>)

But I can't figure out why it isn't working. Any insight would be appreciated.

Thanks

itwb

5 Answers

6
(<u[^>]*>)(\-\s)(<\/u>)

Your pattern above doesn't tell your regex where to expect numbers.
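To make that concrete, here is a small sketch (the variable names are just for illustration) showing what the original pattern does and does not accept. Between the tags it allows exactly one hyphen followed by exactly one whitespace character, and nothing else:

```python
import re

# The asker's original pattern: opening tag, then exactly one "-" followed by
# exactly one whitespace character, then the closing tag.
original = re.compile(r'(<u[^>]*>)(\-\s)(<\/u>)')

no_match = original.search('<u>40 -04-11</u>')  # digits are not accounted for
only_match = original.search('<u>- </u>')       # the one shape it accepts

print(no_match)             # None
print(only_match.group(0))  # <u>- </u>
```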

(<u[^>]*>)(?:-|\s|(\d+))*(<\/u>)

That should get you started, but not being a python guy, I can't give you the exact replacement syntax. Just be aware that the digits are in a repeating capture group.
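As a rough illustration of the repeating-capture-group caveat, assuming Python's re module: when a group sits inside a repetition, only its last match is kept, so the pattern matches the whole element but you can't rebuild the number from the groups alone.

```python
import re

pattern = re.compile(r'(<u[^>]*>)(?:-|\s|(\d+))*(</u>)')
m = pattern.search('<u>40 -04-11</u>')

print(m.group(0))  # <u>40 -04-11</u>
# Group 2 matched '40', then '04', then '11'; only the last one is retained.
print(m.group(2))  # 11
```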

Edit: This is an edit in response to your comment. Like I said, not a python guy, but this will probably do what you need if you hold your tongue just right.

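import re

def repl(matchobj):
    # Group 1 is set only when digits matched; hyphens and whitespace leave it None.
    if matchobj.group(1) is None:
        return ''
    return matchobj.group(1)

source = '<u>40 -04-11</u>40 -04-11<u>40 -04-11</u>40 -04-11'
# The lookahead restricts matches to text followed by </u> with no tag in between.
print(re.sub(r'(?:\-|\s|(\d+))(?=[^><]*?<\/u>)', repl, source))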

Results in:

>>>'<u>400411</u>40 -04-11<u>400411</u>40 -04-11'

If the above offends the Python deities, I promise to sacrifice the next PHP developer I come across. :)

Paul Walls
3

You don't really need a regex; you could use:

>>> '<u>40 -04-11</u>'.replace('-','').replace(' ','')
'<u>400411</u>'
wim
  • This is just one really small part of the puzzle. I've probably got around 200 unsigned int values which are malformed. They're all in the same document; I just need to clear out spaces and hyphens programmatically (only between the <u> and </u> tags). – itwb Sep 06 '11 at 04:19
  • 2
    Use an HTML or XML parser, visit each `<u>` node, apply wim's double-replace, then replace the content of the `<u>` node with the patched text. – mu is too short Sep 06 '11 at 04:34
  • It's not XML, and it doesn't have any formatting, even though it seems like it. <u> stands for unsigned int in this situation. – itwb Sep 06 '11 at 04:36
2

Using Perl syntax:

s{
   (<u[^>]*>) (.*?) (</u>)
}{
   my ($start, $body, $end) = ($1, $2, $3);
   $body =~ s/[-\s]//g;
   $start . $body . $end       
}xesg;

Or if Python doesn't have an equivalent to /e,

my $out = '';
while (
   $in =~ m{
      \G (.*?) 
      (?: (<u[^>]*>) (.*?) (</u>) | \z )
   }sg
) {
   my ($pre, $start, $body, $end) = ($1, $2, $3, $4);
   $out .= $pre;
   if (defined($start)) {
       $body =~ s/[-\s]//g;
       $out .= $start . $body . $end;
   }
}
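
For what it's worth, Python does have a counterpart to /e: re.sub accepts a function as the replacement. A rough translation of the first Perl snippet might look like this (a sketch only; the names strip_separators and clean_u_tags are made up for illustration):

```python
import re

def strip_separators(match):
    # Keep the opening and closing tags, strip hyphens and whitespace from the body.
    start, body, end = match.groups()
    return start + re.sub(r'[-\s]', '', body) + end

def clean_u_tags(text):
    # re.DOTALL plays the role of Perl's /s; re.sub already replaces every match (/g).
    return re.sub(r'(<u[^>]*>)(.*?)(</u>)', strip_separators, text, flags=re.DOTALL)

print(clean_u_tags('<u>40 -04-11</u>40 -04-11'))  # <u>400411</u>40 -04-11
```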
ikegami
1

I'm admittedly not very good at regexes, but the way I would do this is by:

  • Doing a match on a <u>...</u> pair
  • Doing a re.sub on the text between the tags, obtained from the match with group().

That looks like this:

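import re

example_str = "<u>   76-6-76s</u> 34243vvfv"
# Grab the text between the <u> tags (group 2), then strip every non-digit.
tmp = re.search(r"(<u[^>]*>)(.*?)(</u>)", example_str).group(2)
clean_str = re.sub(r"\D", "", tmp)
print(clean_str)  # 76676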
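
That only looks at the first <u>...</u> pair. If the document holds many such values, the same two steps could be applied to every pair with re.findall. A minimal sketch, where extract_clean_values is a hypothetical helper name and it is assumed that anything other than digits should be dropped:

```python
import re

def extract_clean_values(text):
    # findall with a single group returns just the text between each tag pair.
    bodies = re.findall(r'<u[^>]*>(.*?)</u>', text, flags=re.DOTALL)
    # Drop every non-digit character from each body.
    return [re.sub(r'\D', '', body) for body in bodies]

sample = "<u>   76-6-76s</u> 34243vvfv <u>40 -04-11</u>"
print(extract_clean_values(sample))  # ['76676', '400411']
```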
John Lyon
1

You should state your problem more clearly. At first I didn't quite understand it.

Having read your comment (only between the <u> and </u> tags), I can now propose:

import re

ss = '87- 453- kol<u>40 -04-11</u> maa78-55 98 12'

print(re.sub('(?<=<u>).+?(?=</u>)',
             lambda mat: ''.join(c for c in mat.group() if c not in ' -'),
             ss))

Result:

87- 453- kol<u>400411</u> maa78-55 98 12
eyquem