
I want to remove numbers with decimal places from a string of text using a regex. Here is what I have so far:

import re
obj = '''This is my #1 user1234@gmail.com <body/> 2 3 4 5 2345! 23542 312453 76666374 56s34534 
        1. _12345_blah@gmail.com 
        1978-12-01 12:00:00 1.23 21.243
        <script>function stripScripts(s) {
            var div = document.createElement('div');
            div.innerHTML = s;
            var scripts = div.getElementsByTagName('script');
            var i = scripts.length;
            while (i--) {
              scripts[i].parentNode.removeChild(scripts[i]);
            }
            return div.innerHTML;
          }</script> 99.258 245.643.3456!'''
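# Raw string so that \S and \d reach the regex engine unchanged.
regex1 = re.compile(r'(?is)(<script[^>]*>)(.*?)(</script>)|(<.*?>)|(?<!\S)\d+(?!\S)')
out1 = regex1.sub(' ', obj)
print(out1)

# Collapse runs of whitespace left behind by the substitutions.
data = ' '.join(out1.split())
print(data)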

This regex removes most of what I need it to, but it leaves 1.23, 21.243 and 99.258. I would like to extend the current regex so that it removes those values as well:

regex = (?is)(<script[^>]*>)(.*?)(</script>)|(<.*?>)|(?<!\S)\d+(?!\S)

admdrew
aeupinhere
  • Are you sure you're not just parsing HTML? [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) might be a good idea if that's the case. – Al.Sal Jul 31 '14 at 18:20
  • Yeah...This is text from all over the place and I am trying to clean it up a bit. HTML is just part of it but is causing the most problems. – aeupinhere Jul 31 '14 at 18:38
  • Obligatory link for people "parsing" non-regular HTML with regular expressions. http://stackoverflow.com/a/1732454/1301972 – Todd A. Jacobs Jul 31 '14 at 18:46

2 Answers

re.sub(r"\d*\.\d+", "", the_text)

Wouldn't that work? Or maybe:

re.sub(r"(\d*\.\d+)|(\d+\.[0-9 ]+)", "", the_text)
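
For illustration, here is a minimal sketch of how the first pattern behaves on a shortened version of the question's text. The sample string and the variable names are assumptions made for this example, not part of the original code.

```python
import re

# Shortened sample based on the question's text (assumed for illustration).
text = "1978-12-01 12:00:00 1.23 21.243 99.258 245.643.3456!"

# \d*  optional leading digits
# \.   a literal dot
# \d+  at least one digit after the dot
decimal = re.compile(r"\d*\.\d+")

stripped = decimal.sub("", text)

# Collapse the whitespace left behind by the removed numbers.
normalized = " ".join(stripped.split())
print(normalized)
```

Note that this pattern is not anchored to whitespace, so it also eats the dotted token 245.643.3456 and leaves only the trailing "!". If such tokens should survive, the pattern likely needs the same lookarounds the question already uses for integers.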
Joran Beasley

Thanks @Joran Beasley! I tried this and it worked.

(?is)(<script[^>]*>)(.*?)(</script>)|(<.*?>)|(?<!\S)\d+(?!\S)|([0-9]+\.[0-9]+ )
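
The added alternative `([0-9]+\.[0-9]+ )` only matches a decimal that is followed by a space. One possible variant, sketched here as an assumption rather than as the accepted fix, is to let the existing whitespace-bounded integer alternative accept an optional fractional part. The sample text is a shortened, hypothetical version of the question's input.

```python
import re

# Shortened sample based on the question's text (assumed for illustration).
text = (
    "This is my #1 user1234@gmail.com <body/> 2 3 2345! 56s34534 "
    "1.23 21.243 <script>var i = 1;</script> 99.258 245.643.3456!"
)

pattern = re.compile(
    r"(?is)<script[^>]*>.*?</script>"  # whole script blocks
    r"|<.*?>"                           # any other tag
    r"|(?<!\S)\d+(?:\.\d+)?(?!\S)"      # standalone integers or decimals
)

# Replace matches with a space, then collapse repeated whitespace.
cleaned = " ".join(pattern.sub(" ", text).split())
print(cleaned)
```

Because the lookarounds require whitespace (or the string boundary) on both sides, tokens such as 2345! and 245.643.3456! are left alone, while 1.23, 21.243 and 99.258 are removed even at the end of a line.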

What is the advantage of adding the first "d" here?

(\d+\.[0-9 ]+)
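
One reading of this question is the difference between `\d*` and `\d+` before the dot. The sketch below compares the patterns on a small hypothetical sample; the string is an assumption chosen to expose the differences.

```python
import re

# Hypothetical sample: a bare ".5", a normal decimal, then loose digits.
sample = "a .5 b 1.23 4 5 c"

# \d* allows zero digits before the dot, so ".5" matches.
optional_lead = re.findall(r"\d*\.\d+", sample)

# \d+ requires at least one digit before the dot, so ".5" is skipped.
required_lead = re.findall(r"\d+\.\d+", sample)

# [0-9 ]+ also accepts spaces, so the match runs on through "4 5 ".
with_spaces = re.findall(r"\d+\.[0-9 ]+", sample)

print(optional_lead)
print(required_lead)
print(with_spaces)
```

The character class `[0-9 ]` includes a space, so that alternative can swallow digits and whitespace that follow a decimal. That may or may not be wanted, depending on the text being cleaned.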
aeupinhere