
I am parsing an email and trying to extract the numeric values from it. My algorithm splits the email on spaces into an array and pulls the values out of that array. The problem is that the HTML markup gets picked up too, so an element in the array looks like this:

   pre-wrap;">32.35</pre>  
   </td>  
   </tr>  

I want to extract just the number, so I tried to filter out everything else, but the result loses the decimal point.

This is the method:

extractedValue = ''.join(filter(lambda i: i.isdigit(), firstString)) 

This returns 3235 and drops the decimal point, because `'.'.isdigit()` is False.

What is the workaround for this?

stack flow
  • 1
    `re.search(r'(\d+\.\d+)', firstString).group(1)`? – geckos Nov 07 '19 at 21:33
  • 1
    that actually worked! thank you! @geckos – stack flow Nov 07 '19 at 21:37
  • 1
    If you are parsing an html document you should actually parse the html with `BeautifulSoup` or similar, get all the visible text (https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text) then regex that for numbers. Don't parse html with regex, unless you have a good reason. Required reading: https://blog.codinghorror.com/parsing-html-the-cthulhu-way/ – qwwqwwq Nov 07 '19 at 22:22
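A minimal sketch of the approach in the comment above, assuming the whole HTML body is available rather than space-split fragments. It uses the standard-library `html.parser` in place of `BeautifulSoup` so it runs without extra installs. The `TextCollector` class and the `html_snippet` value are illustrative, not from the question, and this simple collector does not skip `<script>` or `<style>` contents.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers the text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by HTMLParser for each run of text outside of tags.
        self.chunks.append(data)


# Assumed sample markup, modelled on the fragment shown in the question.
html_snippet = '<tr><td><pre style="white-space: pre-wrap;">32.35</pre></td></tr>'

collector = TextCollector()
collector.feed(html_snippet)
collector.close()

# Run the regex over the text only, so attribute values cannot match.
visible_text = ' '.join(collector.chunks)
numbers = re.findall(r'\d+\.\d+', visible_text)
print(numbers)
```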

1 Answer


As in the comment, here is how to do it with a regular expression:

re.search(r'(\d+\.\d+)', firstString).group(1)
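For reference, here is that line as a small self-contained script, run against the fragment from the question. The variable names `first_string` and `extracted_value` are only illustrative.

```python
import re

# Fragment from the question: HTML debris surrounding the number.
first_string = 'pre-wrap;">32.35</pre>'

# \d+ matches one or more digits, \. a literal dot, \d+ the decimals.
match = re.search(r'(\d+\.\d+)', first_string)
extracted_value = match.group(1)  # text captured by the parentheses

print(extracted_value)         # still a string at this point
print(float(extracted_value))  # convert if a numeric value is needed
```

Note that this pattern requires a decimal point, so a plain integer such as `32` would not match. If integers can also appear, a pattern like `r'\d+(?:\.\d+)?'` is one possible variant.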

In fact, we can call .group() with no argument to get the whole match, which saves a few keystrokes and makes the capturing parentheses unnecessary.

foo = re.search("foo", something).group()
bar = re.search("bar", something).group()

If search does not find a match, it returns None, so the expression becomes None.group(), which raises AttributeError. You can therefore catch AttributeError to handle any non-match:

try:
    foo = re.search("foo", something).group()
    bar = re.search("bar", something).group()
except AttributeError:
    # at least one of the searches did not match
    pass
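An alternative to catching the exception is to test the result of `re.search` for None before calling `.group()`. The sketch below wraps that check in a helper. The name `extract_decimal` and the compiled `NUMBER_RE` are illustrative, not part of the original answer.

```python
import re
from typing import Optional

# Compiled once so the pattern is not re-parsed on every call.
NUMBER_RE = re.compile(r'\d+\.\d+')


def extract_decimal(text: str) -> Optional[str]:
    """Return the first decimal number in text, or None if there is none."""
    match = NUMBER_RE.search(text)
    if match is None:
        return None
    return match.group()


# Fragments like those produced by splitting the email on spaces.
fragments = ['pre-wrap;">32.35</pre>', '</td>', '</tr>']
values = [v for v in map(extract_decimal, fragments) if v is not None]
print(values)
```

This keeps the non-match case explicit, and an AttributeError raised somewhere else in the `try` body cannot be mistaken for a failed search.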

So you can achieve the same result as the first snippet, without the capturing group, with re.search(r'\d+\.\d+', firstString).group()

I hope this helps, Regards

geckos