1

I have the regex below to identify text in an HTML tag, but it doesn't yield the expected result.

HTML Tag:

<td>Issue Amount</td>
<td>:</td>
<td>20,000,000.00</td>

Find = re.findall(?<=Issue Amount</td> <td>:</td> <td>) [0-9,]),soup_string)[0]

I need to get the numerical value 20,000,000.00 from this tag.

Any advice on what I am doing wrong here? I tried a couple of other approaches, but with no success.

  • Sounds like it has an answer [here](https://stackoverflow.com/questions/9833152/python-regular-expressions-extract-every-table-cell-content) or [here](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – vahdet Mar 01 '19 at 14:43
  • Try this regex: `([\d,.]+)`. Here `[\d,.]+` matches one or more digits, commas, or periods, and `()` is a capturing group. – nissim abehcera Mar 01 '19 at 15:20
  • Thanks nissim. It works; however, this is just one part of the HTML body, and it might end up matching other values as well. Within the complete body, the regex should match only this part. – Shashi Shankar Singh Mar 01 '19 at 15:31

2 Answers

2

Do not under any circumstances try to parse XML with a regex unless you wish to invoke rite 666 Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn.

Use an HTML parsing library; see this page for some ways to do it.
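As a rough illustration of the parser route, here is a minimal sketch that uses only the standard library's `html.parser`. The `CellCollector` class and the rule "the value sits two cells after the label" are assumptions made for this example, based on the three `<td>` cells shown in the question; a real page may need a more specific lookup.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <td> cell in order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data  # append text to the current cell


# The fragment from the question, with line breaks between the cells.
html = """<td>Issue Amount</td>
<td>:</td>
<td>20,000,000.00</td>"""

collector = CellCollector()
collector.feed(html)
collector.close()

cells = [c.strip() for c in collector.cells]

# Assumption: the label cell is followed by a ":" cell and then the value.
label_index = cells.index("Issue Amount")
issue_amount = cells[label_index + 2]
print(issue_amount)
```

Because the lookup is keyed on the label cell rather than on the exact characters between the tags, extra whitespace between cells does not affect it.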

However, in your case the regex fails because it looks for a single space between your </td> and <td> tags, whereas your data has line breaks there. You can use the \s metacharacter to match any whitespace character.
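If you do stay with a regex, one way to apply the `\s` suggestion is sketched below. It uses a capturing group instead of a lookbehind, because Python's `re` module requires lookbehinds to have a fixed width and `\s*` does not. The `soup_string` value is an assumed stand-in for the fragment in the question.

```python
import re

# Assumed input: the fragment from the question, with line breaks between tags.
soup_string = """<td>Issue Amount</td>
<td>:</td>
<td>20,000,000.00</td>"""

# \s* tolerates any run of whitespace (spaces, tabs, line breaks) between tags.
# [\d,.]+ matches the whole number, including the commas and the decimal point.
pattern = re.compile(r"Issue Amount</td>\s*<td>:</td>\s*<td>([\d,.]+)</td>")

match = pattern.search(soup_string)
amount = match.group(1) if match else None
print(amount)
```

Anchoring on the `Issue Amount` label also keeps the pattern from matching other numbers elsewhere in the page body.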

JGNI
0

Below is the regex that gave me the desired output. Thanks, all, for your input.

(?<=Issue Amount[td\W]{21})([\d,.]+)
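For anyone reusing this pattern, the short check below shows where the `{21}` comes from and how fragile it is. It is only a sketch: `soup_string` and `crlf_string` are assumed sample inputs, and the count of 21 holds only when the tags are separated by single `\n` characters.

```python
import re

# Assumed input: the fragment from the question, using "\n" line breaks.
soup_string = "<td>Issue Amount</td>\n<td>:</td>\n<td>20,000,000.00</td>"

# The text between the label and the number is exactly 21 characters long:
# "</td>" (5) + "\n" (1) + "<td>:</td>" (10) + "\n" (1) + "<td>" (4).
gap = soup_string[len("<td>Issue Amount"):soup_string.index("20,000,000.00")]
print(len(gap))

# Python requires a fixed-width lookbehind, which is why the count is hard-coded.
pattern = r"(?<=Issue Amount[td\W]{21})([\d,.]+)"
found = re.findall(pattern, soup_string)
print(found)

# With Windows-style "\r\n" line breaks the gap grows to 23 characters,
# so the same pattern no longer matches.
crlf_string = soup_string.replace("\n", "\r\n")
found_crlf = re.findall(pattern, crlf_string)
print(found_crlf)
```

If the spacing between the tags can vary, a pattern with `\s*` and a capturing group, or an HTML parser, is the more robust choice.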