I am parsing some transactions. For example, three of them look like this:

<TR class=DefGVRow>
<TD>29/04/2013</TD>
<TD><A href="javascript:__doPostBack('ctl00$cp$GVMov','Deposito$29/04/2013|0140959158|+|0,00')">DEPOSITO 0140959158</A></TD>
<TD>0140959158</TD>
<TD align=right>336,00</TD>
<TD align=center>+</TD>
<TD align=right>16.210,60</TD></TR>
<TR class=DefGVAltRow>
<TD>29/04/2013</TD>
<TD>RETIRO ATM CTA/CTE</TD>
<TD>1171029739</TD>
<TD align=right>600,00</TD>
<TD align=center>-</TD>
<TD align=right>15.610,60</TD></TR>
<TR class=DefGVRow>
<TD>29/04/2013</TD>
<TD>C.SERV.CAJERO AUT.</TD>
<TD>1171029739</TD>
<TD align=right>3,25</TD>
<TD align=center>-</TD>
<TD align=right>15.607,35</TD></TR>

My current regex is:

<TR class=\w+>
<TD>(?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>\d{4})</TD>
<TD>(?<description>.+?)</TD>
<TD>(?<id>\d{3,30})</TD>
<TD.+?>(?<amount>[\d\.]{1,20},\d{1,10})</TD>
<TD.+?>(?<info>.+?)</TD>
<TD.+?>(?<balance>[\d\.]{1,20},\d{1,10})</TD></TR>

How can I change this part:

<TD>(?<description>.+?)</TD>

so that it handles an optional tag that may wrap the captured text? Basically, how do I ignore the A tag when capturing the group?

Thanks!

Andy Lester
eried
  • For the sake of your sanity, parse this HTML. – Blender May 01 '13 at 01:42
  • Regex is not the best tool for parsing HTML/XML. Look into [XmlDocument](http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx), where you can use XPath to walk its elements and achieve your goal much more easily. – Jean-Bernard Pellerin May 01 '13 at 01:42
  • **Don't use regular expressions to parse HTML**. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester May 01 '13 at 02:30

2 Answers

This is a very common problem. Please check this epic answer and stop using regexes to "parse" HTML; instead, use a proper parser and get what you need with XPath or even a CSS selector.
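As a rough sketch of the parser route, here is one way it could look. It assumes Python's standard-library `html.parser` stands in for whichever parser your platform offers, and `RowParser` is a hypothetical helper written for this illustration. It collects the text of each cell per row, so the optional A tag needs no special handling:

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Hypothetical helper: collect the text of each <TD>, grouped by <TR>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []
        # Any other tag (such as <a>) is ignored; its text is still collected.

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


HTML = """<TR class=DefGVRow>
<TD>29/04/2013</TD>
<TD><A href="javascript:__doPostBack('ctl00$cp$GVMov','Deposito$29/04/2013|0140959158|+|0,00')">DEPOSITO 0140959158</A></TD>
<TD>0140959158</TD>
<TD align=right>336,00</TD>
<TD align=center>+</TD>
<TD align=right>16.210,60</TD></TR>
<TR class=DefGVAltRow>
<TD>29/04/2013</TD>
<TD>RETIRO ATM CTA/CTE</TD>
<TD>1171029739</TD>
<TD align=right>600,00</TD>
<TD align=center>-</TD>
<TD align=right>15.610,60</TD></TR>
"""

parser = RowParser()
parser.feed(HTML)
parser.close()

for date, description, txn_id, amount, sign, balance in parser.rows:
    print(date, description, txn_id, amount, sign, balance)
```

Each row comes back as a plain list of six strings; converting the date and the comma-decimal amounts is left to the caller.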

Community
fotanus
  • I see. Anyway, I found a way to define optional groups, so I will answer my own question. BTW, I am sure regex is not the best tool for parsing HTML, but in this case the HTML is VERY fixed. – eried May 01 '13 at 02:13
  • OK, no problem if you feel this way. I just want to alert anyone else who stumbles onto this page. – fotanus May 01 '13 at 02:17

Wrapping the link's opening and closing tags in optional non-capturing groups keeps the 'optional' link out of the description:

<TR class=\w+>
<TD>(?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>\d{4})</TD>
<TD>(?:<A href=".*>)?(?<description>.+?)(?:</A>)?</TD>
<TD>(?<id>\d{3,30})</TD>
<TD.+?>(?<amount>[\d\.]{1,20},\d{1,10})</TD>
<TD.+?>(?<info>.+?)</TD>
<TD.+?>(?<balance>[\d\.]{1,20},\d{1,10})</TD></TR>
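
For anyone who wants to try the idea quickly, here is a minimal sketch in Python. It is an adaptation, not the exact pattern: Python writes named groups as `(?P<name>...)`, and this version uses `[^>]*` for tag attributes and `\s*` between cells so it can be written as one concatenated string. The names `ROW` and `SAMPLE` are just for this illustration.

```python
import re

# Optional non-capturing groups, (?:...)?, let the <A ...> and </A> tags be
# present or absent without becoming part of the "description" capture.
ROW = re.compile(
    r'<TR class=\w+>\s*'
    r'<TD>(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{4})</TD>\s*'
    r'<TD>(?:<A [^>]*>)?(?P<description>.+?)(?:</A>)?</TD>\s*'
    r'<TD>(?P<id>\d{3,30})</TD>\s*'
    r'<TD[^>]*>(?P<amount>[\d.]{1,20},\d{1,10})</TD>\s*'
    r'<TD[^>]*>(?P<info>.+?)</TD>\s*'
    r'<TD[^>]*>(?P<balance>[\d.]{1,20},\d{1,10})</TD></TR>'
)

SAMPLE = """<TR class=DefGVRow>
<TD>29/04/2013</TD>
<TD><A href="javascript:__doPostBack('ctl00$cp$GVMov','Deposito$29/04/2013|0140959158|+|0,00')">DEPOSITO 0140959158</A></TD>
<TD>0140959158</TD>
<TD align=right>336,00</TD>
<TD align=center>+</TD>
<TD align=right>16.210,60</TD></TR>
<TR class=DefGVAltRow>
<TD>29/04/2013</TD>
<TD>RETIRO ATM CTA/CTE</TD>
<TD>1171029739</TD>
<TD align=right>600,00</TD>
<TD align=center>-</TD>
<TD align=right>15.610,60</TD></TR>
"""

# One dict of named groups per matched transaction row.
rows = [m.groupdict() for m in ROW.finditer(SAMPLE)]

for row in rows:
    print(row["description"], row["info"], row["amount"])
```

The row with the link and the row without it both match, and in both cases `description` holds only the visible text.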
eried