0

I am very lost with this regex. I have an HTML table with three fields: Date, Name, and Place. The first record of the table doesn't have the "Place" field (I cannot change the table format). At the moment I am using the pattern below:

^<tr><td.*>(.+)<\/td><td>(.+)<\/td><td><font.*>(.+)<\/font><\/td><\/tr> $\n<tr><td.*>(.+)<\/td><\/tr>

This pattern ignores the first record of the table (the one that doesn't have the "Place" field). I don't want to create two patterns for the same text. Can anyone help with this issue?

A sample of table:

<table  border cellpadding=1 hspace=10> 
<colgroup style='font:8pt Tahoma;color=Black' valign=top><colgroup style='font:8pt Tahoma; color=Navy'><colgroup style='font:8pt Tahoma;color=Maroon'> 
<tr> 
<td><font FACE=Tahoma color='#CC0000' size=2><b>Date</b></font></td> 
<td><font FACE=Tahoma color='#CC0000' size=2><b>Name</b></font></td> 
<td><font FACE=Tahoma color='#CC0000' size=2><b>Place</b></font></td> 
</tr> 
<tr><td rowspan=2>17/08/2011 10:28</td><td>Vivamus sed est ut lorem tempor cursus</td><td><FONT COLOR="000000">Curabitur egestas metus bibendum</font></td></tr> 
<tr><td colspan=2>Curabitur id urna elit</td></tr> 
<tr><td rowspan=2>17/08/2011 10:26</td><td>UDonec blandit nisl ut nisl elementum</td><td><FONT COLOR="000000"> hendrerit vel ante</font></td></tr> 
<tr><td colspan=2>Etiam nec mollis</td></tr> 
<tr><td rowspan=2>12/08/2011 09:46</td><td>Nulla et eros a massa</td><td><FONT COLOR="000000">Aenean in mauris eget tellus </font></td></tr> 
<tr><td colspan=2>Nulla et eros a massa tristique blandit </td></tr> 
<tr><td rowspan=2>12/08/2011 09:45</td><td>orta mi dapibus sit amet. Vestib</td><td><FONT COLOR="000000"> mollis erat consectetur.</font></td></tr> 
<tr><td colspan=2>sodales tempor</td></tr> 
<tr><td rowspan=1>11/08/2011 10:39</td><td>lorem ipsum</td><td><FONT COLOR="000000">dolor</font></td></tr>
</TABLE> 

My current solution is to create two regexes. The first regex captures the table without the first record:

^<tr><td.*>(.+)<\/td><td>(.+)<\/td><td><font.*>(.+)<\/font><\/td><\/tr> $\n<tr><td.*>(.+)<\/td><\/tr>

And the second regex captures the first record:

^<tr><td.*>(.+)<\/td><td>(.+)<\/td><td><font.*>(.+)<\/font><\/td><\/tr> $
Stefhan
  • I would recommend against using regex to parse HTML. Try some HTML parser. – Markus Hedlund Aug 21 '11 at 23:00
  • [You shouldn't try to parse HTML with RegEx](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Bohemian Aug 21 '11 at 23:24
  • Ignore these kneejerk answers. [You can certainly parse HTML with modern patterns](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491). However, anyone who has to ask how to do that probably doesn't have the skillset to carry it out, and showing you how might do more harm than good. Still, it looks like this is one of those cases **you should use regexes** for. Is this with a toy language, or do you have PCRE or better? You have not asked a real question here. Show desired output, not just desired input. **Use `.*?` not `.*` !** – tchrist Aug 22 '11 at 00:21
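
To illustrate the `.*?` versus `.*` advice in the last comment, here is a minimal Python sketch (Python is an assumption; the question does not name a language). It runs the opening fragment of the question's pattern, once greedy and once lazy, against one row of the sample table. The names `ROW`, `greedy`, and `lazy` are made up for this example.

```python
import re

# One data row copied from the question's sample table.
ROW = (
    '<tr><td rowspan=2>17/08/2011 10:28</td>'
    '<td>Vivamus sed est ut lorem tempor cursus</td>'
    '<td><FONT COLOR="000000">Curabitur egestas metus bibendum</font></td></tr>'
)

# Greedy quantifiers: ".*" runs to the end of the line and backtracks,
# so the capture lands in the last cell rather than the first one.
greedy = re.search(r"<td.*>(.+)</td>", ROW)

# Lazy quantifiers: ".*?" and ".+?" stop at the first place the rest
# of the pattern can match, so the capture is the first cell's text.
lazy = re.search(r"<td.*?>(.+?)</td>", ROW)

print(repr(greedy.group(1)))
print(repr(lazy.group(1)))
```

The greedy version captures `'Curabitur egestas metus bibendum</font>'`, while the lazy version captures `'17/08/2011 10:28'`, which is the date cell the pattern was meant to pick up.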

3 Answers

1

More formally, XML and associated languages are not regular languages, which is why they are unsuited for parsing by regular expressions. Short of writing your own recursive descent parser, your best bet is to use an existing solution.
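
As a rough sketch of what "use an existing solution" could look like, the following uses `html.parser.HTMLParser` from Python's standard library (the choice of Python and of this particular parser is an assumption, since the question names no platform). The class name `RowCollector` and the `SAMPLE` excerpt are invented for this illustration.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <FONT> and <font> look the same.
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Three rows excerpted from the question's sample table.
SAMPLE = """
<tr><td rowspan=2>17/08/2011 10:28</td><td>Vivamus sed est ut lorem tempor cursus</td><td><FONT COLOR="000000">Curabitur egestas metus bibendum</font></td></tr>
<tr><td colspan=2>Curabitur id urna elit</td></tr>
<tr><td rowspan=1>11/08/2011 10:39</td><td>lorem ipsum</td><td><FONT COLOR="000000">dolor</font></td></tr>
"""

parser = RowCollector()
parser.feed(SAMPLE)
parser.close()

for cells in parser.rows:
    print(cells)
```

Rows with three cells are the main records and rows with one cell are the follow-up rows, so a record without a follow-up row needs no special pattern; it simply has no one-cell row after it.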

Alan Moore
justkris
1

Because this is such an easy problem, it does not require anything hard. What you were missing is that you need to use minimal matching quantifiers instead of maximal matching ones. You also need casefolding because of font vs FONT.
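
Applied to the original question, those two fixes plus an optional group might look like the Python sketch below. This is only an illustration under assumptions: Python's `re` module stands in for whatever engine the asker uses, the group names (`date`, `name`, `place`, `detail`) are invented here, and `SAMPLE` contains data rows only, because a pattern this loose would also match the header row.

```python
import re

# One pattern for both kinds of record: the follow-up <tr> with the
# colspan cell is wrapped in an optional non-capturing group.
RECORD = re.compile(
    r"""
    <tr> \s* <td [^<>]*? > (?P<date> .+? ) </td>
         \s* <td [^<>]*? > (?P<name> .+? ) </td>
         \s* <td [^<>]*? > \s* <font [^<>]*? > (?P<place> .+? ) </font> \s* </td>
         \s* </tr>
    (?:                                   # optional follow-up row
         \s* <tr> \s* <td \s+ colspan [^<>]*? > (?P<detail> .+? ) </td> \s* </tr>
    )?
    """,
    re.IGNORECASE | re.VERBOSE,           # font vs FONT, readable layout
)

# Data rows excerpted from the question's sample table.
SAMPLE = """
<tr><td rowspan=2>17/08/2011 10:28</td><td>Vivamus sed est ut lorem tempor cursus</td><td><FONT COLOR="000000">Curabitur egestas metus bibendum</font></td></tr>
<tr><td colspan=2>Curabitur id urna elit</td></tr>
<tr><td rowspan=1>11/08/2011 10:39</td><td>lorem ipsum</td><td><FONT COLOR="000000">dolor</font></td></tr>
"""

records = [m.groupdict() for m in RECORD.finditer(SAMPLE)]

for record in records:
    print(record)
```

The record without a follow-up row still matches; its `detail` group is simply `None`, so a second pattern is not needed.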

This is a trivial demo of one simplistic approach that works for your captive/canned/fixed dataset:

#!/usr/bin/env perl
use strict;
use warnings;

# Read the sample table one line at a time from the DATA section below.
while (<DATA>) {
    # Report every <td> whose content is wrapped in a <font> element.
    # Flags: /g repeat within the line, /s dot matches newline,
    # /i ignore case (font vs FONT), /x allow whitespace in the pattern.
    # $1 is the font tag's attribute text, $2 is the text inside it.
    print "FONT='$1' CONTENTS='$2'\n" while m{
        <td [^<>]*? >
            <font \s+ ([^<>]*?) >
                ( .*? )
            </font>
        </td>
    }gsix;
}
__END__
<table  border cellpadding=1 hspace=10>
<colgroup style='font:8pt Tahoma;color=Black' valign=top><colgroup style='font:8pt Tahoma; color=Navy'><colgroup style='font:8pt Tahoma;color=Maroon'>
<tr>
<td><font FACE=Tahoma color='#CC0000' size=2><b>Date</b></font></td>
<td><font FACE=Tahoma color='#CC0000' size=2><b>Name</b></font></td>
<td><font FACE=Tahoma color='#CC0000' size=2><b>Place</b></font></td>
</tr>
<tr><td rowspan=2>17/08/2011 10:28</td><td>Vivamus sed est ut lorem tempor cursus</td><td><FONT COLOR="000000">Curabitur egestas metus bibendum</font></td></tr>
<tr><td colspan=2>Curabitur id urna elit</td></tr>
<tr><td rowspan=2>17/08/2011 10:26</td><td>UDonec blandit nisl ut nisl elementum</td><td><FONT COLOR="000000"> hendrerit vel ante</font></td></tr>
<tr><td colspan=2>Etiam nec mollis</td></tr>
<tr><td rowspan=2>12/08/2011 09:46</td><td>Nulla et eros a massa</td><td><FONT COLOR="000000">Aenean in mauris eget tellus </font></td></tr>
<tr><td colspan=2>Nulla et eros a massa tristique blandit </td></tr>
<tr><td rowspan=2>12/08/2011 09:45</td><td>orta mi dapibus sit amet. Vestib</td><td><FONT COLOR="000000"> mollis erat consectetur.</font></td></tr>
<tr><td colspan=2>sodales tempor</td></tr>
<tr><td rowspan=1>11/08/2011 10:39</td><td>lorem ipsum</td><td><FONT COLOR="000000">dolor</font></td></tr>
</TABLE>

When run, this produces the following output:

FONT='FACE=Tahoma color='#CC0000' size=2' CONTENTS='<b>Date</b>'
FONT='FACE=Tahoma color='#CC0000' size=2' CONTENTS='<b>Name</b>'
FONT='FACE=Tahoma color='#CC0000' size=2' CONTENTS='<b>Place</b>'
FONT='COLOR="000000"' CONTENTS='Curabitur egestas metus bibendum'
FONT='COLOR="000000"' CONTENTS=' hendrerit vel ante'
FONT='COLOR="000000"' CONTENTS='Aenean in mauris eget tellus '
FONT='COLOR="000000"' CONTENTS=' mollis erat consectetur.'
FONT='COLOR="000000"' CONTENTS='dolor'

In general, you want to pull this stuff out a piece at a time.
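
One way to read "a piece at a time" is to split the work into small passes: first find the rows, then the cells inside each row, then strip whatever markup is left in a cell. The Python sketch below shows that idea under assumptions; the names `ROW`, `CELL`, `TAG`, and `rows_of_cells` are invented for it, and it is only meant for fixed, known input like the sample table.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

ROW = re.compile(r"<tr[^<>]*>(.*?)</tr>", FLAGS)   # pass 1: whole rows
CELL = re.compile(r"<td[^<>]*>(.*?)</td>", FLAGS)  # pass 2: cells in a row
TAG = re.compile(r"<[^<>]+>")                      # pass 3: leftover tags


def rows_of_cells(html):
    """Return one list of plain-text cell values per table row."""
    rows = []
    for row in ROW.finditer(html):
        cells = [
            TAG.sub("", cell.group(1)).strip()
            for cell in CELL.finditer(row.group(1))
        ]
        rows.append(cells)
    return rows


# The header row and one record excerpted from the sample table.
SAMPLE = """<tr>
<td><font FACE=Tahoma color='#CC0000' size=2><b>Date</b></font></td>
<td><font FACE=Tahoma color='#CC0000' size=2><b>Name</b></font></td>
<td><font FACE=Tahoma color='#CC0000' size=2><b>Place</b></font></td>
</tr>
<tr><td rowspan=2>12/08/2011 09:45</td><td>orta mi dapibus sit amet. Vestib</td><td><FONT COLOR="000000"> mollis erat consectetur.</font></td></tr>
<tr><td colspan=2>sodales tempor</td></tr>
"""

table = rows_of_cells(SAMPLE)

for cells in table:
    print(cells)
```

Each pass uses a short pattern that is easy to check on its own, which is the trade-off against a single long expression.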

It would be better to generate it correctly in the first place, but one does what one must.

So, have you got any hard problems? :)

Sure, there are a million things this doesn't handle, but so what?

  • First of all, if I have to handle any of those million things, I very most certainly can.
  • But more importantly, in well-defined HTML, those steps are not necessary, which means that simple patterns like this are perfectly fine.

Don't fall into the trap of overdesigning a million-dollar solution when you don't need it.

tchrist
  • Dissenting opinion: I agree that modern RegEx's are capable of parsing a wide range of inputs. But if a free domain-specific parser exists for this type of input (HTML, XML, etc.) on your platform, why reinvent the wheel? HTML parsers handle the edge cases so you don't have to worry about them. – TrueWill Aug 22 '11 at 01:10
  • @TrueWill: Many HTML editing tasks can be trivially taken care of in your editor with a search and replace -- and should be. For example, let's say you want to flip the color around on the cells of a particular table. It would be massive overkill to the point of irremediable idiocy to go off and write a dedicated program that sucks in some god-awful-huge super-spectacular HTML parsing class with 1300 options just to do that. And that is what people hereabouts keep insisting be done. It's sheer nonsense. – tchrist Aug 22 '11 at 02:49
  • @TrueWill: When people say not to use regexes on HTML, they are effectively saying don't use a text editor to edit HTML, to go off and write a brand new program and eschew existing tried and true programs we've used for decades. HTML is not some finicky binary format needing extreme wizardry to diddle. It's text. **You can use text tools on HTML**, and it is a cruel disservice to tell people they cannot. – tchrist Aug 22 '11 at 03:01
  • Agreed. If it's known HTML and you're editing it in a text editor, search-and-replace (including with RegEx) can be a reasonable solution. – TrueWill Aug 22 '11 at 03:42
0

There is only one correct answer. Don't use regex to parse HTML. Just don't. Don't even think about it. It will bring you nothing but pain.

Winston Ewert
  • Wrong. [It's easy](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491). However, you don't need a real parser here. This is one of those limited situations where regexes are the optimal solution: very well-defined capture HTML. – tchrist Aug 22 '11 at 00:23
  • @tchrist, I wouldn't describe what you've done there as *easy*. As you point out, you don't *need* a real parser here. But in my view, using an HTML parser will produce more readable code than a regex. And because somebody else has already debugged my HTML parser, I don't need to worry about getting it right like I'd have to do with a regex. In the end, if you struggle getting the regex right for this, you shouldn't be using a regex. – Winston Ewert Aug 22 '11 at 03:17
  • Who says you need code? Maybe you're in an editor. It's hard to (usefully) call an HTML parser from a text editor. Imagine a vi command like `//
    /`. A text editor is for editing text. Until we forbid people from editing HTML using a text editor, these things will be forever asked, and it is a disservice to them to pretend a simple transformation such as my vi (well, ex) command above is inappropriate in all cases. That is silly. Regexes are the user-friendly solution for such as this. (Plus **my** regexes are certainly made to be read.)
    – tchrist Aug 22 '11 at 03:29
  • @tchrist, my answer is based on the assumption that the html is being manipulated in a script. The situation changes if the OP is editing the html directly. – Winston Ewert Aug 22 '11 at 03:46