
I've stumbled on a bit of a challenge here: how to get the contents of an HTML table with the help of a regular expression. Let's say this is our table:

<table someprop=2 id="the_table" otherprop="val">
    <tr>
        <td>First row, first cell</td>
        <td>Second cell</td>
    </tr>
    <tr>
        <td>Second row</td>
        <td>...</td>
    </tr>
    <tr>
        <td>Another row, first cell</td>
        <td>Last cell</td>
    </tr>
</table>

I already found a method that works, but it involves running multiple regular expressions in steps:

  1. Get the right table (there may be more than one table in the document) and put its rows in back-reference 1:

    <table[^>]*?id="the_table"[^>]*?>(.*?)</table>

  2. Get the rows of the table and put the cells in back-reference 1:

    <tr.*?>(.*?)</tr>

  3. And lastly fetch the cell contents in back-reference 1:

    <td.*?>(.*?)</td>

Now this is all good, but it would be infinitely more awesome to do this all using one fancy regular expression... Does someone know if this is possible?
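For reference, the three steps above can be sketched as one small routine. This is only an illustration in JavaScript, not the code actually in use: `extractCells` is a hypothetical helper name, `[\s\S]` stands in for "any character including newlines", and the table id is assumed to contain no regex metacharacters.

```javascript
// Hypothetical helper: applies the three patterns from the steps above in sequence.
// Assumption: tableId contains no regex metacharacters (it is not escaped here).
function extractCells(html, tableId) {
  // Step 1: isolate the body of the table with the given id.
  const tableRe = new RegExp(
    '<table[^>]*?id="' + tableId + '"[^>]*?>([\\s\\S]*?)</table>'
  );
  const table = tableRe.exec(html);
  if (!table) return null; // no table with that id

  // Step 2: one match per row. Step 3: one match per cell inside that row.
  const rows = [];
  for (const tr of table[1].matchAll(/<tr[\s\S]*?>([\s\S]*?)<\/tr>/g)) {
    const cells = [];
    for (const td of tr[1].matchAll(/<td[\s\S]*?>([\s\S]*?)<\/td>/g)) {
      cells.push(td[1]);
    }
    rows.push(cells);
  }
  return rows;
}
```

Called as `extractCells(html, 'the_table')`, it should give an array of rows, each row being an array of cell strings.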

the_source
  • [You shouldn't try to parse HTML with RegEx](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Bohemian Sep 06 '11 at 09:49
  • Use an (X)HTML or even XML parser to easily extract the needed information without bloody regular expressions. – Daniel O'Hara Sep 06 '11 at 09:50
  • I agree with Bohemian and Dave Newman (hence the upvotes), but sometimes the problem is simple enough to be solved with regex. This looks too complicated for one (understandable) regex though. – Paul Grime Sep 06 '11 at 09:52
  • Yes, I know all that, but I'm not going to incorporate an XML parser just for this single task in my program. And why would anyone vote my question down? It is clear, it shows research effort... wtf man? – the_source Sep 06 '11 at 09:55
  • @the_source Thanks to the link given to you by Bohemian, everyone who asks for a regex for HTML is marked as a n00b :-) (I'm torn between giving you a +1 because you showed effort and a -1 for asking for a regex; in general, I'm against regexes!! So in the end I'll give you a +0 and mark one of your comments as a Great Comment!) – xanatos Sep 06 '11 at 13:24

2 Answers


There really isn’t a possible regex solution that works for an arbitrary number of table data and puts each cell into a separate back reference. That’s because with backreferences, you need to have a distinct open paren for each backref you want to create, and you don’t know how many cells you have.
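To make that limitation concrete, here is a small JavaScript sketch (the sample row and variable names are made up for illustration). A repeated group does match every cell, but since there is only one pair of parentheses, only the last value it matched is kept.

```javascript
// One pattern that tries to capture every cell of a row with a repeated group.
const sampleRow = '<tr><td>a</td><td>b</td><td>c</td></tr>';
const single = /<tr>(?:<td>(.*?)<\/td>)+<\/tr>/.exec(sampleRow);

// The repetition matched three cells, but the single capture group
// only remembers its last iteration: captures is ['c'].
const captures = single.slice(1);

// A global match applied repeatedly yields one result per cell instead:
// allCells is ['a', 'b', 'c'].
const allCells = Array.from(sampleRow.matchAll(/<td>(.*?)<\/td>/g), (c) => c[1]);
```

Some regex engines can expose every iteration of a repeated group, but that is an engine-specific feature rather than something ordinary backreferences give you.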

There’s nothing wrong with using looping of one or another sort to pull out the data. For example, on the last one, in Perl it would be this, given that $tr already contains the row you need:

@td = ( $tr =~ m{<td.*?>(.*?)</td>}sg );

Now $td[0] will contain the contents of the first <td>, $td[1] the contents of the second one, etc. If you wanted a two-dimensional array, you might wrap that in a loop like this to populate a new @cells variable:

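our $table;  # assumed to hold the full table
my @cells;
# In list context, the /g match returns the body of every row; loop over them.
for my $tr ($table =~ m{<tr.*?>(.*?)</tr>}sg) {
    # The inner list-context match returns every cell of this row.
    push @cells, [ $tr =~ m{<td.*?>(.*?)</td>}sg ];
}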

Now you can do two-dimensional addressing, allowing for $cells[0][0], etc. The outer explicit loop processes the table a row at a time, and the inner implicit loop pulls out all the cells.

That will work on the canned sample data you showed. If that’s good enough for you, then great. Use it and move on.

What Could Ever Be Wrong With That?

However, there are actually quite a few assumptions in your patterns about the contents of your data, ones I don’t know that you’re aware of. For one thing, notice how I’ve used /s so that it doesn’t get stuck on newlines.

But the main problem is that minimal matches aren’t always quite what you want here. At least, not in the general case. Sometimes they aren’t as minimal as you think, matching more than you want, and sometimes they just don’t match enough.

For example, a pattern like <i>(.*?)</i> will get more than you want if the string is:

<i>foo<i>bar</i>ness</i>

Because you will end up matching the string <i>foo<i>bar</i>.
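The same thing can be checked quickly in JavaScript (variable names are only for illustration). Switching to a greedy match does not fix it either, because then sibling elements get glued together; neither form can count nesting.

```javascript
// Lazy match on nested tags stops at the first closing tag.
const nested = '<i>foo<i>bar</i>ness</i>';
const lazyMatch = /<i>(.*?)<\/i>/.exec(nested);
// lazyMatch[0] is '<i>foo<i>bar</i>', lazyMatch[1] is 'foo<i>bar'

// Greedy match on siblings runs on to the last closing tag.
const siblings = '<i>a</i> and <i>b</i>';
const greedyMatch = /<i>(.*)<\/i>/.exec(siblings);
// greedyMatch[1] is 'a</i> and <i>b'
```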

The other common problem (and not counting the uncommon ones) is that a pattern like <tag.*?> may match too little, such as with

<img alt=">more" src="somewhere">

Now if you use a simplistic <img.*?> on that, you would only capture <img alt=">, which is of course wrong.
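One way around this particular problem is to make the pattern skip over quoted attribute values. The following JavaScript sketch compares the two; the quote-aware pattern is my own illustration and assumes attribute values are properly quoted with matching quotes.

```javascript
const imgTag = '<img alt=">more" src="somewhere">';

// Naive pattern: stops at the first '>' even though it sits inside a quoted value.
const naive = /<img.*?>/.exec(imgTag)[0];
// naive is '<img alt=">'

// Quote-aware pattern: consume either a plain character or a whole quoted string.
const quoteAware = /<img(?:[^>"']|"[^"]*"|'[^']*')*>/.exec(imgTag)[0];
// quoteAware is the complete tag
```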

I think the last major remaining problem is that you have to altogether ignore certain things in parsing. The simplest demo of this is embedded comments (also <script>, <style>, and CDATA), since you could have something like

<i> some <!-- secret</i>  --> stuff </i>

which will throw off something like <i>(.*?)</i>.
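A rough JavaScript sketch of the effect, and of the usual workaround of removing comments before matching (variable names are only illustrative). The comment-stripping pattern is itself naive: it assumes `<!--` never appears inside a script, a style block, or an attribute value.

```javascript
const commented = '<i> some <!-- secret</i>  --> stuff </i>';

// Matching directly stops at the closing tag hidden inside the comment.
const rawCapture = /<i>(.*?)<\/i>/.exec(commented)[1];
// rawCapture is ' some <!-- secret'

// Strip comments first (naive: assumes comments only occur in ordinary markup).
const withoutComments = commented.replace(/<!--[\s\S]*?-->/g, '');
const cleanCapture = /<i>(.*?)<\/i>/.exec(withoutComments)[1];
// cleanCapture now contains 'some' and 'stuff' but not 'secret'
```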

There are ways around all these, of course. Once you’ve done so, and it is really quite a bit of effort, you’ll find that you have built yourself a real parser, complete with a lot of auxiliary logic, not just one pattern.

Even then you are only processing well-formed input strings. Error recovery and failing softly is an entirely different art.

tchrist
  • Thanks, I already use this approach right now, though in C++. You made it clear to me that it can't be done in one regex. I don't mind the quirks I get with it: IMO `<i>foo<i>bar</i>ness</i>` is bad HTML, and shouldn't `>more` be `&gt;more`? But it doesn't matter, I only need the contents of the cells; some other piece of code cleans up the rest :) – the_source Sep 06 '11 at 13:49

This answer was added before it was known that the OP needed a solution for C++...

Since using regex to parse HTML is technically wrong, I'll offer a better solution. You could use JavaScript to get the data and put it into a two-dimensional array. I use jQuery in the example.

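var data = [];
// One inner array per row of the table from the question.
$('#the_table tr').each(function(i, n){
    var $tr = $(n);
    data[i] = [];
    // Collect the text of every cell in this row.
    $tr.find('td').each(function(j, td){
        data[i].push($(td).text());
    });
});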

jsfiddle of the example: http://jsfiddle.net/gislikonrad/twzM7/

EDIT

If you want a plain JavaScript way of doing this (without jQuery), then this might suit you better:

var data = [];
// All rows of the table with id "the_table".
var rows = document.getElementById('the_table').getElementsByTagName('tr');
for(var i = 0; i < rows.length; i++){
    var d = rows[i].getElementsByTagName('td');
    data[i] = [];
    for(var j = 0; j < d.length; j++){
        // textContent is the standard DOM property; jQuery's .text() reads it too.
        data[i].push(d[j].textContent);
    }
}

Both snippets produce the data in the same form: a two-dimensional array of cell texts.

gislikonrad
  • I'm working in a Win32 C++ environment, but I'll try looking into something other than regular expressions for this. Thanks for the effort though! – the_source Sep 06 '11 at 10:52
  • Ah... ok... You hear "html" and you think "browser" :) – gislikonrad Sep 06 '11 at 10:59
  • +1 because... because... Ignorance is not an excuse! You should have read the mind of the OP! This +1 will teach you that! :-) :-) – xanatos Sep 06 '11 at 13:23
  • Haha, well it's a question about regular expressions in general. And adequate mind-reading skills seem a must at StackOverflow sometimes ;) – the_source Sep 06 '11 at 13:51