
I need help building a regular expression that separates text from tables. I have some text like this:

text text text
text text text
<div> text text text </div>
<table class="table1">
<tr>
<td>
</td>
</tr>
</table>
text text text
text text text
text text text
<table class="table2">
<tr>
<td>
</td>
</tr>
</table>
text text text
text text text
text text text

I need a regular expression that separates the text from the tables. This is the expression I have now:

preg_match_all( "/(.*)(<table(?s).*?\/table>)(.*)/si", $value[ 'TEXT' ], $matches );

This expression works fine for text like

text text text
text text text
<div> text text text </div>
<table class="table1">
<tr>
<td>
</td>
</tr>
</table>

It splits the input into

text text text
text text text
<div> text text text </div>

and

    <table class="table1">
    <tr>
    <td>
    </td>
    </tr>
    </table>

But for the text

text text text
text text text
<div> text text text </div>
<table class="table1">
<tr>
<td>
</td>
</tr>
</table>
text text text
text text text
text text text
<table class="table2">
<tr>
<td>
</td>
</tr>
</table>
text text text
text text text
text text text

my regular expression does not work. It returns an array with

[0] =>"text text text
    text text text
    <div> text text text </div>
    <table class="table1">
    <tr>
    <td>
    </td>
    </tr>
    </table>
    text text text
    text text text
    text text text",
[1]=>"<table class="table2">
    <tr>
    <td>
    </td>
    </tr>
    </table>",
[2]=>"text text text
    text text text
    text text text"

How do I build the right regular expression?

Arthur
  • The [obligatory admonition](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Kerrek SB Sep 10 '12 at 08:23

3 Answers


It should be something along these lines:

$doc = new DOMDocument;
$doc->loadHTML('html string');

// getElementsByTagName() returns a live list, so copy it into an array
// before removing nodes; otherwise some tables can be skipped.
$tables = iterator_to_array($doc->getElementsByTagName('table'));
foreach ($tables as $table) {
    $table->parentNode->removeChild($table);
}

$doc->normalizeDocument();

// Collect the content of every remaining text node.
$text = array();
$xpath = new DOMXPath($doc);
$textnodes = $xpath->evaluate('//text()');
foreach ($textnodes as $textnode) {
    $text[] = $textnode->wholeText;
}
print_r($text);

This code loads your HTML, finds and removes the tables, then finds all the text nodes and fills an array with their content. You should read more about PHP DOM to fine-tune it to your needs.

Ties
  • OK, it removes the tables. But I need the text pieces in their original order so that I can wrap each piece in a div, i.e. `
    piece1
    ***
    piece2
    ***
    piece3
    `
    – Arthur Sep 10 '12 at 08:44
  • Updated the code. If it still doesn't work, you should google for `xpath` and `php dom` tutorials; they should help you out. If that doesn't work either, ask a new question about your new code. – Ties Sep 10 '12 at 09:10

Get rid of the (.*) at the beginning and end of your regex. The only time you have to "pad" a regex like that is when you're using something like Java's matches() method that automatically anchors the match at both ends.

What's happening here is that the first (.*) initially gobbles up the whole document, then backs off just far enough to let the next part (<table etc.) match one table element. Then the second (.*) consumes whatever is left. That explains why preg_match_all() only captures one table element, and why it's always the last one.
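
The behaviour described above can be reproduced with a small script. This is only a sketch: the `$html` sample is a made-up, shortened stand-in for the text in the question, and the variable names are illustrative.

```php
<?php
// Hypothetical sample input: two small tables separated by text.
$html = "intro\n<table class=\"table1\"><tr><td></td></tr></table>\n"
      . "middle\n<table class=\"table2\"><tr><td></td></tr></table>\nend";

// The pattern from the question, padded with a greedy (.*) on both sides.
preg_match_all('/(.*)(<table(?s).*?\/table>)(.*)/si', $html, $matches);

// The whole input is consumed by a single match.
$matchCount = count($matches[0]);

// Group 1 swallowed everything up to the last table, including the first table.
$leadingText = $matches[1][0];

// Group 2 holds only the last table.
$capturedTable = $matches[2][0];

echo $matchCount, "\n";
echo $capturedTable, "\n";
```

Because the first `(.*)` is greedy, the first table ends up inside group 1 instead of being reported as a table of its own.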

You can get rid of the (?s) as well. It's not really hurting anything, but all it does is turn on single-line mode, and you've already done that with the s modifier at the end. You probably meant to match a whitespace character (which would be \s), but that would prevent it from matching <table> (i.e. a table tag with no attributes). You should use \b (a word boundary) instead:

preg_match_all( '~<table\b.*?/table>~si', $value[ 'TEXT' ], $matches );

But be aware that this approach will only work on extremely simple HTML. There are many, many things that can defeat it even in perfectly valid HTML (nested table tags being the most obvious example).
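
To make both points concrete, here is a rough sketch that runs the pattern above on two made-up inputs: one with flat tables and one with a nested table. The sample strings and variable names are assumptions for illustration only.

```php
<?php
$pattern = '~<table\b.*?/table>~si';

// Flat tables: each table is matched separately.
$flat = "a<table class=\"table1\"><tr><td></td></tr></table>"
      . "b<table class=\"table2\"><tr><td></td></tr></table>c";
$flatCount = preg_match_all($pattern, $flat, $flatMatches);

// Nested tables: the lazy quantifier stops at the first closing tag,
// which belongs to the inner table, so the outer table is cut short.
$nested = "<table id=\"outer\"><tr><td><table id=\"inner\"></table></td></tr></table>";
$nestedCount = preg_match_all($pattern, $nested, $nestedMatches);
$firstNested = $nestedMatches[0][0];

echo $flatCount, "\n";
echo $firstNested, "\n";
```

In the nested case the match ends at the inner `</table>`, leaving the outer table's closing tags behind as if they were ordinary text.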

Alan Moore

The solution that worked best for me is this code:

// Replace every table with a marker, then split the text on that marker.
$test = preg_replace( "/<table.*?\/table>/si", '<BREAKHERE>', $value[ 'TEXT' ] );
$texts = explode( '<BREAKHERE>', $test );

foreach ( $texts as $valueTEXT )
{
    // Wrap only the pieces that contain at least one word character.
    // Note: str_replace() replaces every occurrence, so identical pieces
    // can end up wrapped more than once.
    if ( preg_match( "/\w/", $valueTEXT ) )
    {
        $value[ 'TEXT' ] = str_replace( $valueTEXT, ' <div class="panel-container">' . $valueTEXT . '</div>', $value[ 'TEXT' ] );
    }
}
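
A possible alternative, sketched under the assumption that the tables are flat (not nested), is to split with `preg_split()` and `PREG_SPLIT_DELIM_CAPTURE` so the tables stay in the result and the output is rebuilt in the original order. The function name `wrapTextPieces` and the sample input are hypothetical; this is not the code from the answer above.

```php
<?php
// Hypothetical helper: wraps each non-table piece that contains text in a div.
function wrapTextPieces(string $html): string
{
    // PREG_SPLIT_DELIM_CAPTURE keeps the matched tables in the result,
    // so text pieces and tables come back in their original order.
    $parts = preg_split('~(<table\b.*?</table>)~si', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part) {
        // Odd indexes are the captured tables; even indexes are the text between them.
        if ($i % 2 === 1 || !preg_match('/\w/', $part)) {
            $out .= $part;
        } else {
            $out .= '<div class="panel-container">' . $part . '</div>';
        }
    }
    return $out;
}

// Two identical text pieces on purpose: each one is wrapped exactly once.
$result = wrapTextPieces("one\n<table><tr><td></td></tr></table>\none\n<table></table>\n");
echo $result;
```

Since the output is concatenated piece by piece instead of being patched with `str_replace()`, repeated text pieces are not wrapped twice.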
Arthur