PHP: I need something like split(), but

Question

So I'm actually storing an HTML field, but I'd like to add some pseudo-tags to make publishing easier. That is, I want to wrap the titles/headers in this tag: << ... >>, e.g. << My Header >>. Then I would enumerate them, format them, and display the text beneath.

E.G.:

<<News>>
Breaking news on Sunday.
Have been taking hostages.
<<General Information>>
We would want to recieve our blabla.
And you want it.
<<User Suggestions>>
Yeah we want it so much...

Should actually display:

<H1 class="whatever" ID="Product_Header_1">News</H1>
Breaking news on Sunday.
Have been taking hostages.
<H1 class="whatever" ID="Product_Header_2">General Information</H1>
We would want to recieve our blabla.
And you want it.
<H1 class="whatever" ID="Product_Header_3">User Suggestions</H1>
Yeah we want it so much...

It should then also return an array of the actual headers and their numbers, so I can use it elsewhere on the page to make references.

So it seems we could either replace them directly, but that might get problematic with enumerating and returning the values, and would probably fail on unclosed tags.

Or we could split them into an array and then proceed manually, which seems like the better way to go.

This is what I tried so far:

$TEXT_A=preg_split('/<<([^>]+)>>/', $TEXT);

foreach($TEXT_A as $key => $val){
    if ($key>0) echo "<br>-!-";
    echo $val;
}

Where $TEXT is our HTML text with pseudo-tags.

The problem, though, is that preg_split does not return the regexp match itself, so I'm puzzled about how to extract it. Maybe I need to write a custom function that returns an array of texts AND headers instead of a regular split, but I don't know where to start...

Please help.

You're really describing a parser, not a string manipulation function. I would suggest going down that road. You could, however, split on `<<`, then loop through each array element and take the substring up to the position of `>>`, which gives you the title, and then take the substring starting after that position to get the text. – Jared Farrish, Jun 11 '12 at 22:11
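
The approach in this comment could be sketched roughly as follows. The function name split_sections() and the title/body array shape are made up for illustration, and unclosed tags are simply skipped, which is only one possible choice:

```php
<?php
// Hypothetical helper: the name and the return shape are assumptions,
// not something given in the thread.
function split_sections($text)
{
    $sections = array();
    $chunks = explode('<<', $text);
    array_shift($chunks); // whatever precedes the first "<<" is ignored

    foreach ($chunks as $chunk) {
        $end = strpos($chunk, '>>');
        if ($end === false) {
            continue; // unclosed tag: skipped in this sketch
        }
        $sections[] = array(
            'title' => trim(substr($chunk, 0, $end)),  // text up to ">>"
            'body'  => trim(substr($chunk, $end + 2)), // text after ">>"
        );
    }
    return $sections;
}

$sections = split_sections("<<News>>\nBreaking news on Sunday.\n<<General Information>>\nAnd you want it.");
```

Each element of $sections then carries one header and the text beneath it, so numbering them is just a matter of using the array index plus one.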

Walter Tross · Accepted Answer · 2012-06-11T22:32:54.263


Just use

$text_a = preg_split('/<<([^>]+)>>/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

You'll find your header tags at the odd indices of $text_a. Supposing you want to ignore what precedes the first header:

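$n = count($text_a);
$head_a = array(); // header texts, in order of appearance
$body_a = array(); // the text following each header
// Index 0 holds whatever precedes the first header, so start at 1.
for ($i = 1; $i < $n; $i += 2) {
    $head_a[] = $text_a[$i];
    $body_a[] = $text_a[$i + 1]; // wrap in trim() if the surrounding newlines are unwanted
}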
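
For completeness, here is a rough sketch of how the same split could produce both the formatted HTML and the numbered header array the question asks for. The function name render_headers() and its return shape are assumptions for illustration; the markup is the one from the question:

```php
<?php
// Hypothetical wrapper: the name and the [html, headers] return shape
// are assumptions, not part of the original answer.
function render_headers($text)
{
    $parts = preg_split('/<<([^>]+)>>/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $headers = array(); // header number => header text
    $html = '';
    $num = 0;
    $n = count($parts);
    for ($i = 1; $i < $n; $i += 2) {
        $num++;
        $title = trim($parts[$i]);
        $headers[$num] = $title;
        $html .= '<H1 class="whatever" ID="Product_Header_' . $num . '">' . $title . "</H1>\n"
               . trim($parts[$i + 1]) . "\n";
    }
    return array($html, $headers);
}

list($html, $headers) = render_headers("<<News>>\nBreaking news on Sunday.\n<<User Suggestions>>\nYeah we want it so much...");
```

$headers can then be reused elsewhere on the page to build references to the Product_Header_N IDs.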

edited Jun 11 '12 at 22:32

answered Jun 11 '12 at 21:58


Nice... it's working. Though, they are just all being stuffed into one array that way. Guess I would have to make pairings by checking $key for even and odd values. – Anonymous Jun 11 '12 at 22:02
Thanks, I tested this solution and it seems to provide the fastest result, yet remains compact. You win :-) – Anonymous Jun 11 '12 at 22:49
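
On the pairing question raised in the first comment: instead of checking keys for even and odd values, one option is to drop index 0 and chunk the rest into pairs. This is only a sketch; the sample text and variable names are made up:

```php
<?php
$text = "intro\n<<News>>\nBreaking news.\n<<General Information>>\nAnd you want it.";
$parts = preg_split('/<<([^>]+)>>/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
// $parts[0] is the text before the first header, odd indices are headers,
// and each following even index is the body that belongs to that header.

// Drop index 0, then group what is left into [header, body] pairs.
$pairs = array_chunk(array_slice($parts, 1), 2);

foreach ($pairs as $k => $pair) {
    echo ($k + 1) . ': ' . $pair[0] . ' => ' . trim($pair[1]) . "\n";
}
```

With a single capturing group and no PREG_SPLIT_NO_EMPTY flag, every header is followed by a body element (possibly an empty string), so each chunk has exactly two entries.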

nickb · Answer 2 · 2012-06-11T22:32:23.497

Here is a working solution using preg_replace_callback. It uses a non-greedy capturing group combined with a positive lookahead ((?=<<|$)) to capture the "body" text. The positive lookahead says "assert that either the opening delimiter << or the end of the string $ is present".

$count = 0;
$TEXT_A = preg_replace_callback( '/<<([^>]+)>>(.*?)(?=<<|$)/s', 
    function( $matches) use (&$count) {
        $count++;
        return '<H1 class="whatever" ID="Product_Header_' . $count . '">' . $matches[1] . '</H1>' . "\n" . trim( $matches[2]) . "\n\n"; 
}, $TEXT);
echo htmlentities( $TEXT_A);

I passed it through htmlentities to show the HTML generated, but you can of course remove that call to see the HTML get interpreted by the browser:

<H1 class="whatever" ID="Product_Header_1">News</H1>
Breaking news on Sunday.
Have been taking hostages.

<H1 class="whatever" ID="Product_Header_2">General Information</H1>
We would want to recieve our blabla.
And you want it.

<H1 class="whatever" ID="Product_Header_3">User Suggestions</H1>
Yeah we want it so much...


Edit:

Here is a solution without anonymous functions:

function do_replacement( $matches){
    static $count = 0;
    $count++;
    return '<H1 class="whatever" ID="Product_Header_' . $count . '">' . $matches[1] . '</H1>' . "\n" .
    trim( $matches[2]) . "\n\n";    
}

$TEXT_A = preg_replace_callback( '/<<([^>]+)>>(.*?)(?=<<|$)/s', 'do_replacement', $TEXT);
echo htmlentities( $TEXT_A);

Final edit

This edit includes a global array to capture the replacements.

$custom_array = array();
function do_replacement( $matches){
    global $custom_array;
    static $count = 0;
    $count++;
    $custom_array[$count] = $matches[1];
    return '<H1 class="whatever" ID="Product_Header_' . $count . '">' . $matches[1] . '</H1>' . "\n" .
    trim( $matches[2]) . "\n\n";    
}

$TEXT_A = preg_replace_callback( '/<<([^>]+)>>(.*?)(?=<<|$)/s', 'do_replacement', $TEXT);
echo htmlentities( $TEXT_A);

var_dump( $custom_array);
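
If PHP 5.3 or later is available, the global array can be avoided by passing an array into a closure by reference. A minimal sketch, where $headers and the sample text are illustrative names and data rather than part of the original answer:

```php
<?php
// Requires PHP 5.3+ (anonymous functions with "use").
$TEXT = "<<News>>\nBreaking news on Sunday.\n<<User Suggestions>>\nYeah we want it so much...";

$headers = array(); // filled by the callback: header number => header text
$TEXT_A = preg_replace_callback(
    '/<<([^>]+)>>(.*?)(?=<<|$)/s',
    function ($matches) use (&$headers) {
        $num = count($headers) + 1; // next header number, starting at 1
        $headers[$num] = $matches[1];
        return '<H1 class="whatever" ID="Product_Header_' . $num . '">' . $matches[1] . '</H1>' . "\n"
             . trim($matches[2]) . "\n\n";
    },
    $TEXT
);
```

Because $headers is captured by reference, it holds the numbered headers once preg_replace_callback returns, with no global or static state involved.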

It looks very promising and I like the regex used, but for some reason it returns: "Parse error: syntax error, unexpected T_FUNCTION" – Anonymous, Jun 11 '12 at 22:10
You need PHP >= 5.3 to use anonymous functions. I can rewrite it to not include anonymous functions, one sec. – nickb, Jun 11 '12 at 22:11
This is working nicely; now I would only need to extract the $count and $matches[1] into another array so I could make references to them... But I can't think of any good way with this new approach. Please advise ^.^ – Anonymous, Jun 11 '12 at 22:21
@user1125062 - If you were using PHP 5.3 or later you could use a closure. Since you're not, I used a global variable, and I'm not sure of any other way to get this done without a global array. Try the new code; it should work as expected. – nickb, Jun 11 '12 at 22:33
Thanks a lot, I learned a lot from this, and I believe it will benefit many other users too. It seems, though, that the answer provided by Walter Tross does the same thing in a more compact way, and is much faster performance-wise, even if I append the strings and trim accordingly. Thanks anyway, keep up the great job! – Anonymous, Jun 11 '12 at 22:48

score 1 · Answer 3 · edited May 23 '17 at 12:12

It sounds like you want to write documents using a markup format, but not HTML.

This is quite a common requirement, and there are a number of solutions for this that people have already come up with. It's fine if you want to also create your own markup format, but if you want to save a bit of time, you may want to consider one of the existing ones.

Off the top of my head, I can think of BBCode, Markdown and Wikicode.

Markdown is the format used in the questions/comments on this site.
BBCode is used in various guises in a lot of forum software and the like.
Wikicode is the markup code used by Wikipedia and other wiki sites.

Parsers are available for all of these in PHP, as well as other languages.

For example, there is a BBCode parser available in PHP's PECL Library -- see here: http://php.net/manual/en/book.bbcode.php. If you're able to install PECL libraries onto your server, you can get these BBCode parsing functions available in your PHP without having to include anything at runtime.

Other BBCode parsers also exist if you can't go the PECL route: try this one, for example: http://nbbc.sourceforge.net/

Wiki markup parsers: Which wiki markup parser does Wikipedia use?

Markdown parser: http://michelf.com/projects/php-markdown/

Hope that helps.

I agree, I think this is more about parsing than string manipulation. A custom parser may not be that difficult to write, but a known solution might be available already. I mean, this is what BBCode is for. — Jared Farrish, Jun 11 '12 at 22:15
The other point is security. Writing your own markup parser could leave you open to security holes. You would almost certainly be better off using a well-established existing solution that's already been properly tested. — Spudley, Jun 11 '12 at 22:26
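
One narrow illustration of the security point, assuming the header text could ever come from someone other than a trusted editor: the pattern /<<([^>]+)>>/ excludes ">" but still lets "<" and quotes through, so the captured header could be escaped before it goes into the H1. The helper name render_header() is made up, and this sketch does nothing about the body text, which the question treats as HTML:

```php
<?php
// Hypothetical helper: escapes only the header text; the name is an assumption.
function render_header($num, $title)
{
    $safe = htmlspecialchars(trim($title), ENT_QUOTES, 'UTF-8');
    return '<H1 class="whatever" ID="Product_Header_' . (int) $num . '">' . $safe . '</H1>';
}

// A header such as "<<<img src=x onerror=alert(1) >>" would be captured as the
// string below; after escaping it is displayed as text instead of parsed as a tag.
$out = render_header(1, '<img src=x onerror=alert(1) ');
```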

score 0 · Answer 4 · answered Jun 11 '12 at 22:05

Not a regex, but...:

$s = '<<News>>
Breaking news on Sunday.
Have been taking hostages.
<<General Information>>
We would want to recieve our blabla.
And you want it.
<<User Suggestions>>
Yeah we want it so much...';

$s = str_replace('>>', '</H1>', $s); // ">>" closes the header, so it becomes the closing tag
$i = 1;
while (strpos($s, '<<') !== false)
{
    $s = str_replace_one('<<', '<H1 class="whatever" ID="Product_Header_' . $i . '">', $s);
    $i++;
}

function str_replace_one($find, $replace, $subject) 
{
    return implode($replace, explode($find, $subject, 2));
}


echo $s;

score 0 · Answer 5 · answered Jun 11 '12 at 22:09

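Why not use preg_replace_callback instead?

$TEXT_A = preg_replace_callback('/<<([^>]+)>>/', function($match) {
    static $key = 0; // counts headers across callback invocations
    // Every header after the first gets the separator prefix; IDs are 1-based.
    $html = (($key > 0) ? '<br>-!-' : '') . '<H1 class="whatever" ID="Product_Header_'.($key + 1).'">'.$match[1].'</H1>';
    $key++;
    return $html;
}, $TEXT);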
