Regex match and replace advanced

Question

I'm trying to write a little WordPress plugin to support some migrated content.

The syntax highlighter expects (for proper highlighting):

<pre lang='something'>
  <code>
    The code...
  </code>
</pre>

However, my markdown code has the following:

<pre>
  <code>
    :::something
    The code...
  </code>
</pre>

I think you can see where this is going. What I want to achieve is this:

:::something should be removed, and the <pre> tag should be updated to <pre lang="something">.
If :::something does not exist, the <pre> tag should be <pre lang="plain">
There may be multiple occurrences per page that need to be updated.

What would a PHP function achieving the above look like?

function set_syntax_lang($content) {
  // Do stuff here
  return $new_content;
}

What I gathered so far is this regex:

/<pre.*>\s*<code>\s*:::(\w)/

This even yields me, using preg_match, the actual syntax indicator (something), but I don't know how to update the pre-tag correctly.

It's been a very long time since I coded PHP and regexes are not really my strong suit. So all help is appreciated.

If you need to remove the end `` then you can run into actual problems doing this with regex. You can try, but there is no guarantee that it won't break your output. Which languages are inside the code-block? — hakre, Jun 15 '11 at 23:09
Okay not with that coffee-cup-swap-out scripts. Probably you should check the [SyntaxHighlighter Evolved](http://wordpress.org/extend/plugins/syntaxhighlighter/) Plugin. — hakre, Jun 15 '11 at 23:24

dynamic · Accepted Answer · 2011-06-15T23:22:44.153

1

Finding :::something

preg_replace( '/<pre(.*>\s*<code>\s*):::(\w+)/U', '<pre lang="$2"$1' , $html );

This is an edge-case. But normally I should advise you to NOT use regex for html (bobince someone?).

Also, next time try to be less verbose in your question. It took me more time to read it than to write this answer.

Finding code without :::something

preg_replace( '/<pre(.*>\s*<code>\s*)(?!:::\w+)/U', '<pre lang="plain"$1' , $html );
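
One thing to watch with the two patterns above: a negative lookahead placed after `\s*` can still succeed when whitespace separates `<code>` from `:::something`, because the engine is free to stop `\s*` before the marker. A minimal sketch of a single-pass alternative follows, assuming the migrated `<pre>` tags do not carry a `lang` attribute yet. The function name comes from the question; the pattern is an illustration, not the answerer's code.

```php
<?php
// Sketch only: handle both cases in one pass so the patterns cannot interfere.
// Assumption: no <pre> in the input already has a lang attribute.
function set_syntax_lang($content) {
    return preg_replace_callback(
        // 1: existing attributes on <pre>
        // 2: whitespace, <code>, whitespace
        // 3: language name after ":::" (the whole marker is optional)
        '/<pre\b([^>]*)>(\s*<code>\s*)(?::::(\w+))?/',
        function ($m) {
            // Group 3 is absent from $m when the marker did not match.
            $lang = $m[3] ?? 'plain';
            // The marker is dropped because it is not part of the replacement.
            return '<pre lang="' . $lang . '"' . $m[1] . '>' . $m[2];
        },
        $content
    );
}
```

In a WordPress plugin this would presumably be registered with `add_filter('the_content', 'set_syntax_lang')`; whether it has to run before the highlighter's own filter is something to check against the plugin in use.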

Fixing `<code>`

preg_replace( array( '/(<pre.*>)\s*<code>/U' , '/<\/code>\s*(<\/pre>)/U' ),
              '$1' , $html );
// Completely untested
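
As a rough check of the idea above, here is a small sketch that strips only the `<code>` wrapper sitting directly inside `<pre>` and leaves any `<code>` tags within the code itself alone. The function name `strip_code_wrappers` is made up for this example, and `[^>]*` is used instead of `.*` so the match cannot run past the end of the `<pre>` tag.

```php
<?php
// Sketch only: remove <code> right after <pre ...> and </code> right before </pre>.
// strip_code_wrappers is a hypothetical name, not part of any plugin.
function strip_code_wrappers($html) {
    return preg_replace(
        array(
            '/(<pre\b[^>]*>)\s*<code>/',  // opening wrapper, keep the <pre ...> tag
            '/<\/code>\s*(<\/pre>)/',     // closing wrapper, keep the </pre> tag
        ),
        '$1',
        $html
    );
}

$sample = "<pre lang=\"plain\">\n  <code>a <code>b</code> c</code>\n</pre>";
echo strip_code_wrappers($sample), "\n";
// prints: <pre lang="plain">a <code>b</code> c</pre>
```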

edited Jun 15 '11 at 23:22

answered Jun 15 '11 at 22:56

dynamic


untested. tell me if there are problems – dynamic Jun 15 '11 at 22:58
(\w) should be (\w+), then it works. But now .. is shown as the actual code. Seems I need to remove those tags as well. Still not working for a missing `:::something` though. Looking into that now. – Ariejan Jun 15 '11 at 23:06
what do you mean by actual code? Also yes i totally forgot about that missing :::something give me a sec – dynamic Jun 15 '11 at 23:06
@Ariejan: added the second regex you need for missing :::something – dynamic Jun 15 '11 at 23:11
@yes123 that works nicely. I'm only left now with `` and `` in my actual code. (wp-syntax does not pick up the nested pre > code. I'll have to remove those myself, but I think I can manage that one. – Ariejan Jun 15 '11 at 23:15
I still don't understand what you need to do with your `` Also if this helped you please think to +1 – dynamic Jun 15 '11 at 23:16
+1'd and considering this to accept as an answer. Here's the result I mean: http://staging.ariejan.net/2011/06/10/vows-and-coffeescript/ – Ariejan Jun 15 '11 at 23:17
Ah ok so you just simply need to remove `` tag? Is this all you need? – dynamic Jun 15 '11 at 23:18
Yup, but without damaging actual `` tags in the code. The tags to be removed must be directly after and directly before the opening and closing `pre` tags. – Ariejan Jun 15 '11 at 23:19
I should get more than a merely +1 here :( – dynamic Jun 15 '11 at 23:24
@Ariejan - what about using a plugin for that with shortcodes? – hakre Jun 15 '11 at 23:25
@yes123: I find it odd that you frequently overtext people with [the stupid link (mostly wrong, btw)](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags), yet handicraft a regex when it would actually have some relevancy for once. :> – mario Jun 15 '11 at 23:54
@mario: I don't get if you are serious or not. If you think that link is stupid why are you posting it again here? Also did you read in my answer? I said `NOT use regex for html (bobince someone?)`. With **bobince** I was myself referring to that link. So no need to do a paternal here. – dynamic Jun 16 '11 at 09:31
"You shouldn't do it - and here is the code on how to do it" always reads a bit amusing. But it very much invalidates the advice. And that specific link btw is widely [considered](http://meta.stackexchange.com/questions/73133/regex-and-html-the-long-tail-annoys-me/73168#73168) spam on stackoverflow. Will get flagged, and that's why it should be avoided usually. -- However, if you were to read the title once - it's about _nested tags_. And in this case OP asked about nested tags. So here it's the exception, and it would actually be a bit relevant (not the jokes, the technical answers below it.) – mario Jun 16 '11 at 10:00
@mario: today do you have something against me or you missed the other regex answers? I can understand why. `Also if you were to read the title once - it's about nested tags` I don't know what questions you are referring to but this question with this title isn't at all about nested tags. Now can we stop the ot? – dynamic Jun 16 '11 at 10:02
Nope. I was mostly expressing my amusement, and took the opportunity to post a note about the retarded link. – mario Jun 16 '11 at 10:08
how can you say that's retarded while it's the most upvoted answer in the history? That's a true piece of poetry imo. – dynamic Jun 16 '11 at 10:25
Can't get the WordPress filter working properly, too much hassle. Writing my own blog software using ruby ;-) @yes123 thanks for your help. – Ariejan Jun 16 '11 at 11:35
Hi guys, please don't use the commenting system as a chat room. It is for leaving a few comments and prods for more information to a question or answer, not for long debates. The reason behind this is that most of the time (and this is one of them), a lot if not all the comments belong as edits to the question/answer to make that more complete. If I have to read a half-page answer + 3 pages of comments, the focus on the comments is too big. Please edit in pertinent details into the answer instead. If you really need to chat, find/create a chat-room on the Chat site, link at the top of the page – Lasse V. Karlsen Jun 16 '11 at 11:47

score 1 · Answer 2 · answered Jun 15 '11 at 22:57


You answered most of your question in the steps you gave. Break it down into those chunks -- FIRST see if you have :::something, THEN update your <pre> tag and REPEAT.

You'll have a much easier time of it if you use the DOM instead of regex. It will make the job of navigating through the <pre> and <code> tags very simple. As has been said many, many times here, html is not a regular language, so a regular expression cannot parse it correctly. Even for a limited subset of HTML, it's really not the right tool. The regex for :::something is trivial once you use the DOM to get the text between <code> and </code>: /:::(\w+)/
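
A minimal sketch of the DOM route, assuming PHP's bundled DOM extension is available and that the `:::something` marker is the first text inside `<code>`. The function name `set_syntax_lang_dom` and the wrapper `<div>` are choices made for this example only. Note that libxml may normalize the markup slightly when it serializes it back.

```php
<?php
// Sketch only: set lang on every <pre> using DOMDocument instead of one big regex.
// set_syntax_lang_dom is a hypothetical name; the wrapper <div> exists only so the
// fragment can be serialized back without <html>/<body> around it.
function set_syntax_lang_dom($content) {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // content fragments are rarely valid documents
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $content . '</div>');
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('pre') as $pre) {
        $lang = 'plain';
        $code = $pre->getElementsByTagName('code')->item(0);
        $first = $code ? $code->firstChild : null;
        // The trivial regex from the answer, applied to the text node only.
        if ($first instanceof DOMText && preg_match('/^\s*:::(\w+)/', $first->data, $m)) {
            $lang = $m[1];
            $first->data = preg_replace('/^(\s*):::\w+/', '$1', $first->data, 1);
        }
        $pre->setAttribute('lang', $lang);
    }

    // Serialize the children of the wrapper <div> (the first div in the document).
    $wrap = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```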


Greg Jackson


I don't mean to come across as unwilling to answer what you asked -- I know how frustrating that can be. However, doing what you want is pretty much impossible in one regex (if you include things like having a default value), even smaller parts of it get ugly and unreadable quickly. Complex regexes are prone to typos and errors. I had to learn this the hard way... – Greg Jackson Jun 15 '11 at 23:12
My voting limit has reached, I think you make a valid point here about the limitation of regex'es for the purpose as well as for the hints for the regex formulation itself. So literally +1 and I keep the tab open in the browser. – hakre Jun 15 '11 at 23:20

score 1 · Answer 3 · answered Jun 15 '11 at 23:05


First of all some points I ran over:

/<pre.*>\s*<code>\s*:::(\w)/
     ^

According to your question, there never is a space in there if you make use of :::something. But you add it into your regex. I wonder why.

/<pre.*>\s*<code>\s*:::(\w)/
                         ^

If the language specifier is larger than one character (which I assume) you must write that into the regex, like \w+ for one or more letters.

The rest looks like you already have everything, except perhaps the replacement:

$result = preg_replace( '((<pre)(>\s*<code>\s*):::(\w+))', '$1 lang="$3"$2' , $subject );

Hopefully this helps.
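
For illustration, this is how that call behaves on the sample from the question. The pattern uses parentheses as delimiters, so the three inner groups are still numbered 1 to 3. The sample input string is made up for the example, and blocks without `:::something` are left untouched by this pattern.

```php
<?php
// Sample input shaped like the markup in the question (language name is an example).
$subject = "<pre>\n  <code>\n    :::ruby\n    puts 1\n  </code>\n</pre>";

// $1 = "<pre", $2 = ">" plus whitespace and <code>, $3 = the language name.
$result = preg_replace('((<pre)(>\s*<code>\s*):::(\w+))', '$1 lang="$3"$2', $subject);

echo $result, "\n";
// The first line of output is: <pre lang="ruby">
// The ":::ruby" marker is gone; the whitespace around it stays in place.
```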


hakre


was your preg_replace syntax a simple copy/paste of my first answer? – dynamic Jun 15 '11 at 23:08
no, otherwise it would have had the mistakes you made in there. so should have been easy for you to see already as you corrected them based on my posting (at least obviously for the part you actually understood) - just kidding yes123 :D – hakre Jun 15 '11 at 23:13
actually the .* is necessary because the original html could have some other attr there (``). For this reason i left that there – dynamic Jun 15 '11 at 23:25
well your regex actually matches tags like `` – hakre Jun 15 '11 at 23:27
yes I can fix it but there are no reasons since `` – dynamic Jun 15 '11 at 23:28
