PHP (preg) Regular Expression For Content Indexing/Update

Question

I have the following code:

/* record 863.content.en */
UPDATE language_def
SET en='<html>blah blah markup</html>'
WHERE page_id=863
AND string_id='content';
/* record_end 863.content.en */

I would like to create an expression to match that statement where:

the data in between the periods of 863.content.en are variable BUT SPECIFIC (there will be many of these statements in a row)
the data in between the two comments is variable but NOT specific

This is what I have so far:

'[/*]\s*record\s*specific_number[.]specific_string1[.]specific_string2\s*[*/].*[/*]\s*record_end\s*specific_number[.]specific_string1[.]specific_string2\s*[*/]'

So you need the HTML tags? Or just what's in between? I assume the *specific_* are just placeholders? — Jason McCreary, Nov 20 '12 at 22:10
I need to match each specific section of /*stuff in comment*/ content /*end comment*/ based on whatever values I pass to the replacement function I'll write using preg_replace() — Miles Smith, Nov 20 '12 at 22:12
For `PREG_*` functions, you need a delimiter. Try using '#' at the beginning and end of your string. — FrankieTheKneeMan, Nov 20 '12 at 22:12
So, you want to extract from SQL statements? See [SQL parser in PHP?](http://stackoverflow.com/questions/8970499/sql-parser-in-php) -- Or for regex help: [Open source RegexBuddy alternatives](http://stackoverflow.com/questions/89718/is-there) and [Online regex testing](http://stackoverflow.com/questions/32282/regex-testing) for some helpful tools, or [RegExp.info](http://regular-expressions.info/) for a nicer tutorial. — mario, Nov 20 '12 at 22:20
The reason I need the expression is because this file is being dynamically edited by a PHP CMS. It is used to track updates between large site redesigns. — Miles Smith, Nov 20 '12 at 22:22
@MilesSmith why did you remove your own attempt? That was a very valuable part of your question. — Martin Ender, Nov 20 '12 at 22:23

Martin Ender · Accepted Answer · 2012-11-20T22:28:37.287

There are a few problems with your regex.

First of all, as FrankieTheKneeMan pointed out, you need delimiters. # is a good choice for HTML matches (the standard choice is /, but that collides with closing tags too often):

'#[/*]\s*record\s*specific_number[.]specific_string1[.]specific_string2\s*[*/].*[/*]\s*record_end\s*specific_number[.]specific_string1[.]specific_string2\s*[*/]#'
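To see the delimiter requirement in action, here is a minimal sketch. The short pattern and the subject string are made up for illustration and are not the full expression from the question.

```php
<?php
$subject = '/* record 863.content.en */';

// No delimiters: PHP reads the first character ("r") as the delimiter,
// which is not allowed. preg_match() emits a warning (suppressed here
// with @) and returns false.
$withoutDelimiters = @preg_match('record\s*863', $subject);

// Same pattern wrapped in # delimiters: it compiles and matches.
$withDelimiters = preg_match('#record\s*863#', $subject);

var_dump($withoutDelimiters); // bool(false)
var_dump($withDelimiters);    // int(1)
```

Checking for a strict `false` return value is a cheap way to catch a pattern that failed to compile.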

Now, while [.] is a nice way of escaping a single character, [/*] does not do the same for the two-character sequence /*. It is a character class that matches one character, either / or *. The same goes for [*/]. Use this instead:

'#/[*]\s*record\s*specific_number[.]specific_string1[.]specific_string2\s*[*]/.*/[*]\s*record_end\s*specific_number[.]specific_string1[.]specific_string2\s*[*]/#'
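The difference between the character class and the literal sequence can be checked directly. This is only a small sketch; the anchors ^ and $ are added here so that each pattern has to match the whole test string.

```php
<?php
$charClass = '#^[/*]$#'; // exactly one character: "/" or "*"
$sequence  = '#^/[*]$#'; // exactly the two characters "/*"

$results = [
    'class vs "/"'     => preg_match($charClass, '/'),
    'class vs "/*"'    => preg_match($charClass, '/*'),
    'sequence vs "/*"' => preg_match($sequence, '/*'),
    'sequence vs "/"'  => preg_match($sequence, '/'),
];

print_r($results); // 1, 0, 1, 0 in that order
```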

Now .* is the remaining problem. Actually there are two problems: one is critical, the other might not be. The first is that . does not match line breaks by default. You can change this with the s (singleline) modifier. The second is that * is greedy. Should a section appear twice in the string, you would get everything from the first corresponding /* record to the last corresponding /* record_end, even if there is unrelated stuff in between. Since your records seem to be very specific, I suppose this is not the case. Still, it is generally good practice to make the quantifier ungreedy, so that it consumes as little as possible. Here is your final regex string:

'#/[*]\s*record\s*specific_number[.]specific_string1[.]specific_string2\s*[*]/.*?/[*]\s*record_end\s*specific_number[.]specific_string1[.]specific_string2\s*[*]/#s'

For your presented example, this is

'#/[*]\s*record\s*863[.]content[.]en\s*[*]/.*?/[*]\s*record_end\s*863[.]content[.]en\s*[*]/#s'
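Since the goal is a replacement function, here is one way this pattern could be wired into one. This is a sketch under assumptions: `replaceRecordBody` is a hypothetical helper name, the two capture groups around the comments are my addition so the markers survive the replacement, and the sample file content is invented.

```php
<?php
/**
 * Hypothetical helper: replace everything between the two comment
 * markers of one record (e.g. "863.content.en") with $newBody.
 */
function replaceRecordBody(string $file, string $id, string $newBody): string
{
    // preg_quote() escapes the dots in the id (and "#", the delimiter).
    $q = preg_quote($id, '#');
    $pattern = '#(/[*]\s*record\s*' . $q . '\s*[*]/).*?'
             . '(/[*]\s*record_end\s*' . $q . '\s*[*]/)#s';

    // A callback keeps "$1" or "\1" inside $newBody from being read as
    // backreferences. On a PCRE error the input is returned unchanged.
    return preg_replace_callback(
        $pattern,
        fn (array $m): string => $m[1] . "\n" . $newBody . "\n" . $m[2],
        $file
    ) ?? $file;
}

$file = <<<'SQL'
/* record 863.content.en */
UPDATE language_def
SET en='<html>old</html>'
WHERE page_id=863
AND string_id='content';
/* record_end 863.content.en */
/* record 864.content.en */
UPDATE language_def
SET en='<html>other</html>'
WHERE page_id=864
AND string_id='content';
/* record_end 864.content.en */
SQL;

$newBody = "UPDATE language_def\nSET en='<html>new</html>'\n"
         . "WHERE page_id=863\nAND string_id='content';";

$updated = replaceRecordBody($file, '863.content.en', $newBody);
echo $updated, "\n";
```

Only the 863 section changes; the 864 section is left alone because the id is part of both comment patterns.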

If you want to find all of these sections, then you can make 863, content and en variable, capture them (using parentheses) and use a backreference to make sure you get the corresponding record_end:

'#/[*]\s*record\s*(\d+)[.](\w+)[.](\w+)\s*[*]/.*?/[*]\s*record_end\s*\1[.]\2[.]\3\s*[*]/#s'
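As a rough illustration of how the backreference version could be used with preg_match_all, the sketch below lists every section it finds. The two-record sample input is invented for the example.

```php
<?php
$pattern = '#/[*]\s*record\s*(\d+)[.](\w+)[.](\w+)\s*[*]/.*?'
         . '/[*]\s*record_end\s*\1[.]\2[.]\3\s*[*]/#s';

$file = <<<'SQL'
/* record 863.content.en */
UPDATE language_def SET en='<html>a</html>' WHERE page_id=863 AND string_id='content';
/* record_end 863.content.en */
/* record 864.title.de */
UPDATE language_def SET de='<html>b</html>' WHERE page_id=864 AND string_id='title';
/* record_end 864.title.de */
SQL;

// PREG_SET_ORDER groups the captures per match: $m[0] is the whole
// section, $m[1]..$m[3] are the number and the two strings.
$count = preg_match_all($pattern, $file, $matches, PREG_SET_ORDER);

$ids = [];
foreach ($matches as $m) {
    $ids[] = [$m[1], $m[2], $m[3]];
}

print_r($ids);
```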

Thanks, man. This was a huge help I hadn't written any expressions in a while and I guess my fingers were also a little cold in this winter weather — Miles Smith, Nov 21 '12 at 08:17

score 0 · Answer 2 · answered Nov 20 '12 at 22:23

'#/\* record (\S+) \*/.*<html>(.*)</html>.*/\* record_end \1 \*/#is'

This regular expression will split your string up into individual records, as seen here. Feel free to replace any spaces with \s*; I left it this way for readability. \S+ matches one or more non-whitespace characters, but you can replace it with your specific strings if you like. Otherwise, you can loop over the matches returned by preg_match_all and use the first capture group to get the specific record, and the second capture group to get the content between the html tags. The #s are the delimiters PHP needs around the regular expression; i makes the match case-insensitive and s makes the . match newlines.
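A short sketch of that loop, assuming each record id appears only once in the file. The sample input is invented, and `$htmlById` is just an illustrative variable name.

```php
<?php
$pattern = '#/\* record (\S+) \*/.*<html>(.*)</html>.*/\* record_end \1 \*/#is';

$file = <<<'SQL'
/* record 863.content.en */
UPDATE language_def
SET en='<html>blah blah markup</html>'
WHERE page_id=863
AND string_id='content';
/* record_end 863.content.en */
/* record 864.content.en */
UPDATE language_def
SET en='<html>more markup</html>'
WHERE page_id=864
AND string_id='content';
/* record_end 864.content.en */
SQL;

preg_match_all($pattern, $file, $matches, PREG_SET_ORDER);

// Map each record id (capture 1) to the text between the html tags (capture 2).
$htmlById = [];
foreach ($matches as $m) {
    $htmlById[$m[1]] = $m[2];
}

print_r($htmlById);
```

The .* parts are greedy here, so the backreference \1 in the closing comment is what keeps each match from running into the next record.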

Thanks so much for helping me out. This bug showed up at the WRONG TIME today (rearranging the office?!). I've never heard of RegExr before, either, so that should be useful. — Miles Smith, Nov 21 '12 at 08:21
