
Hi, my question is simple:

I want to match all hashtags in an article, but only those inside a <figcaption>, using a PCRE regex. E.g.:

<figcaption>blah blah #hashtag1, #hashtag2</figcaption>

I made an attempt here https://regex101.com/r/aL9vS8/1; removing the last ? changes the capture from #hashtag1 to #hashtag2, but I can't get both.

I am not even sure it is doable in one single regex in PHP.

Any idea to help me? :)

If there is no way in one single regex (really? even with recursion (?R)?? :p), please suggest the most efficient way possible, performance-wise.

Thank you!

[EDIT]

If there is no way, my next idea in PHP is to:

  1. Match every figcaption with preg_replace_callback
  2. In the callback match every instance of #hashtag.

Can I get your opinions on this? Is there a better way? My articles are not very long.
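The two-step idea above could be sketched roughly as follows. This is only an illustration under some assumptions: captions are not nested, each one has a closing tag, and the goal is to wrap hashtags in a span. The function name `highlightCaptionHashtags` and the `highlight` class are hypothetical.

```php
<?php
// Hypothetical helper (name and "highlight" class are assumptions):
// wraps every #hashtag inside a <figcaption> element in a <span>.
function highlightCaptionHashtags(string $html): string
{
    // Step 1: match each <figcaption ...>content</figcaption>
    // (non-greedy; the s flag lets . match newlines).
    return preg_replace_callback(
        '~(<figcaption\b[^>]*>)(.*?)(</figcaption>)~s',
        function (array $m): string {
            // Step 2: only inside the caption content, wrap # followed by word chars.
            $inner = preg_replace('~#\w+~', '<span class="highlight">$0</span>', $m[2]);
            return $m[1] . $inner . $m[3];
        },
        $html
    );
}

$html = '<p>#outside</p><figcaption>blah #hashtag1, #hashtag2</figcaption>';
echo highlightCaptionHashtags($html), "\n";
```

The hashtag in the `<p>` element is left alone because only the text between the caption tags is passed to the inner `preg_replace`.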

Wiktor Stribiżew
antoni
  • figcaption is a html tag. You can use JS to get the text in figcaption, then start the search in finding the hashtags using regex. – rmondesilva Jul 20 '16 at 07:40
  • Possible duplicate of [How to capture an arbitrary number of groups in JavaScript Regexp?](http://stackoverflow.com/questions/3537878/how-to-capture-an-arbitrary-number-of-groups-in-javascript-regexp) – Thomas Ayoub Jul 20 '16 at 07:42
  • The point here is that there is no need to match "arbitrary number of groups", this question is not a dupe of the above. Actually, JS tag should be removed, the attempt shared was a PCRE regex. – Wiktor Stribiżew Jul 20 '16 at 07:55
  • Again, this has not been related to JS, I removed any mentioning of JS in the question. – Wiktor Stribiżew Jul 20 '16 at 09:00

1 Answer


> Please suggest the most efficient way possible performance wise

The most reliable way to match text between delimiters with a PCRE regex is to use custom boundaries with the \G anchor. However, the trailing boundary here is a multicharacter string, so matching any text other than </figcaption> requires a tempered greedy token. Since this token is very resource-consuming, it must be unrolled.

Here is a fast, reliable PCRE regex for your task:

(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+

See the regex demo

Details:

  • (?:<figcaption|(?!^)\G) - Matches <figcaption or the end of the previous successful match
    More details:
    (?:<figcaption|(?!^)\G) is a non-capturing group ((?:...)) that only groups subpatterns without keeping track of what it matched (no value is stored for the group). It matches one of 2 alternatives (| is the alternation operator): 1) the literal text <figcaption, or 2) (?!^)\G, the position right after the previous successful match. Note that \G also matches at the start of the string, so the negative lookahead (?!^) is added to exclude that case.
  • [^<#]* - 0+ chars other than < and #
  • (?:(?:<(?!\/figcaption>)|#\B)[^<#]*)* - 0+ sequences of:
    • (?:<(?!\/figcaption>)|#\B) - a < not followed by /figcaption>, or a # not followed by a word char
    • [^<#]* - 0+ chars other than < and #
  • \K - omit the text matched so far
  • #\w+ - # and 1+ word chars
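To see why `(?!^)` is paired with `\G`, here is a small sketch using a toy pattern (not the one from this answer): digits should match only after an `@` marker or directly after a previous match. The sample string and variable names are made up for illustration.

```php
<?php
// Toy input: only the digits after "@" are wanted.
$str = '12@34';

// Without the guard, \G also matches at offset 0, so the leading digits leak in.
preg_match_all('~(?:@|\G)\d~', $str, $m);
$withoutGuard = $m[0];

// With (?!^), \G can only continue from the end of a previous match.
preg_match_all('~(?:@|(?!^)\G)\d~', $str, $m);
$withGuard = $m[0];

print_r($withoutGuard); // 1, 2, @3, 4
print_r($withGuard);    // @3, 4
```

In the full pattern, `<figcaption` plays the role of `@`, and `\K` additionally trims the marker from the reported match.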

Even more details:

The escape sequence \K causes any previously matched characters not to be included in the final matched sequence. For example, the pattern:

foo\Kbar

matches foobar, but reports that it has matched bar. This feature is similar to a lookbehind assertion.
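A quick way to check this behavior in PHP (a minimal sketch; the sample strings are arbitrary):

```php
<?php
// "foo" must be present for the pattern to match, but \K drops it from the result.
preg_match('~foo\Kbar~', 'xfoobar', $m, PREG_OFFSET_CAPTURE);
$matched = $m[0][0]; // the reported match text
$offset  = $m[0][1]; // where the reported match starts

// In a replacement, only the part after \K is replaced.
$replaced = preg_replace('~foo\Kbar~', 'BAZ', 'foobar');

echo $matched, ' at offset ', $offset, "\n"; // bar at offset 4
echo $replaced, "\n";                        // fooBAZ
```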

  • (?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*: Here, the outer non-capturing group (?:...)* lets a sequence of subpatterns repeat zero or more times (a quantifier such as * can apply to a whole sequence only if that sequence is grouped). The inner non-capturing group (?:<(?!\/figcaption>)|#\B) is just a shorter way to write <(?!\/figcaption>)[^<#]*|#\B[^<#]*: it groups the 2 alternatives <(?!\/figcaption>) and #\B in front of their common "suffix" [^<#]*.
  • Wrapping in a tag: just use preg_replace with the <span class="highlight">$0</span> replacement pattern:

Code:

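// Match each #hashtag that follows an opening <figcaption (see the details above).
$re = '~(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+~';
$str = '<figcaption>blah # blah #hashtag1, #hashtag2</figcaption> #ee <figcaption>#ddddd';
// $0 refers to the whole match, so no capturing group is needed.
$subst = '<span class="highlight">$0</span>';
$result = preg_replace($re, $subst, $str);
echo $result;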

See the PHP IDEONE demo
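If only the list of hashtags is needed rather than a replacement, the same pattern can presumably be passed to `preg_match_all`. A short sketch reusing the sample string from the code above (the variable name `$hashtags` is just for illustration):

```php
<?php
// Same pattern and sample string as in the answer's code.
$re = '~(?:<figcaption|(?!^)\G)[^<#]*(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*\K#\w+~';
$str = '<figcaption>blah # blah #hashtag1, #hashtag2</figcaption> #ee <figcaption>#ddddd';

// Thanks to \K, each full match in $m[0] is just the hashtag itself.
preg_match_all($re, $str, $m);
$hashtags = $m[0];

print_r($hashtags); // #hashtag1, #hashtag2, #ddddd
```

`#ee` is skipped because it sits outside any figcaption, while `#ddddd` is still found after the unclosed opening tag.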

Community
Wiktor Stribiżew
  • Waw thanks so much! It sounds awesome and working... Can you please help me better understand your regex step by step and help me capture the hashtag for later replacement? – antoni Jul 20 '16 at 07:48
  • You do not need to *capture* the hashtag, it is *matched* this way. Even if the tag is broken, this will always match hashtags after the opening `<figcaption`. – Wiktor Stribiżew Jul 20 '16 at 07:49
  • Waw trying on your example I can make the substitutions I want. Perfect! Awesome! Just wish you could split your pattern into more explanations, it's gonna take a while to understand haha – antoni Jul 20 '16 at 07:54
  • Please let me know which parts of the regex are unclear, I will add more details to the answer. – Wiktor Stribiżew Jul 20 '16 at 07:56
  • I am reading your docs, I am already confused by atomic groups. If you can help me here: http://stackoverflow.com/questions/38476177/regex-atomic-group-purpose :) – antoni Jul 20 '16 at 08:30
  • **I do not have an atomic group in the pattern above**. Which parts of the regex do you need explained? I appreciate your desire to study by yourself, just let me know what you need to learn. – Wiktor Stribiżew Jul 20 '16 at 08:32
  • I know but trying to understand your pattern following your docs I was already stuck with previous step atomic group lol. from the link you gave http://stackoverflow.com/a/37343088/2012407 i went to http://www.rexegg.com/regex-quantifiers.html#tempered_greed and a bit upper there is the atomic group – antoni Jul 20 '16 at 08:34
  • Check ***More details*** I added after the first subpattern. Should I add similar explanations after each? – Wiktor Stribiżew Jul 20 '16 at 08:39
  • Thanks for more details. I understand lookarounds, grouping, capturing, modifiers, `\B`, `[chars]`, and `[^chars]`. but can you detail more on `\K` and why double grouping `(?:(?:<(?!\/figcaption>)|#\B)[^<#]*)*` please? also I do need to **capture** hashtags because I need to wrap in span. – antoni Jul 20 '16 at 08:49
  • See my update. As I said, there is **no need to capture** at all since the match is a 0th group already that you can backreference with `$0`. – Wiktor Stribiżew Jul 20 '16 at 08:58
  • amazing. `$0` of course! Thanks so much for your expertise and sharing! – antoni Jul 20 '16 at 09:09
  • @ wiktor: another challenge for you :) I wanted to help there: http://stackoverflow.com/questions/38288803/textmate-regex-find-word-based-on-a-pattern-that-only-exists-in-declaration#comment63996596_38288803 but still can't come out with a solution. As I think it is a similar problem can you illuminate us? In the meantime thanks to u I learnt `\G`, `\K`, tempered greedy token, atomic group and unrolling method! wow a lot! I still have to play with those before being confident! – antoni Jul 20 '16 at 15:47