PHP Regexp capturing repeating group of chars, e.g. hahaha jajajaja hihihi

Question

As the title says, is there a way in PHP, using preg_match_all, to catch all repetitions of a group of characters? For instance, catch:

hahahaha

jajajaj

hihihi

It's fine to catch repetitions of any characters, like abababab or acacacacac. Also, is there a way to count the number of repetitions?

The idea is to catch all these "forms" of smiling on social media. I figured out that there are also other cases, such as misspelled instances like ahahhahaah (where you have two consecutive a or h). Any ideas?

I would start with `/(??){2,}/` and then do further processing in PHP. That said, it may well be possible to do it entirely in a regex (i.e. to check the two chars in the pattern are not the same). — halfer, Jul 14 '14 at 23:11
I don't fully understand what you need. You can write a regex with the group you want and count the matches. Check it here: http://regex101.com/r/tS9eP6/1. In this example, for `ha` you have 4 matches — Federico Piazza, Jul 14 '14 at 23:13
@Fede The string `hahaasdfhaha` wouldn't separate the two `haha`s, and would instead print it as `4` repetitions. — Max, Jul 15 '14 at 00:14
@Max so, check http://regex101.com/r/tS9eP6/2 is that valid? — Federico Piazza, Jul 15 '14 at 00:47
@Fede Oops, that was fully my fault - I missed how you used a `+` (even though I should've thought for a bit longer, and realized that such an addition isn't that complicated). Anyhow, yours works fine then, as long as he knows exactly which two characters to match. EDIT: Wait, just checked - did you mean `(ha)+` instead of `(ha+)`? I think that's what confused me a bit — Max, Jul 15 '14 at 00:50
@Max lol, actually I said `(ha+)`. You can check both in the link above in my comment. But I still don't understand this question. A regex to match all those 3 options can be `(ha+|ja+|hi+)` — Federico Piazza, Jul 15 '14 at 01:00
@Fede `(ha+)` matches `haaaa` and `haaaaaaaa`, while `(ha)+` matches `hahaha` and `hahahahaha`, so `(ha+)` (what you wrote) doesn't really make that much sense in this case. I think I understood what he meant (check my answer to see what I mean), and if that's the case, `(ha)+` works as long as the repeated letters are `h` and `a`, but not for anything else. `(ha+)` works for matching one or more "a":s after an "h", but doesn't catch nor group the repetitions in any way. — Max, Jul 15 '14 at 01:05
Hey Mauro, did any of the answers help, or are you still having problems with it? If so pls let us know so we can tweak. :) — zx81, Jul 16 '14 at 22:24
Thank you guys. The idea was to catch all these "forms" of smiling on social media. But now I figured out that there are still some missing cases, for instance misspelled ones like ahahhahaah (where you have two consecutive a or h in the middle of the previous pattern). Any ideas? — Mauro, Aug 04 '14 at 14:12

Max · Answer 1 · 2014-07-15T00:22:35.083

How about this:

preg_match_all('/((?i)[a-z])((?i)[a-z])(\1\2)+/', $str, $m);
$matches = $m[0]; //$matches will contain an array of matches

It's a bit complicated, but it does work. To explain: the first subpattern (`((?i)[a-z])`) matches any letter from a to z, regardless of case. The second subpattern (`((?i)[a-z])`) does the same thing. The third subpattern (`(\1\2)+`) matches one or more repetitions of the first two letters, in the same case as they originally appeared. This regular expression also assumes that the match ends on a complete pair, i.e. that it has an even number of characters. If you don't want that, you can add `\1?` at the end, meaning that (as long as there are one or more repetitions) the match can end with the first character; for instance, `hahah` and `ikikikik` would both be valid, but not `asa`.

To retrieve the number of repetitions for a specific match, you can do:

$numb = strlen($matches[$index])/2 - 1; //-1 because the first two letters aren't repetitions
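
As a rough end-to-end sketch of the above (the function name `findTwoCharRepeats` and the sample string are made up for illustration; the pattern is the one from this answer with the optional trailing `\1?` added), the matching and counting steps could be combined like this:

```php
<?php
// Hypothetical helper, not part of the original answer.
// Returns every two-letter repetition together with its repetition count.
function findTwoCharRepeats(string $str): array
{
    // Pattern from the answer, plus \1? so an odd trailing letter is allowed.
    $pattern = '/((?i)[a-z])((?i)[a-z])(\1\2)+\1?/';
    preg_match_all($pattern, $str, $m);

    $result = [];
    foreach ($m[0] as $match) {
        $result[] = [
            'match' => $match,
            // Pairs after the first one; intdiv() ignores an odd trailing letter.
            'repetitions' => intdiv(strlen($match), 2) - 1,
        ];
    }
    return $result;
}

foreach (findTwoCharRepeats('hahaha lol jajajaj asa hihihi') as $row) {
    echo $row['match'], ' => ', $row['repetitions'], "\n";
}
```

Note that nothing in this pattern forces the two letters to differ, so a run such as `aaaa` would be reported as well.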

score 1 · Answer 2 · edited May 23 '17 at 12:11

For the shortest repetition (e.g. ha gets repeated multiple times in hahahaha):

(.+?)\1+

See demo.

For the longest repetition (e.g. haha gets repeated in hahahaha):

(.+)\1+
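
To see the difference between the two patterns, here is a small sketch in PHP (the variable names are arbitrary). Both patterns match the same overall text; only the content of Group 1 differs:

```php
<?php
$subject = 'hahahaha';

// Lazy quantifier: Group 1 ends up as the shortest unit that repeats.
preg_match('/(.+?)\1+/', $subject, $lazy);

// Greedy quantifier: Group 1 ends up as the longest unit that repeats.
preg_match('/(.+)\1+/', $subject, $greedy);

printf("lazy:   match=%s unit=%s\n", $lazy[0], $lazy[1]);
printf("greedy: match=%s unit=%s\n", $greedy[0], $greedy[1]);
```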

Counting Repetitions

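The non-regex solution is to compare the lengths of Group 1 (the repeated token) and the overall match: dividing the length of the overall match by the length of Group 1 gives the number of times the token occurs.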
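
A minimal sketch of that length comparison in PHP, assuming the lazy pattern from above (the helper name `countRepeats` and the array keys are invented for this example):

```php
<?php
// Hypothetical helper: finds repeated tokens and counts their occurrences.
function countRepeats(string $str): array
{
    // PREG_SET_ORDER groups the full match and Group 1 per match.
    preg_match_all('/(.+?)\1+/', $str, $matches, PREG_SET_ORDER);

    $out = [];
    foreach ($matches as $m) {
        $out[] = [
            'match' => $m[0],
            'unit'  => $m[1],
            // Total occurrences of the unit, including the first one.
            // strlen() counts bytes; for multibyte text, the /u flag and
            // mb_strlen() would be the safer choice.
            'count' => intdiv(strlen($m[0]), strlen($m[1])),
        ];
    }
    return $out;
}

foreach (countRepeats('hahahaha jajaja hihihi') as $row) {
    echo $row['unit'], ' x ', $row['count'], "\n";
}
```

The count here includes the first occurrence of the token; subtract 1 if only the repeats after it should be counted.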

With pure regex, in .NET, you could simply do `(.+?)(\1)+` and look at the number of captures in the Group 2 CaptureCollection object (Group 1 is only captured once; the quantified Group 2 collects one capture per repeat).

In PHP, that's not possible, but there are some hacks. See, for instance, this question about matching a line number—it's the same technique. This is for "study purposes" only—you wouldn't want to use that in real life.

answered Jul 15 '14 at 01:08

zx81

This is a very good technique. As always it's a pleasure learning from you – Federico Piazza Jul 15 '14 at 01:16
Good technique for general repetitions, as long as you keep in mind that it also matches `asdfasdf` and `hellohello`, not just character-repetitions. – Max Jul 15 '14 at 11:01
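
For the misspelled variants raised in the question's follow-up (e.g. ahahhahaah), neither answer gives a pattern. One possible approach, offered here only as an untested-in-the-wild sketch and not as either author's method, is to let each of the two letters repeat one or more times while still requiring them to alternate. The function name `findSloppyLaughs` and the sample text are assumptions made for illustration:

```php
<?php
// Hypothetical helper: matches whole words built from exactly two distinct
// letters that alternate, where each letter may be doubled or tripled.
function findSloppyLaughs(string $text): array
{
    // ([a-z])\1*        first letter, possibly repeated
    // (?!\1)([a-z])\2*  a different second letter, possibly repeated
    // (?:\1+\2+)+       at least one more round of first-then-second
    // \1*               optional trailing run of the first letter
    $pattern = '/\b([a-z])\1*(?!\1)([a-z])\2*(?:\1+\2+)+\1*\b/i';
    preg_match_all($pattern, $text, $m);
    return $m[0];
}

print_r(findSloppyLaughs('lol ahahhahaah haha banana hahah'));
```

Like the patterns in the answers, this will also pick up ordinary words that happen to have the same shape (mama, papa), so some filtering in PHP afterwards is probably still needed.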