
As the title says, is there a way in PHP, with preg_match_all, to catch all repetitions of a group of characters? For instance, to catch:

  1. hahahaha
  2. jajajaj
  3. hihihi

It's fine to catch repetitions of any characters, like abababab or acacacacac. Also, is there a way to count the number of repetitions?

The idea is to catch all these "forms" of smiling on social media. I figured out that there are also other cases, such as misspelled instances like ahahhahaah (where you have two consecutive a or h). Any ideas?

Mauro
  • I would start with `/(??){2,}/` and then do further processing in PHP. That said, it may well be possible to do it entirely in a regex (i.e. to check the two chars in the pattern are not the same). – halfer Jul 14 '14 at 23:11
  • I don't fully understand what you need. You can write a regex with the group you want and count the matches. Check it here: http://regex101.com/r/tS9eP6/1. In this example, for `ha` you have 4 matches. – Federico Piazza Jul 14 '14 at 23:13
  • What would be your desired output? – hwnd Jul 14 '14 at 23:17
  • @Fede The string `hahaasdfhaha` wouldn't separate the two `haha`s, and would instead print it as `4` repetitions. – Max Jul 15 '14 at 00:14
  • @Max so, check http://regex101.com/r/tS9eP6/2 is that valid? – Federico Piazza Jul 15 '14 at 00:47
  • @Fede Oops, that was fully my fault - I missed how you used a `+` (even though I should've thought for a bit longer, and realized that such an addition isn't that complicated). Anyhow, yours works fine then, as long as he knows exactly which two characters to match. EDIT: Wait, just checked - did you mean `(ha)+` instead of `(ha+)`? I think that's what confused me a bit – Max Jul 15 '14 at 00:50
  • @Max lol, actually I said (ha+). You can check both in the link above in my comment. But I still don't understand this question. A regex to match all those 3 options can be `(ha+|ja+|hi+)` – Federico Piazza Jul 15 '14 at 01:00
  • @Fede `(ha+)` matches `haaaa` and `haaaaaaaa`, while `(ha)+` matches `hahaha` and `hahahahaha`, so `(ha+)` (what you wrote) doesn't really make that much sense in this case. I think I understood what he meant (check my answer to see what I mean), and if that's the case, `(ha)+` works as long as the repeated letters are `h` and `a`, but not for anything else. `(ha+)` works for matching one or more "a":s after an "h", but doesn't catch nor group the repetitions in any way. – Max Jul 15 '14 at 01:05
  • @Max oh yes, sorry I messed my head... you were right – Federico Piazza Jul 15 '14 at 01:11
  • Hey Mauro, did any of the answers help, or are you still having problems with it? If so pls let us know so we can tweak. :) – zx81 Jul 16 '14 at 22:24
  • Thank you guys. The idea was to catch all these "forms" of smiling on social media. But now I figured out that there are still some missing cases, for instance misspelled ones like ahahhahaah (where you have two consecutive a or h in the middle of the previous pattern). Any ideas? – Mauro Aug 04 '14 at 14:12

2 Answers

How about this:

preg_match_all('/((?i)[a-z])((?i)[a-z])(\1\2)+/', $str, $m);
$matches = $m[0]; //$matches will contain an array of matches

A bit complicated, but it does work. To explain: the first subpattern, ((?i)[a-z]), matches any single letter from a to z, regardless of case. The second subpattern, ((?i)[a-z]), does the same thing. The third subpattern, (\1\2)+, matches one or more repetitions of those first two letters, in the same case as they originally appeared. This regular expression also assumes that the match consists of whole pairs, so it always has an even length. If you don't want that, you can add \1? at the end, which lets a match end with the first character as long as it contains at least one repetition (for instance, hahah and ikikikik would both be valid, but not asa).
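As a quick check of the \1? variant described above, here is a minimal sketch. The sample strings are the ones from the explanation; the variable names are only illustrative.

```php
<?php
// Pattern from the answer, with the optional trailing \1 appended.
$pattern = '/((?i)[a-z])((?i)[a-z])(\1\2)+\1?/';

$samples = ['hahah', 'ikikikik', 'asa'];
$results = [];
foreach ($samples as $s) {
    // preg_match returns 1 when the pattern matches, 0 when it does not.
    $results[$s] = preg_match($pattern, $s, $m) ? $m[0] : null;
}

var_dump($results);
// 'hahah' and 'ikikikik' match in full; 'asa' gives null because
// it contains no complete repetition of the first two letters.
```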

To retrieve the number of repetitions for a specific match, you can do:

$numb = strlen($matches[$index])/2 - 1; //-1 because the first two letters aren't repetitions
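Putting the match and the count together might look like the sketch below. The input string is made up for illustration, and intdiv() assumes PHP 7 or later; it is used so that the result stays an integer even if the \1? variant produces an odd-length match.

```php
<?php
// Hypothetical sample text containing two "laughs".
$str = 'lol hahaha and then jajajaja';

preg_match_all('/((?i)[a-z])((?i)[a-z])(\1\2)+/', $str, $m);

$counts = [];
foreach ($m[0] as $match) {
    // Each pair is two characters long; subtract 1 because the
    // first pair is the original, not a repetition.
    $counts[$match] = intdiv(strlen($match), 2) - 1;
}

print_r($counts);
// hahaha => 2, jajajaja => 3
```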
Max
For the shortest repetition (e.g. ha gets repeated multiple times in hahahaha):

(.+?)\1+

See demo.

For the longest repetition (e.g. haha gets repeated in hahahaha):

(.+)\1+
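To see the difference between the two patterns in PHP, a small sketch (variable names are illustrative):

```php
<?php
$str = 'hahahaha';

// Lazy group: captures the shortest token that repeats.
preg_match('/(.+?)\1+/', $str, $lazy);

// Greedy group: captures the longest token that repeats.
preg_match('/(.+)\1+/', $str, $greedy);

echo $lazy[1], "\n";   // ha
echo $greedy[1], "\n"; // haha
// Both patterns match the whole string 'hahahaha' overall.
```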

Counting Repetitions

The non-regex solution is to compare the lengths of Group 1 (the repeated token) and the overall match.
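A minimal sketch of that length comparison follows. The function name countRepeats is hypothetical, the type declarations assume PHP 7.1 or later, and only the first match in the string is considered. The count includes the first occurrence of the token.

```php
<?php
// Hypothetical helper: how many times the shortest repeated token
// occurs in the first match, or null if nothing repeats.
function countRepeats(string $str): ?int
{
    if (!preg_match('/(.+?)\1+/', $str, $m)) {
        return null;
    }
    // Overall match length divided by the token length.
    return intdiv(strlen($m[0]), strlen($m[1]));
}

var_dump(countRepeats('hahahaha')); // int(4)
var_dump(countRepeats('hihihi'));   // int(3)
var_dump(countRepeats('xyz'));      // NULL
```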

With pure regex, in .NET, you could simply use (.+?)(\1)+ and look at the number of captures in the Group 2 CaptureCollection object (Group 1 itself is captured only once per match).

In PHP, that's not possible, but there are some hacks. See, for instance, this question about matching a line number—it's the same technique. This is for "study purposes" only—you wouldn't want to use that in real life.

zx81
  • This is a very good technique. As always it's a pleasure learning from you – Federico Piazza Jul 15 '14 at 01:16
  • Good technique for general repetitions, as long as you keep in mind that it also matches `asdfasdf` and `hellohello`, not just character-repetitions. – Max Jul 15 '14 at 11:01