
Let's say I have a random block of text:

EAMoAAQAABwEBAAAAAAAAAAAAAAABAgMFBgcIBAkBAQABBQEBAAAAAAAAAAAAAAAGAgMEBQcBCBAAAQMDAgMEBQcIBQgGCwEAAQACAxEEBSEGMRIHQVFhE3GBIhQIkaGxwTJCI9FScoKSojMV8GLCUxbhstKDo7M0ZHOTJEQlF/HiQ2PDVHSExEUmGBEBAAIBAgMDCAgCCgMBAQEAAAECAxEEITEFQRIGUWFxgZGhIhPwscHRMlIUB0Jy4fGCkqLCI1MVFrLSQ2IzF//aAAwDAQACEQMRAD8A7+QEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEEDwXkzpxHgusxi7NrnXF3G0NBLhzAkAeAqVH934r6bt57uTPSJ8ne1n2Rqycezy35VlRttwYu5DXNlLOcczOdpHM3hUUqtLs/wBxulZonXJ8vjp8caa+eOa5k6flrPLVcIbm3n/gytf4NcCVKtj1XbbqNcOSuT+W0W+pi3x2rzjRWWxUCAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAggV5It2Uy8GNYAWmW6kr5MDftO8T3BRXxR4s2/SccTb48lvw0jnPnn8tfP6o1Ze02ds08OERzlid+/P5Orp5BHEeFuxxa0Dxpx9a+fOu+Iup9Tmfm30p+Ss92vr/N6bat/t67fDyjWfLLG79pt45YpAA8NdUAg9ngolTFNbedtqWi0avVicv5bLKFr2kSRltHaahrXCnylZcd6k208rDy4ItxlkUr5+XnZE1zxq0h3KfUQqv1GWsxeI0tHKY1rPtjRgVivKZU7HebrS491ybX+TWnO7V7PEn7w+f0rpPhb9zdxt7Rj3szkx/n/AI6+n88f4vTyebno8Wr3qTGvun7mawSxzsbNC4Pje0Oa9pqCD2grv+3z0zUi9Ji1bRrEx2wjtqzWdJ5wqq8pEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQU

SPECIFICATIONS:

patternABC >= 2 characters = groupABC IF groupABC occurs more than once
groupABC + (groupABC)n = sequence WHERE n >= 1 AND sequence > 6 characters

** A sequence needs to be > 6 characters in order to be evaluated

BREAKDOWN:

How do I find any repeating patterns that occur in sequence?

QEBAQEBAQEBAQEBAQEBAQEBA

I also want to count how many times each group repeats:

QEBA QEBA QEBA QEBA QEBA QEBA = 6

Also the sequence must be > 6 characters in order to be evaluated:

NO GOOD: AA AA AA
GOOD: AA AA AA AA

It would be ideal if the output could be stored in an associative array, with duplicate entries removed:

QEBA => 6, AA => 4, QEBA => 3, AA => 8, (QEBA => 6)<- REMOVE

Does anyone have the time & the inclination to tackle this problem? You rock if you do!

abbotto
  • probable duplicate of [Regular Expression to detect repetition within a string](http://stackoverflow.com/q/943872), second answer. – mario Mar 10 '13 at 18:17
  • this regex: `/((\w+?)(\2))+/g` will match all repeating sequences of letters. Then I guess you need to manipulate them a bit. – Billy Moon Mar 10 '13 at 18:20
  • Thanks for quick replies! @BillyMoon That regex looks like it could definitely help. Thanks. – abbotto Mar 10 '13 at 18:26
  • "in groups of 2 or more"? Repeated 2 or more times? Or at least 2 characters to be repeated? What about `AAAAAABAAAAAAB`? Are there `AA`s, `AAAAAAB`s, or both repeating? What about `ABABABCBCBC`, should `AB`, `BC` or both be counted? – Qtax Mar 10 '13 at 18:30
  • patternABC >= 2 characters = groupABC; groupABC + (groupABC)n = sequence WHERE n >= 1 AND sequence > 6 characters. ** A sequence needs to be > 6 characters in order to be evaluated. I hope that clarifies things. – abbotto Mar 10 '13 at 19:28
  • Just modified the explanation of the question to clarify things a bit more. – abbotto Mar 10 '13 at 19:35

2 Answers

$str = 'EAMoAAQAABwEBAAAAAAAAAAAAAAABAgMFBgcIBAkBAQABBQEBAAAAAAAAAAAAAAAGAgMEBQcBCBAAAQMDAgMEBQcIBQgGCwEAAQACAxEEBSEGMRIHQVFhE3GBIhQIkaGxwTJCI9FScoKSojMV8GLCUxbhstKDo7M0ZHOTJEQlF/HiQ2PDVHSExEUmGBEBAAIBAgMDCAgCCgMBAQEAAAECAxEEITEFQRIGUWFxgZGhIhPwscHRMlIUB0Jy4fGCkqLCI1MVFrLSQ2IzF//aAAwDAQACEQMRAD8A7+QEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEEDwXkzpxHgusxi7NrnXF3G0NBLhzAkAeAqVH934r6bt57uTPSJ8ne1n2Rqycezy35VlRttwYu5DXNlLOcczOdpHM3hUUqtLs/wBxulZonXJ8vjp8caa+eOa5k6flrPLVcIbm3n/gytf4NcCVKtj1XbbqNcOSuT+W0W+pi3x2rzjRWWxUCAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAggV5It2Uy8GNYAWmW6kr5MDftO8T3BRXxR4s2/SccTb48lvw0jnPnn8tfP6o1Ze02ds08OERzlid+/P5Orp5BHEeFuxxa0Dxpx9a+fOu+Iup9Tmfm30p+Ss92vr/N6bat/t67fDyjWfLLG79pt45YpAA8NdUAg9ngolTFNbedtqWi0avVicv5bLKFr2kSRltHaahrXCnylZcd6k208rDy4ItxlkUr5+XnZE1zxq0h3KfUQqv1GWsxeI0tHKY1rPtjRgVivKZU7HebrS491ybX+TWnO7V7PEn7w+f0rpPhb9zdxt7Rj3szkx/n/AI6+n88f4vTyebno8Wr3qTGvun7mawSxzsbNC4Pje0Oa9pqCD2grv+3z0zUi9Ji1bRrEx2wjtqzWdJ5wqq8pEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQU';

preg_match_all( '/(\S{2,}?)\1+/', $str, $matches );

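// Remove duplicates (array_unique() preserves keys, so $key still indexes $matches[1])
$matches[0] = array_unique( $matches[0] );

$results = array();

foreach ( $matches[0] as $key => $value ) {
    // Only evaluate sequences longer than 6 characters
    if ( strlen( $value ) > 6 ) {
        $repeated = $matches[1][$key];
        // Number of times the group occurs in the sequence
        $results[] = array( $repeated => count( explode( $repeated, $value ) ) - 1 );
    }
}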

print_r($results); 

/*
[AA] => 7
[QEBA] => 93
[CAgI] => 18
[EBAQ] => 18
*/

The above assumes a sequence is composed of non-space characters.
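
If every run should be recorded separately as it occurs, including where it starts, a variant along these lines might work. This is only a sketch: `findRepeatedRuns()`, its `$minRunLength` parameter and the `group`/`count`/`offset` keys are names made up for illustration, and it assumes PHP 7.1 or later. It reuses the same regex and skips the `array_unique()` step so that identical runs in different places are all kept.

```php
<?php
/**
 * Report every run of a repeated group, in the order it occurs.
 * Hypothetical helper: the name, parameters and result keys are assumptions.
 */
function findRepeatedRuns(string $str, int $minRunLength = 7): array
{
    $runs = [];

    // PREG_SET_ORDER keeps each match together with its capture group;
    // PREG_OFFSET_CAPTURE adds the byte offset of every captured string.
    preg_match_all(
        '/(\S{2,}?)\1+/',
        $str,
        $matches,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE
    );

    foreach ($matches as $match) {
        [$run, $offset] = $match[0]; // full run and where it starts
        $group = $match[1][0];       // the repeated group

        // "> 6 characters" is the same as "at least 7 characters"
        if (strlen($run) < $minRunLength) {
            continue;
        }

        $runs[] = [
            'group'  => $group,
            // The run is the group followed by copies of itself,
            // so its length is an exact multiple of the group length.
            'count'  => intdiv(strlen($run), strlen($group)),
            'offset' => $offset,
        ];
    }

    return $runs;
}
```

For example, `findRepeatedRuns('xxQEBAQEBAQEBAyy AAAAAAAA')` should report `QEBA` with a count of 3 at offset 2 and `AA` with a count of 4 at offset 17, while `'AAAAAA'` yields nothing because the run is only 6 characters long.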

MikeM
  • I need to know each sequence & how many groups are contained within it, as it occurs. So if QEBA occurs 6 times in one place & 18 times in another place, that's how I want to store it. QEBA => 6, QEBA => 18 – abbotto Mar 10 '13 at 20:22
  • @o0110o. The output is stored in an array of associative arrays, each containing a sequence as the key, and the repeat count as its value. This means the same sequence can now be stored multiple times. – MikeM Mar 10 '13 at 22:01
  • Thanks for taking the time to help out. I will let you know how it goes. – abbotto Mar 11 '13 at 00:52
  • So as it turns out, what I was trying to achieve was similar to GZIP, so that's what I ended up using in the end. I tried these answers & I found that @MikeM gave a more robust answer that I was able to implement easily, although I really liked the regex examples provided by @ ka. Thanks again everybody! – abbotto Mar 11 '13 at 19:48

Get the sequences with preg_match_all('/(?:(.{6,})\1)/',$inputText,$sequences)
(note: sequences will be saved in $sequences)
Explained RegEx demo: http://regex101.com/r/rW4nE2

Use array_unique() to get rid of duplicates.

Loop through each sequence and:
Get the groups with preg_match_all('/(.+?)(\1)(\1)?/',$sequence,$groups)
Explained RegEx demo: http://regex101.com/r/pC3pB7

Use count() if you need to.
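
Strung together, the steps above could look roughly like this. It is a sketch only: `findSequencesAndGroups()` and the shape of the returned array are assumptions made for illustration, and the two patterns are used exactly as given.

```php
<?php
/**
 * Hypothetical helper that strings the steps together.
 * Returns: sequence => list of the distinct repeated groups found inside it.
 */
function findSequencesAndGroups(string $inputText): array
{
    // Step 1: find blocks of 6+ characters that are immediately repeated.
    preg_match_all('/(?:(.{6,})\1)/', $inputText, $sequences);

    // Step 2: get rid of duplicate sequences.
    $uniqueSequences = array_unique($sequences[0]);

    // Step 3: for each sequence, capture the shortest repeating group(s).
    $result = [];
    foreach ($uniqueSequences as $sequence) {
        preg_match_all('/(.+?)(\1)(\1)?/', $sequence, $groups);
        $result[$sequence] = array_values(array_unique($groups[1]));
    }

    return $result;
}

print_r(findSequencesAndGroups('xxQEBAQEBAQEBAQEBAyy'));
```

One thing to keep in mind: the first pattern only matches when a block of at least six characters is immediately repeated, so the shortest sequence it can report is 12 characters long. A run such as `AAAAAAAA` (8 characters) is not picked up by this approach.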

CSᵠ
  • you might also like [base64_decode()](http://www.php.net/manual/en/function.base64-decode.php) – CSᵠ Mar 10 '13 at 19:59
  • Thank-you so much! I will try this out & mark your answer as correct if it works. Also, is there any way to get preg_match_all('/(.+?)(\1)(\1)?/',$sequence,$groups) to output the smallest possible pattern? Should I just run it again using groups[0] for input? – abbotto Mar 10 '13 at 20:09
  • Excellent, this example threw me off a little: http://regex101.com/r/pC3pB7. Again, thank-you for your help. Thanks to everyone else that is trying to help as well! – abbotto Mar 10 '13 at 20:17
  • You're welcome. The only thing that's repeating there is `CAgI`. See this one also: http://regex101.com/r/jH8dV3 – CSᵠ Mar 10 '13 at 20:23