5

How do I make a regular expression to evaluate the following string?

TGATGCCGTCCCCTCAACTTGAGTGCTCCTAATGCGTTGC

and extract the pattern CTCCT.

The pattern must be 3 C's and 2 T's in any order.

I tried /[C | T]{5}/ but it matches CCCCT and TCCCC

Thanks in Advance.

le_m
  • 19,302
  • 9
  • 64
  • 74

3 Answers3

3

Compute all permutations of "CTCCT" and concatenate them to a regex:

CCCTT|CCTCT|CCTTC|CTCCT|CTCTC|CTTCC|TCCCT|TCCTC|TCTCC|TTCCC

This pattern can be optimized:

C(?:C(?:T(?:CT|TC)|CTT)|T(?:C(?:CT|TC)|TCC))|T(?:C(?:C(?:CT|TC)|TCC)|TCCC)

var regex = new RegExp(/C(?:C(?:T(?:CT|TC)|CTT)|T(?:C(?:CT|TC)|TCC))|T(?:C(?:C(?:CT|TC)|TCC)|TCCC)/g);

var string = "TGATGCCGTCCCCTCAACTTGAGTGCTCCTAATGCGTTGC";

console.log(regex.exec(string));

This pattern doesn't find overlapping matches, e. g. there would only be one match in CCCTTCCC.

To find overlapping matches, use lookahead:

C(?=C(?=T(?=CT|TC)|CTT)|T(?=C(?=CT|TC)|TCC))|T(?=C(?=C(?=CT|TC)|TCC)|TCCC)

var regex = new RegExp(/C(?=C(?=T(?=CT|TC)|CTT)|T(?=C(?=CT|TC)|TCC))|T(?=C(?=C(?=CT|TC)|TCC)|TCCC)/g);

var string = "CCCTTCCC";

while ((match = regex.exec(string)) != null) {
    console.log(match.index, string.substring(match.index, match.index + 5));
}

Regex can only deal with a fairly limited number of permutations. If you want to match segments of possibly arbitrary size, use a non-regex solution:

function c3t2_optimized(str) {
  var c = 0, t = 0;
  for (var i = 0; i < str.length; ++i) {
    var last = str.charAt(i);
    if (last == 'C') ++c;
    else if (last == 'T') ++t;
    if (i > 4) {
      var first = str.charAt(i - 5);
      if (first == 'C') --c;
      else if (first == 'T') --t;
    }
    if (c == 3 && t == 2) return i - 4;
  }
  return -1;
}

var string = "TGATGCCGTCCCCTCAACTTGAGTGCTCCTAATGCGTTGC";
      
console.log(c3t2_optimized(string));

Or the same as above, just as a generator stepping through all possibly overlapping matches:

function* c3t2_optimized(str) {
  var c = 0, t = 0;
  for (var i = 0; i < str.length; ++i) {
    var last = str.charAt(i);
    if (last == 'C') ++c;
    else if (last == 'T') ++t;
    if (i > 4) {
      var first = str.charAt(i - 5);
      if (first == 'C') --c;
      else if (first == 'T') --t;
    }
    if (c == 3 && t == 2) yield i - 4;
  }
}

var string = "CCCTTCCC";

for (i of c3t2_optimized(string)) {
  console.log(i, string.substring(i, i + 5));
}

Performance comparison: https://jsfiddle.net/24qguege/7/

Firefox 47:

Community
  • 1
  • 1
le_m
  • 19,302
  • 9
  • 64
  • 74
  • I'll upvote your answer since you put effort into it and compared it against the other one. I just want to note a few things for anybody who drops in. The non-regex answer is far better if you aren't calling this thousands of times. My tests were half your times (34 and 4955), so it took 0.05 milliseconds for the non-regex answer to run each time (100k iterations). If efficiency is truly needed, js isn't efficient but maybe you need it, and the optimized regex is far better because at its core it is just a deterministic finite state machine (it should be. haven't looked up the ?: characters). – Millie Smith Jun 15 '16 at 01:53
  • @MillieSmith Thanks. I added a non-regex answer which can handle arbitrarily large 'patterns'. – le_m Jun 15 '16 at 02:23
  • @le_m This is really hot man. Thanks. Do you want to see the whole problem? I have to search a string for all the possible combinations of CCCTT. – Ryan Pace Sloan Jun 15 '16 at 04:11
  • @le_m what would an algorithm to compute all the permutations look like? – Ryan Pace Sloan Jun 15 '16 at 04:13
  • @RyanPaceSloan Compute permutations: http://stackoverflow.com/a/37580979/1647737 - Keep only unique elements: http://stackoverflow.com/a/37821550/1647737 - For special cases such as CCCTT-permutations, there might even be more efficient algorithms – le_m Jun 15 '16 at 04:15
  • @RyanPaceSloan "Do you want to see the whole problem?" yes. Is it related to https://www.hackerrank.com/contests/hourrank-6/challenges/bear-and-steady-gene ? – le_m Jun 15 '16 at 04:46
  • I permutated all the variations of the string "CCCTT" and formed a regular expression with them. – Ryan Pace Sloan Jun 15 '16 at 05:24
  • http://stackoverflow.com/questions/17006808/find-all-permutations-of-a-string-in-c – Ryan Pace Sloan Jun 15 '16 at 05:44
3

This isn't the type of problem that is easily solved using Regular Expressions. It can be solved fairly straighforwardly with a simple function, however

 function c3t2(str) {
  var lowerCaseStr = str.toLowerCase();
  for (index = 0; index + 5 <= str.length; index++) {
    var substring = lowerCaseStr.substring(index, index + 5);
    var chars = substring.split("");
    if (chars.sort().join("") === "ccctt") {
      return index;
    }
  }

  return false;
}
Andrew Rueckert
  • 4,858
  • 1
  • 33
  • 44
  • I'm happy to see this answer because regex is overused. I can't upvote yet though, because it should be `index + 4 < str.length` – Millie Smith Jun 15 '16 at 01:24
  • Ah, you're right! Used `<=` instead, to emphasize the relationship with the `index + 5` below. – Andrew Rueckert Jun 15 '16 at 01:33
  • True. it's probably best to do a case insensitive comparison or just tolowercase the whole thing. – Millie Smith Jun 15 '16 at 01:37
  • 1
    Agreed. To-Lower-Cased it up front **for efficiency**. :P – Andrew Rueckert Jun 15 '16 at 01:43
  • 1
    See https://jsfiddle.net/24qguege/ for efficiency :P - perhaps you could speed this up by splitting the string beforehand and iterating over an array or just performing a simple string.search() for each unique permutation? – le_m Jun 15 '16 at 01:44
  • 1
    Haha, yeah. If I'm gunning for efficiency (https://jsfiddle.net/24qguege/3/), I can beat the regex, using some of the same tricks that the regex engine does. Readability kind of goes out the window, though, and I wanted to highlight how simple the solution could be. – Andrew Rueckert Jun 15 '16 at 02:02
  • @AndrewRueckert Looking at your fiddle https://jsfiddle.net/24qguege/3/ it seems we had the same idea^^ https://jsfiddle.net/24qguege/4/ - can you make it even faster than regex? – le_m Jun 15 '16 at 02:28
  • @AndrewRueckert Your fiddle is incomplete, right? It only matches patterns starting at indices which are multiples of 5. – le_m Jun 15 '16 at 02:47
1

There are online generators that will make a permutations list that are distinct using
letters. I used this tool here https://www.dcode.fr/permutations-generator and is descent.

Given that list I added an OR symbol | between each word which made it into a regex.

Taking this one step further I made it into a full regex trie which is what makes
all the search engines and word completion editors so fast.

There may be regex trie generators out there that can assist in this process. This is the fastest way to go. I used Regexformat 9 to create this trie and there are others. This one uses a refactor engine to compact it and make it faster.

The more distinct letters and the number of letters can add up where you need a tool
to construct these. Example: ABCDE generates 120 unique permutations. After constructing a trie it becomes:

A(?:B(?:C(?:DE|ED)|D(?:CE|EC)|E(?:CD|DC))|C(?:B(?:DE|ED)|D(?:BE|EB)|E(?:BD|DB))|D(?:C(?:BE|EB)|B(?:CE|EC)|E(?:BC|CB))|E(?:B(?:CD|DC)|C(?:BD|DB)|D(?:BC|CB)))|B(?:A(?:C(?:DE|ED)|D(?:CE|EC)|E(?:CD|DC))|C(?:A(?:DE|ED)|D(?:AE|EA)|E(?:AD|DA))|D(?:C(?:AE|EA)|A(?:CE|EC)|E(?:AC|CA))|E(?:A(?:CD|DC)|C(?:AD|DA)|D(?:AC|CA)))|C(?:A(?:B(?:DE|ED)|D(?:BE|EB)|E(?:BD|DB))|B(?:A(?:DE|ED)|D(?:AE|EA)|E(?:AD|DA))|D(?:B(?:AE|EA)|A(?:BE|EB)|E(?:AB|BA))|E(?:B(?:AD|DA)|A(?:BD|DB)|D(?:AB|BA)))|D(?:C(?:B(?:AE|EA)|A(?:BE|EB)|E(?:AB|BA))|B(?:C(?:AE|EA)|A(?:CE|EC)|E(?:AC|CA))|A(?:C(?:BE|EB)|B(?:CE|EC)|E(?:BC|CB))|E(?:A(?:BC|CB)|B(?:AC|CA)|C(?:AB|BA)))|E(?:A(?:B(?:CD|DC)|C(?:BD|DB)|D(?:BC|CB))|B(?:A(?:CD|DC)|C(?:AD|DA)|D(?:AC|CA))|C(?:B(?:AD|DA)|A(?:BD|DB)|D(?:AB|BA))|D(?:ABC|B(?:AC|CA)|C(?:AB|BA)|ACB))

which is extremely fast.

Note that there is another way to do this using regex without using a trie. It uses a little known way to count in regex.
It is a combination of a conditional (?(cond)yes|no) and a group quantity. (?:){5}

Since the problem requires 5 characters, three C and two T, using 5 capture variables as
flags, each distinct and one for each letter given a total count of letters .
A small regex can be crafted. Each time a letter is found, a capture variable flag gets set.
If no capture flags are available, the match fails.

Like this: (?:C(?(3)(?(2)(?(1)(?!)|())|())|())|T(?(5)(?(4)(?!)|())|())){5}
https://regex101.com/r/PEnzkR/1

Conditionals are supported by the JGsoft engine, Perl, PCRE, Python, and .NET. Ruby supports them starting with version 2.0. Languages such as Delphi, PHP, and R that have regex features based on PCRE also support conditionals.

Note that the performance is slower than making a full blown regex trie.
It is however limited by the number of capture groups, but any more than 10 is
highly unlikely needed.

sln
  • 2,071
  • 1
  • 3
  • 11