Is there any way or any library out there that can compute a JS RegEx from a set of strings that I want to be matched?

For example, I have this set of strings:

  • abc123
  • abc212

And generate abc\d\d\d ?

Or this set:

  • aba111
  • abb111
  • abc

And generate ab. ?

Note that I don't need a very precise RegEx; I just want one built from literal strings, . and .*

Anderson Green
  • Why `abc\d\d\d` and not `abc(?:123|212)`? – Mariano Sep 15 '15 at 03:54
  • Check [text2re](http://www.txt2re.com/) as suggested in http://stackoverflow.com/questions/6219790/need-a-regex-tool-that-suggests-expressions-based-on-selected-text – Mariano Sep 15 '15 at 03:59

2 Answers

Not in the general case. To pin down one specific grammar (regular expression), you would need every string that grammar produces, and for many grammars that set is infinite. A finite set of examples is consistent with many different regular expressions, so the input alone cannot tell you which one you are looking for. For the first set, for example, all of the following match:

abc[0-9][0-9][0-9]
abc[1-2][0-5][2-3]
abc[1-2][0-5][2-3]a*
abc\d*
abc\d+
abc\d+a*b*c*
...
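A quick way to see this is to test each candidate against the two inputs. This is only a minimal sketch: the `inputs`, `candidates` and `results` names are made up for illustration, and the `^(?:...)$` anchoring is an added assumption so that each pattern has to match the whole string.

```javascript
// The two inputs from the first example in the question.
const inputs = ['abc123', 'abc212'];

// Candidate patterns restated from the list above.
const candidates = [
  'abc[0-9][0-9][0-9]',
  'abc[1-2][0-5][2-3]',
  'abc[1-2][0-5][2-3]a*',
  'abc\\d*',
  'abc\\d+',
  'abc\\d+a*b*c*',
];

// Anchor each pattern so it must match the whole string (an assumption;
// the patterns in the list are unanchored).
const results = candidates.map((source) => {
  const re = new RegExp('^(?:' + source + ')$');
  return { source, matchesAll: inputs.every((s) => re.test(s)) };
});

for (const { source, matchesAll } of results) {
  console.log(source + ' -> ' + matchesAll);
}
```

Each candidate reports `true` for both inputs.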

Many more are possible. That being said, you can still find a grammar that happens to match that set's conditions. One way is to simply brute-force the similarities and differences of the input items, position by position. For the second example:

  • aba111
  • abb111
  • abc

The ab part is the same for all of them, so we start with ab as the regexp. The next character can be a, b or c, so we add (a|b|c). Each of the three remaining positions is either 1 or empty, so we add (1|) three times. That results in:

ab(a|b|c)(1|)(1|)(1|)

Which is a correct regular expression, but maybe not the one you wanted.
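The position-by-position idea can be sketched roughly as follows. The `bruteForcePattern` and `escapeRegExp` helpers are hypothetical names introduced for this illustration, and escaping special characters is an addition that the hand-worked example did not need.

```javascript
// Escape characters that are special inside a regular expression.
function escapeRegExp(ch) {
  return ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a pattern column by column: a literal when every string agrees,
// otherwise an alternation, with an empty branch for strings that are too short.
function bruteForcePattern(strings) {
  const maxLen = Math.max(...strings.map((s) => s.length));
  let pattern = '';
  for (let i = 0; i < maxLen; i++) {
    const seen = [];       // distinct characters at column i, in first-seen order
    let hasEmpty = false;  // true if some string ends before column i
    for (const s of strings) {
      if (i >= s.length) {
        hasEmpty = true;
      } else if (!seen.includes(s[i])) {
        seen.push(s[i]);
      }
    }
    const alternatives = seen.map(escapeRegExp);
    if (hasEmpty) alternatives.push('');
    pattern += alternatives.length === 1
      ? alternatives[0]
      : '(' + alternatives.join('|') + ')';
  }
  return pattern;
}

const pattern = bruteForcePattern(['aba111', 'abb111', 'abc']);
console.log(pattern); // ab(a|b|c)(1|)(1|)(1|)
```

Because every column is decided independently, the result also accepts combinations that were never in the input, such as abc11.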

Spencer Wieczorek
  • `ab(a|b|c)(1|)(1|)(1|)` matches "`abc11`", not included in the list :) – Mariano Sep 15 '15 at 04:06
  • @Mariano The point is to match the elements in the set (not to match them and them only), it doesn't really matter if it happens to match other things. The string `"aba11"` is a sub-string of `"aba111"`, so it's not really surprising it also matches. That just happens to be the result of the approach I've mentioned since it's done character by character. – Spencer Wieczorek Sep 15 '15 at 04:11
  • I clarified the question, I don't need too much complexity. – Rui Nelson Magalhães Carneiro Sep 15 '15 at 04:38
  • @SpencerWieczorek: If the point is to match the elements in the set, and it doesn't matter if it matches other things, use an empty pattern; it will most certainly match them. As you said, it can't be done "without producing all the possible outcomes of a certain Grammar". That's why [text2re](http://www.txt2re.com/) requires the user to select the constructs. If you intended to **generalize** text extraction by a genetic algorithm, then input of failure cases is most certainly needed for a reasonable outcome, as in [Regex Golf](http://regex.inginf.units.it/golf/) – Mariano Sep 15 '15 at 04:46
  • @Mariano I see your point. I guess my point is to rather have an expression that has meaning, along with not being too *"crude"*. For example, for the second set of input items you could simply have `(aba111|abb111|abc)`, which works fine and only matches those items but isn't really "helpful". – Spencer Wieczorek Sep 15 '15 at 04:51
  • Of course, but then again it has a particular *meaning* to the user. I believe generalization by GA or golf falls outside the scope of this question. Just to clarify, I believe you gave a good answer and in my first comment I was simply trying to emphasize its complexity with a bit of humour. – Mariano Sep 15 '15 at 05:10

Maybe this is too simple, but you can use this:

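var arr = ['abc121', 'abc212', 'cem23'];
var regex_arr = [];

// Escape a character so it can be used literally inside a pattern.
function escapeChar(c) {
    return c.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Longest string first, so arr[0] covers every position that needs inspecting.
arr.sort(function (a, b) { return b.length - a.length; });

for (var i = 0; i < arr[0].length; i++) {
    for (var j = 0; j < arr.length; j++) {
        if (i >= arr[j].length) {
            // A shorter string has ended: mark this position with '*'.
            regex_arr[i] = {value: '', reg: '*', use_self: false};
        } else {
            var c = arr[j][i];
            var current_r;

            // Classify with regex tests; isNaN(' ') is false, so isNaN would treat a space as a digit.
            if (/^\d$/.test(c)) {
                current_r = '\\d';
            } else if (/^\w$/.test(c)) {
                current_r = '\\w';
            } else {
                current_r = '\\W';
            }
            //... may be more control

            if (!regex_arr[i]) {
                // First character seen at this position: keep it as a literal for now.
                regex_arr[i] = {value: c, reg: current_r, use_self: true};
            } else if (regex_arr[i].value !== c) {
                // Characters differ: fall back to the class, or to '.' if the classes differ too.
                if (regex_arr[i].reg !== current_r) {
                    regex_arr[i].reg = '.';
                }
                regex_arr[i].use_self = false;
                regex_arr[i].value = c;
            }
        }
    }
}

// Assemble the pattern; the '*' marker repeats the preceding token and ends the pattern.
var result = '';
for (var k = 0; k < regex_arr.length; k++) {
    result += regex_arr[k].use_self ? escapeChar(regex_arr[k].value) : regex_arr[k].reg;
    if (regex_arr[k].reg === '*') {
        break;
    }
}
console.log("regex = " + result);

// Check the generated pattern against every input.
var r = new RegExp(result);
for (var n = 0; n < arr.length; n++) {
    console.log(arr[n] + ' = ' + r.test(arr[n]));
}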

Results

regex = \w\w\w\d\d*
abc121 = true
abc212 = true
cem23 = true
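
If only literal strings, . and .* are needed, as the question says, the same column-by-column idea can be cut down further. The following is only a rough sketch under that assumption; `simplePattern` and `escapeRegExp` are hypothetical helper names, not part of any library.

```javascript
// Escape characters that are special inside a regular expression.
function escapeRegExp(ch) {
  return ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Emit a literal where all strings agree, '.' where they differ,
// and a trailing '.*' when the strings have different lengths.
function simplePattern(strings) {
  if (strings.length === 0) return '';
  const lengths = strings.map((s) => s.length);
  const minLen = Math.min(...lengths);
  const maxLen = Math.max(...lengths);
  let source = '';
  for (let i = 0; i < minLen; i++) {
    const ch = strings[0][i];
    const allSame = strings.every((s) => s[i] === ch);
    source += allSame ? escapeRegExp(ch) : '.';
  }
  if (maxLen > minLen) source += '.*';
  return source;
}

console.log(simplePattern(['abc123', 'abc212']));        // abc...
console.log(simplePattern(['aba111', 'abb111', 'abc'])); // ab..*
```

Only the positions shared by every string are compared; anything past the shortest string is absorbed by the final .*.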
Cem Yıldız