I'm in the process of porting some PHP code to Node.js.
The issue concerns this PCRE regex:

/\/?_?[0-9]*_?([^\/\._]*)[_#*\-*\.?\p{L}\p{M}*]*$/u

(This regex matches `first` in `_4_first_ääää`, in `_first_äääää`, or in `first_äääää`.)

I'm using XRegExp in this context, but with no luck:

// lib/parser.js
var XRegExp = require('xregexp').XRegExp;

module.exports = { 
  getName : function(string){
    var name = XRegExp('\/?_?[0-9]*_?([^\/\._]*)[_#*\-*\.?\p{L}\p{M}*]*$');
    var matches = XRegExp.exec(string, name);
    if(matches && matches.length > 0){
      return matches[1];
    }
    else{
      return '';
    }
  }
};

And the test (mocha) that goes with it:

// test/test.js
var assert = require("assert");
var parser = require('../lib/parser.js');
describe('parser', function(){
  describe('#getName()', function(){
    it('should return the name contained in the string', function(){
      assert.equal('test', parser.getName('3_test'));
      assert.equal('test', parser.getName('test'));
      assert.equal('test', parser.getName('_3_test'));
      assert.equal('test', parser.getName('_3_test_ääää'));
      assert.equal('test', parser.getName('_3_test_boom'));
    })
  })
})

And the test results:

0 passing (5ms)
1 failing

1) parser #getName() should return the name contained in the string:

  AssertionError: "test" == "ääää"
  + expected - actual

  +ääää
  -test

This code matches ääää.
The commented line catches first, so I guess I'm misusing the Unicode character classes.

My question is: how can I make my original PCRE regex work in JavaScript?

Maybe there is a workaround?

m.pons
  • There's no such thing as "php regex". You probably mean a `perl compatible regular expression` (or `preg`). I must admit I'm not at all familiar with XRegExp, so I can't really help you with that. You'd need to look up the differences in parsing between the two. – Tularis Mar 26 '14 at 17:23
  • @tularis - usually people follow up unnecessary clarifications with something useful. – Anthony Mar 26 '14 at 17:29
  • Disclaimer: I don't know XRegExp, so I could be wrong. But you seem to be missing the delimiters in your first statement: `XRegExp('\`. You probably want something like `XRegExp('@/?_?...*$@')` instead. – Amal Murali Mar 26 '14 at 17:29
  • @Tularis You're right, sorry for the confusion. I've been searching the XregExp website for answers in parallel http://xregexp.com/syntax/#unicode, it seems to be supported – m.pons Mar 26 '14 at 17:29
  • @AmalMurali Thanks but it seems XregExp doesn't need them: http://xregexp.com/syntax/ – m.pons Mar 26 '14 at 17:31
  • Are you loading the [Unicode add-ons](http://xregexp.com/plugins/#unicode)? – Anthony Mar 26 '14 at 17:37
  • @Anthony it seems that this is just for browsers: http://xregexp.com/plugins/#unicode. In nodejs, if I understood correctly, it's all included, the same way it is in `xregexp-all.js` – m.pons Mar 26 '14 at 17:41
  • Would it be a serious undertaking to add the one line of code to confirm? – Anthony Mar 26 '14 at 17:44
  • @Anthony In the module documentation the author states that it is for browsers only https://www.npmjs.org/package/xregexp. So I assume it's all in there for the nodejs module. Maybe you have an idea on how to anyway specifically add it in node? I have no way to look at the module code at the moment, to make sure it's there – m.pons Mar 26 '14 at 18:04
  • The answer to your problem lies, already answered, [here](http://stackoverflow.com/questions/280712/javascript-unicode). – tenub Mar 26 '14 at 18:09
  • @tenub I've checked this thread, that's how I ended up using XregExp instead of `match()` Maybe you're pointing at something else in particular? – m.pons Mar 26 '14 at 18:18
  • @m.pons Javascript does not support unicode character classes. http://inimino.org/~inimino/blog/javascript_cset might help in devising the equivalent in javascript – Ron Rosenfeld Mar 26 '14 at 19:50
  • @RonRosenfeld Thanks a lot for the link, if I don't manage to make it work the original way, I'll definitely turn to that – m.pons Mar 27 '14 at 21:39

2 Answers

Put an anchor at the beginning:

^\/?_?[0-9]*_?([^\/\._]*)[_#*\-*\.?\p{L}\p{M}*]*$

Also you could remove the unnecessary escaping:

^/?_?[0-9]*_?([^/._]*)[-_#*.?\p{L}\p{M}]*$

Your regex also matches an empty string; maybe you want:

^/?_?[0-9]*_?([^/._]+)[-_#*.?\p{L}\p{M}]+$

According to your sample, it could be:

^/?(?:(?:_\d+)?_)?([^/._]+)[-_#*.?\p{L}\p{M}]+$
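As a rough way to compare these variants against the question's test strings, here is a minimal sketch. It assumes the native `u` flag with `\p{...}` support (Node.js 10 or later) rather than XRegExp, and the names `capture`, `anchoredStar`, and `anchoredPlus` are made up for illustration.

```javascript
// Inputs taken from the mocha test in the question.
const inputs = ['3_test', 'test', '_3_test', '_3_test_ääää', '_3_test_boom'];

// Anchored variant with the unnecessary escaping removed (`*` quantifiers).
const anchoredStar = /^\/?_?[0-9]*_?([^\/._]*)[-_#*.?\p{L}\p{M}]*$/u;

// Stricter variant: the name and the trailing part must both be non-empty.
const anchoredPlus = /^\/?_?[0-9]*_?([^\/._]+)[-_#*.?\p{L}\p{M}]+$/u;

// Hypothetical helper: return the first capture group, or null on no match.
function capture(pattern, string) {
  const matches = pattern.exec(string);
  return matches ? matches[1] : null;
}

// The `*` variant captures "test" for every input from the question.
const starResults = inputs.map((input) => capture(anchoredStar, input));
console.log(starResults);

// The `+` variant needs at least one trailing character, so for a bare
// "test" backtracking hands the final "t" to the trailing class.
console.log(capture(anchoredPlus, 'test'));         // tes
console.log(capture(anchoredPlus, '_3_test_boom')); // test
```

The stricter variants only suit inputs that always carry a suffix after the name; for the strings in the question's test, the `*` form is the one that keeps all five assertions passing.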
Toto
  • Thanks for your answer I will try it as soon as I'm not on mobile – m.pons Mar 26 '14 at 18:14
  • Thanks for improving my query, unfortunately when I switch my regex with one of these, I get a null `matches` variable. Just the fact of adding the `^` at the beginning gives me a null `matches`...?! – m.pons Mar 26 '14 at 20:39
  • @m.pons: That's almost certainly because there are other characters before it. Could you show a real string that you have to check? – Toto Mar 27 '14 at 12:01

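I finally managed to find the origin of the problem. The pattern is passed to XRegExp as a JavaScript string literal, so \p{L} and \p{M} each need another backslash: written with a single backslash, the string escape drops it and XRegExp only receives p{L}. Doubling the backslashes made the original regex work again.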

var unicodeWord = XRegExp('^\\p{L}+$');
unicodeWord.test('Русский'); // -> true
unicodeWord.test('日本語'); // -> true
unicodeWord.test('العربية'); // -> true

From the usage examples: https://github.com/slevithan/xregexp/blob/master/README.md#usage-examples
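To see the effect without installing anything, the following sketch reproduces both behaviours with the built-in `RegExp`. It assumes the native `u` flag (Node.js 10 or later) as a stand-in for XRegExp's Unicode support, so it illustrates the string-escaping issue rather than the XRegExp API itself; the two-argument `getName` helper is a hypothetical variation on the one in the question.

```javascript
// A string literal drops the backslash of an unknown escape such as "\p",
// so the regex engine receives "p{L}" instead of "\p{L}".
const singleBackslash = '\p{L}';
const doubledBackslash = '\\p{L}';
console.log(singleBackslash);  // p{L}
console.log(doubledBackslash); // \p{L}

// Pattern from the question, typed with single backslashes (the bug).
const brokenPattern = new RegExp('\/?_?[0-9]*_?([^\/\._]*)[_#*\-*\.?\p{L}\p{M}*]*$');

// Same pattern with doubled backslashes. Assumption: the native `u` flag
// stands in for XRegExp's Unicode support here.
const fixedPattern = new RegExp(
  '\\/?_?[0-9]*_?([^\\/\\._]*)[_#*\\-*\\.?\\p{L}\\p{M}*]*$',
  'u'
);

// Hypothetical helper mirroring getName from the question.
function getName(pattern, string) {
  const matches = pattern.exec(string);
  return matches ? matches[1] : '';
}

const sample = '_3_test_ääää';
console.log(getName(brokenPattern, sample)); // ääää
console.log(getName(fixedPattern, sample));  // test
```

With the single-backslash string, the trailing character class only contains the literal characters `p`, `{`, `L`, `}`, and `M` in place of the Unicode categories, which is why the match slides forward to `ääää`.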

m.pons