0

I want to match all URLs in a string and use a regular expression for this purpose. On some strings, I encountered that the RegEx matching fails with a segmentation fault. Here is a failing example: .D0.9B.D0.B5.D1.82.D0.BE.D0.BF.D0.B8.D1.81.D1.8C.C2.A0.D1.80.D1.83.D1.81.D1.81.D0.BA.D0.BE.D0.B9.C2.A0.D0.B8.D1.81.D1.82.D0.BE.D1.80.D0.B8.D0.B8_20_.D0.B8.D1.8E.D0.BD.D1.8F

The RegEx applied:

let string = ".D0.9B.D0.B5.D1.82.D0.BE.D0.BF.D0.B8.D1.81.D1.8C.C2.A0.D1.80.D1.83.D1.81.D1.81.D0.BA.D0.BE.D0.B9.C2.A0.D0.B8.D1.81.D1.82.D0.BE.D1.80.D0.B8.D0.B8_20_.D0.B8.D1.8E.D0.BD.D1.8F";
this.regEx = new RegExp(
    "(?:http(s)?:\/\/)?(?:www.)?(?:((?:[-a-z0-9]+[.])*){0,242}((?:[a-z2-7]{16}|[a-z2-7]{56})\.onion)(?::[0-9]{1,6})?)(\/(?:[^\"\'\s]+|$))?",
    "ugi"
);
this.regEx.exec(string);

If I apply the RegEx using other techniques (e.g. on regexr.com) where it seems to work. Executed on Node.js (v8.11.3), it results in the following exception:

Segmentation fault (core dumped)
npm ERR! code ELIFECYCLE
npm ERR! errno 139
npm ERR! uriextractor@0.0.1 debugHost: `node --inspect-brk --use_strict extractor.js`
npm ERR! Exit status 139
npm ERR! 
npm ERR! Failed at the uriextractor@0.0.1 debugHost script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR!     /home/user/.npm/_logs/2018-06-27T12_15_04_809Z-debug.log

Now I am wondering: Does anybody have an idea where the problem might come from? And how should I debug such an issue?

As of now, I tried to simplify the regex and visualizing it, since I find it easier to see where it might take a wrong turn. The Chrome Dev tools are not much of a help when it comes to debugging issues with the RegEx itself it seems.

jogli5er
  • 21
  • 1
  • 6
  • 1
    You need to double the backslashes for `\.` and `\s` or else use a regexp literal. – Pointy Jun 27 '18 at 14:41
  • 2
    I think a regex like yours is not the best way to do it. Even trying to match `(?:((?:[-a-z0-9]+[.])*){0,242})` on your string produces 174627 steps. Check if you can optimize some stuff. For example, `[.]` is just `.`. Check out for the quantifiers too. **Edit:** this can help : https://www.regular-expressions.info/catastrophic.html – Seblor Jun 27 '18 at 14:47
  • Also that test string appears to have nothing whatsoever to do with the sort of patterns the regex is designed to match. I suspect the problem is with that `*` in the middle but it's hard to say. – Pointy Jun 27 '18 at 14:48
  • @Seblor did you use some sort of tool to analyze that? I'm not questioning what you wrote; I'm just curious because a tool like that would be useful. *edit* oh awesome thanks. – Pointy Jun 27 '18 at 14:49
  • 2
    @Pointy I use http://regex101.com for building and testing regular expressions. – Seblor Jun 27 '18 at 14:49
  • @Pointy: I stripped down everything else that was working, I am not particularly interested in this string, except that it fails my regex ;) – jogli5er Jun 27 '18 at 15:02
  • @Seblor: Agree, I'll try to break it further down. For the [.]: sorry, this must have happened while simplifying the regex... – jogli5er Jun 27 '18 at 15:03
  • @Seblor: regex101.com is really helpful – jogli5er Jun 27 '18 at 15:08
  • You really need to use `^` and move the `$` outwards because otherwise this matches `http://www.foo.onion@www.evil.com/` and `http://evil.com/#http://www.foo.onion/` while unnecessarily rejecting `http://www.foo.onion./`. Be very careful trying to filter URLs with regular expressions. Maybe use https://nodejs.org/api/url.html to validate URLs. – Mike Samuel Jun 27 '18 at 15:15
  • Thank you all for your help, the hints about pulling out the quantifiers and verifying with regex101 helped much. Thank you also for pointing out other errors in the regex – jogli5er Jun 27 '18 at 21:42

0 Answers0