0

Is this regex value safe from extreme values in the context of ruby language, please ?

a = b.match(%r{(.*)/(.*)})

I mean there is no backtracking possible, right?

I tried benchamarking this value but I believe it ok to use it.

aaaaaaaaaaaaaaa/aaaaaaaaaaaaaaa
Sim
  • 103
  • 1
  • 6
  • 5
    Why do you think there's no backtracking? The first `.*` matches anything, including `/`, so it will first match everything it can before giving up one character by one until a `/` appears. Do you just want to [check if there's a `/` in the middle of the string](https://stackoverflow.com/q/8258517)? Or do you want to [split the string by the last `/`](https://stackoverflow.com/q/1844118)? – InSync Jul 04 '23 at 16:04
  • 2
    Can you be more specific? What do you consider "extreme values"? Where does your input come from? And what's the purpose of your regular expression? – Stefan Jul 04 '23 at 16:28
  • I mean by "extrame values" those values that may produce a ReDoS @Stefan – Sim Jul 04 '23 at 18:58
  • @InSync I am looking for ReDos possibility... – Sim Jul 04 '23 at 18:59
  • No there is no backtracking issues. The bad formula is two or more literals separated by the dreaded `.*`. If it finds a single `/` it goes to the end. using the final `.*` No backtracking in a catastrophic sense at all since the second `.*` is never encountered until a `/` is found. Doesn't matter that `.` will match `/`, the search progresses backward from the end until it finds the first `/` it comes across https://regex101.com/r/UauG28/1 as seen here taking only 12 steps. – sln Jul 04 '23 at 19:28
  • Backtracking is a core tenet of regular expressions. Certain situations can trigger Catastropic backtracking events. I'm sure your question relates to that. And no, that form is safe. – sln Jul 04 '23 at 19:36
  • 1
    Why not `part1, _, part2 = b.partition "/"` instead? Or even just `part1, part2 = b.split "/"`? Seems like a greedy capturing regexp for no useful purpose. Whether the current code triggers some condition or not, it's still a suboptimal approach to splitting a string on a single character. You could even use String methods without regexp and get similar results. – Todd A. Jacobs Jul 05 '23 at 15:55
  • @ToddA.Jacobs - at the heart of every c/c++ comparison function is a begin and end target pointer and a pointer to the string to find. These functions are core C lib functions common to all higher level languages that find strings. Being one step above assembly, `while ( *ptr1++ != *ptr2 )`. There is no difference computation wise between using split() vs using regex. It's a myth where _suboptimal_ may sound expertia, it's absolutely baloney, and is just your opinion that exists outside the realm of fact. – sln Jul 05 '23 at 17:57
  • @sln, pretty bold statement. Can you maybe confirm it with some benchmarks? Because I'm personally, inclined to believe Todd here: regexes have side load of parsing expression, and processing passed input string with parsed state machine. – markalex Jul 05 '23 at 18:09
  • @markalex - What did I say about C string primitives that you don't understand ? – sln Jul 05 '23 at 18:12
  • @sln, specifically part, how it's related to matching `(.*)/(.*)`. Are you implying, that regex engine used by ruby will optimize it somehow to mentioned primitives? – markalex Jul 05 '23 at 18:16
  • @markalex - You're confused about my point. All higher level languages are built on C. All higher level string find/search/split, et all use the exact same C primitives without exception. So if you find regex too challenging don't say its _suboptimal_. – sln Jul 05 '23 at 18:20
  • @sln, then I don't understand your complaint about Todd's comment. They suggested using split instead of regex. How is it related to C primitives being same over different languages? – markalex Jul 05 '23 at 18:25
  • @markalex - Then go ahead and benchmark split using regex vs just using compiled regex. They both use C code like `while ( *ptr1++ != *ptr2 )`, or call string compare C primitives like `strncmp()`. In fact every string find function does. So it's up to you to define the difference in performance. Something besides _suboptimal_ . – sln Jul 05 '23 at 18:35
  • 1
    I don't have Ruby (nor fresh Ruby knowledge) at hand. But [here](https://ideone.com/K2DvMW) is very small test in Python. It consistently shows 50% more time for regex way (for bigger arrays, too). Is that suboptimal enough for you? Also, if you intend on keeping this discussion, please feel free to move it to the chat. – markalex Jul 05 '23 at 19:01
  • I don't see any disparity https://ideone.com/H4l2Bz Maybe you could print out counts and result values along the way in the loops. Subtract intermediate assignments from the time. You know a little more info than suboptimal. – sln Jul 05 '23 at 19:20
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/254384/discussion-between-markalex-and-sln). – markalex Jul 05 '23 at 19:26
  • @sln, let's move to chat. – markalex Jul 05 '23 at 19:28
  • @markalex - Fwiw I added comments in the chat. – sln Jul 05 '23 at 21:19
  • @sln In general, Ruby String methods are much faster than Regexp methods regardless of the underlying VM implementation. There are many reasons for this, so feel free to open a related question, but operations like String#delete were benchmarked as more than 400% faster than String#gsub on Ruby 2.7.5. On 3.2.2 with YJIT, String#include? is about 13% faster than an unanchored Regexp with captures. Regexp objects do more than scan String objects, and many String methods are highly optimized. Plus, Ruby runs on multiple VMs so you need to benchmark *your* code on your VM. – Todd A. Jacobs Jul 05 '23 at 23:18

0 Answers0