
I would like to parse binary files in Raku using its regex / grammar engine, but I haven't found a way to do it because the input is coerced to a string.

Is there a way to avoid this string coercion and use objects of type Buf or Blob?

I was thinking it might be possible to change something in the Metamodel?

I know that I can use unpack, but I would really like to use the grammar engine instead, for more flexibility and readability.

Am I hitting an inherent limit of Raku's capabilities here?

And before someone tells me that regexes are for strings and that I shouldn't do it, I should point out that Perl's regex engine can match bytes as far as I know, and I could probably use it with Regexp::Grammars, but I would prefer to use Raku instead.

Also, I don't see any fundamental reason why regexes should be reserved for strings only: an NFA in automata theory isn't intrinsically made for characters rather than bytes.

ikegami
WhiteMist
  • See also [**Binary regex**](https://www.reddit.com/r/rakulang/comments/qqcnzr/binary_regex/) that's just been started in reddit sub /r/rakulang in response to your Q. It would be helpful if you read the links given in the post or at least engaged in discussion there about your "needs/wants/wishlist". This would be a way to document what you're after, which would be a valuable contribution to Raku, and you'd also be able to point others to it if/when you do whatever you plan to do in perl or whatever and need others to know what you're after. – raiph Nov 09 '21 at 23:02
  • Thanks for the info, I will go there! What I'm after is developing some tools for reverse engineering and binary analysis. For now I'm just training. I've written a little Perl script that extracts a subset of the ID3 tag information from mp3 files, using unpack. But this is not scalable for a bigger project, and that's why I would like to use regexes/grammars. – WhiteMist Nov 10 '21 at 06:06
  • While researching this, I found what I think is a bug in the regex engine. I've reported it here: https://github.com/rakudo/rakudo/issues/4535 – WhiteMist Nov 10 '21 at 06:06

1 Answer


> Is there a way to avoid this string coercion and use objects of type Buf or Blob?

Unfortunately not at present. However, one can decode the data as Latin-1: that encoding assigns a character to every byte value, so any byte sequence decodes without error, and the resulting string can then be matched using a grammar.
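A minimal sketch of that workaround, assuming a hand-built 10-byte ID3v2 header as input. The grammar `ID3Header`, its rule names, and the helper `bytes-of` are hypothetical names made up for illustration; only `decode`, `encode`, and `parse` are existing Raku methods.

```raku
# Hypothetical grammar for the 10-byte ID3v2 tag header; rule names are illustrative.
grammar ID3Header {
    token TOP     { <magic> <version> <flags> <size> }
    token magic   { 'ID3' }          # the three bytes 0x49 0x44 0x33
    token version { <byte> ** 2 }    # major and revision bytes
    token flags   { <byte> }
    token size    { <byte> ** 4 }
    token byte    { . }              # one character per byte after a Latin-1 decode (see the CR LF note)
}

# Hypothetical helper: turn matched text back into the byte values it came from.
sub bytes-of($match) { $match.Str.encode('latin-1').list }

# Sample data built by hand for this sketch, not read from a real file.
my $blob = Blob.new(0x49, 0x44, 0x33, 0x04, 0x00, 0x00, 0x00, 0x00, 0x02, 0x01);
my $m    = ID3Header.parse($blob.decode('latin-1'));

say bytes-of($m<version>);
say bytes-of($m<size>);
```

One caveat worth checking before relying on this: Raku strings work at the grapheme level and treat `"\r\n"` as a single grapheme, so a 0x0D 0x0A byte pair in the input would be matched by one `.` rather than two.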

> Also, I don't see any fundamental reason why regexes should be reserved for strings only: an NFA in automata theory isn't intrinsically made for characters rather than bytes.

There isn't one; it's widely expected that the regex/grammar engine will be rebuilt at some point in the future (primarily to deal with performance limitations), and that would be a good point to also consider handling bytes and also codepoint level strings (Uni).

Jonathan Worthington
  • `say "abbc" ~~ / "\x62"+ /; # 「bb」` – Elizabeth Mattijsen Nov 09 '21 at 19:26
  • Thank you for your answer. If this is not possible now, or only with a little hacking, at least I'm glad to know that it will be a consideration when designing the next regex/grammar engine. – WhiteMist Nov 10 '21 at 05:27
  • We can indeed decode from and encode to Latin-1 without losing any information, so for now this can be a solution. For example, we can match a byte order mark encoded in UTF-8 like this: – WhiteMist Nov 10 '21 at 05:52
  • `raku -e 'say (Blob.new(239, 187, 191).decode("latin-1") ~~ / ^ ("\x[EF]\x[BB]\x[BF]") $ /)[0].encode("latin-1")'` – WhiteMist Nov 10 '21 at 05:53