Is there any easy way to go about adding custom extensions to a Regular Expression engine? (For Python in particular, but I would take a general solution as well).
It might be easier to explain what I'm trying to build with an example. Here is the use case that I have in mind:
I want users to be able to match strings that may contain arbitrary ASCII characters. Regular Expressions are a good start, but aren't quite enough for the type of data I have in mind. For instance, say I have data that contains strings like this:
<STX>12.3,45.6<ETX>
where <STX> and <ETX> are the Start of Text/End of Text characters 0x02 and 0x03. To capture the two numbers, it would be very convenient for the user to be able to specify any ASCII character in their expression. Something like so:
\x02(\d\d\.\d),(\d\d\.\d)\x03
Here "\x02" and "\x03" match the control characters, and the first and second match groups are the numbers. So, something like regular expressions with just a few domain-specific add-ons.
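For reference, this first pattern needs no extension at all in Python (see the EDIT further down). A minimal sketch, assuming the data arrives as a `str` and using made-up sample values:

```python
import re

# Sample record: STX (0x02), two readings, ETX (0x03).
data = "\x0212.3,45.6\x03"

# Raw string, so the regex engine itself interprets \x02 and \x03.
pattern = re.compile(r"\x02(\d\d\.\d),(\d\d\.\d)\x03")

match = pattern.search(data)
first, second = match.groups()
print(first, second)
```

The same pattern should work on `bytes` input if both the pattern and the data are written as bytes literals.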
How should I go about doing this? Is this even the right way to go? I have to believe this sort of problem has been solved, but my initial searches didn't turn up anything promising. Regular expressions have the advantage of being well known, which keeps the learning curve down.
A few notes:
- I am not looking for a fixed parser for a particular protocol - it needs to be general and user configurable
- I really don't want to write my own regex engine
- Although it would be nice, I am not looking for "regex macros" where I create shortcuts for a handful of common expressions. (perhaps a follow-up question...)
- Bonus: Have you heard of any academic work on this, e.g. "Creating Domain Specific search languages"?
EDIT: Thanks for the replies so far. I hadn't realized Python's re supported arbitrary ASCII characters. However, this is still not quite what I'm looking for. Here is another example that hopefully gives the breadth of what I want in the end:
Suppose I have data that contains strings like this:
$\x01\x02\x03\r\n
where the bytes \x01\x02\x03 form two 12-bit integers (0x010 and 0x203). So how could I add syntax so the user could match it with a regex like this:
\$(\int12)(\int12)\x0d\x0a
where each \int12 pulls out 12 bits. This would be handy when searching for packed data.
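To make the idea concrete, here is a rough sketch of the kind of thing I imagine: a preprocessing step that rewrites the custom tokens into plain `re` syntax and unpacks the bits after the match. The names `compile_packed` and `match_packed` are hypothetical, and I'm assuming big-endian, most-significant-bit-first packing, that adjacent `(\intN)` fields add up to whole bytes, and that the pattern contains no other capturing groups. Under those assumptions the three bytes above split into 0x010 and 0x203.

```python
import re

# Hypothetical token syntax: (\intN) captures an N-bit unsigned integer.
_INT_RUN = re.compile(rb"(?:\(\\int\d+\))+")
_INT_WIDTH = re.compile(rb"\\int(\d+)")


def compile_packed(pattern):
    """Rewrite (\\intN) tokens into plain re syntax.

    Returns the compiled regex and, for each capturing group,
    the list of bit widths packed into it.
    """
    layouts = []

    def replace_run(m):
        widths = [int(w) for w in _INT_WIDTH.findall(m.group(0))]
        total = sum(widths)
        if total % 8:
            raise ValueError("adjacent \\int fields must fill whole bytes")
        layouts.append(widths)
        # [\s\S] matches any byte, including newlines.
        return b"([\\s\\S]{%d})" % (total // 8)

    translated = _INT_RUN.sub(replace_run, pattern)
    return re.compile(translated), layouts


def match_packed(regex, layouts, data):
    """Search data and return the unpacked integers, or None if no match."""
    m = regex.search(data)
    if m is None:
        return None
    values = []
    for raw, widths in zip(m.groups(), layouts):
        bits = int.from_bytes(raw, "big")
        shift = 8 * len(raw)
        for width in widths:
            shift -= width
            values.append((bits >> shift) & ((1 << width) - 1))
    return values


regex, layouts = compile_packed(rb"\$(\int12)(\int12)\x0d\x0a")
values = match_packed(regex, layouts, b"$\x01\x02\x03\r\n")
print([hex(v) for v in values])
```

This avoids writing a regex engine, but it is only string rewriting: it does not handle escaped parentheses, fields that straddle other pattern elements, or constraints on the integer values themselves, which is why I'm asking whether there is a more principled way.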