
What is the best way to convert a regular expression to a string which can be accepted by grep/sed in bash?

For example, given the following regular expression:

(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Bash does not like it, so the regular expression cannot be passed to grep:

$ echo "(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"
-bash: syntax error near unexpected token `('

$ echo '(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'
>
> ^C

I assume that the regular expression needs to be escaped, but I didn't find any good tool that can do it for me.

Any idea how I can get grep to use this regular expression in bash?

M--
Mr.
  • If i am not wrong, you are trying to use this regex to capture emails. In this aspect you could consider to use `grep -E -o "your regex here" file.txt` - Also check out this post: https://stackoverflow.com/questions/2898463/using-grep-to-find-all-emails – George Vasiliou Jun 19 '23 at 12:16
  • You need to quote the entire thing. Your string has a `"` inside it which ends the quotation early and messes things up. Escape the quotes inside the string with `\"` so that the entire string is quoted. – Verpous Jun 19 '23 at 12:16
  • @GeorgeVasiliou as you can see, i cannot surround the regex in double\single quotes. that is why i open the case. i wonder how can this be done :) – Mr. Jun 19 '23 at 12:41
  • @Verpous i know it needs to be escaped, that is what i wrote. i am after a command\tool that will do that for me without human interaction. – Mr. Jun 19 '23 at 12:41
  • @Mr. Where does the regex arrive from then, and how do you intend to use it? Can you edit your post with a more concrete example of your use case that illustrates this? – Verpous Jun 19 '23 at 15:34

2 Answers


Let's combine two useful Bash features to get there.

First, you can completely avoid the need to escape a string by using a Here Doc with a quoted delimiter (i.e. <<"separator"). Nothing between the delimiter lines is expanded or interpreted by the shell. For example, you can write something like this:

cat<<"EOF"
(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
EOF

Second, by wrapping that Here Doc in a function, you can capture its output with command substitution or store it in a variable. From that point, you can pass it directly to grep or sed.

For example:

function regex() {
cat<<"EOF"
(?:[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
EOF
}

# Quote the command substitution so the shell does not glob-expand or split the pattern.
echo "email@test.com" | grep -P "$( regex )"
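A variation on the same idea, as a minimal sketch: `read -r` can fill a variable straight from a quoted Here Doc, without the helper function. The variable name `pattern` and the shortened sample expression are placeholders, not the regex from the question. The sample is written to be valid for `grep -E`, so the sketch does not depend on PCRE support being compiled in.

```shell
#!/usr/bin/env bash
# Read one line of a quoted here-document into a variable.
# IFS= keeps leading/trailing whitespace, -r keeps backslashes literal,
# and the quoted delimiter ('EOF') stops the shell from expanding $, ` or quotes.
IFS= read -r pattern <<'EOF'
([a-z0-9!#$%&'*+/=?^_`{|}~-]+|"[^"]*")@[a-z0-9.-]+
EOF

# Show the pattern exactly as stored.
printf '%s\n' "$pattern"

# Always pass the variable in double quotes so it reaches grep unchanged.
grep -E -- "$pattern" <<< 'email@test.com'
```

For the original expression, the same assignment should work with `grep -P` in place of `grep -E`, assuming the installed grep supports `-P`.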

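Note that your regex requires a Perl-compatible regex engine (PCRE), hence grep -P. Escaped hexadecimal sequences inside bracket expressions (e.g. [\x70-\x7f]) are not supported by POSIX basic and extended regular expressions, the syntax used by plain grep and grep -E. There, a backslash inside brackets is an ordinary character, so that expression would match these characters instead: \, x, 7, the range 0-\, x, 7 and f. The (?: ... ) non-capturing groups are likewise not part of the POSIX syntax.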
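To make the bracket-expression point concrete, here is a small sketch, assuming GNU grep. The helper name `matches_ere` and the sample pattern `[\x41]` are made up for illustration. Under PCRE `\x41` would mean the letter A, but with `-E` the brackets simply list the four characters `\`, `x`, `4` and `1`.

```shell
#!/usr/bin/env bash
# In a POSIX bracket expression the backslash is an ordinary character,
# so with grep -E the set [\x41] contains \, x, 4 and 1 (not the letter A).
matches_ere() {
  # Succeeds if the string in $1 matches the fixed sample pattern.
  printf '%s\n' "$1" | grep -Eq '[\x41]'
}

if matches_ere 'A'; then echo 'A matches'; else echo 'A does not match'; fi
if matches_ere 'x'; then echo 'x matches'; else echo 'x does not match'; fi
```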

James
  • maybe it escapes the string, yet it affects the regex and thus the regex does not capture any email address – Mr. Jun 19 '23 at 13:07
  • Indeed. I modified explanation for how to use the regex directly with `egrep`, without the `printf '%q'` part. – James Jun 19 '23 at 13:50

The only thing you need to know is how to enclose a string in single quotes when the string itself contains single quotes. Let me simplify the string as an example:

O'Reilly

As you know, a backslash does not work to escape the single quote within single quotes:

str='O\'Reilly'         # wrong

Instead you can say:

str='O'\''Reilly'

It may look weird, but it is just a concatenation of 'O', \' and 'Reilly':

'O'      ... single quoted string "O"
\'       ... literal single quote
'Reilly' ... single quoted string "Reilly"

Then you can assign your regex to a variable with:

regex='(?:[a-z0-9!#$%&'\''*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'\''*+\/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'

# echo "$regex"

grep -P "$regex" <<< 'email@example.com'

Note that each of the two single quotes inside the regex is written as '\'', exactly as in the O'Reilly example above.
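Since the question asks for a way to do this without editing by hand, the replacement can be automated. The following is only a sketch: `shell_quote` is a hypothetical helper name, and it assumes the raw text is first read from a quoted Here Doc (or a file) so that the shell never has to parse it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print $1 as a single-quoted shell word,
# replacing every ' with '\'' (close quote, escaped quote, reopen quote).
shell_quote() {
  local escaped
  # Inside double quotes \\\\ becomes \\, which sed turns into one backslash.
  escaped=$(printf '%s' "$1" | sed "s/'/'\\\\''/g")
  printf "'%s'\n" "$escaped"
}

# Read the raw text without any shell interpretation.
IFS= read -r raw <<'EOF'
O'Reilly
EOF

# Print the form that can be pasted into a script after `str=`.
shell_quote "$raw"
```

Bash's built-in `printf '%q'` also produces a reusable shell word, though it uses backslash escapes rather than the single-quoted style shown here.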

tshiono