Ruby 1.9 regular expression to match (un)?quoted key-value assignment

Question

I want to match against key/value assignments in shell scripts, config files, etc., which may or may not be single-, double- or backtick-quoted, and which may or may not have a line-ending comment. For example, I want:

RAILS_ENV=production
# => key: RAILS_ENV, value: production

listen_address = 127.0.0.1 # localhost only by default
# => key: listen_address, value: 127.0.0.1

PATH="/usr/local/bin"
# => key: PATH, value: "/usr/local/bin" (or /usr/local/bin would be fine)

HOSTNAME=`cat /etc/hostname`
# => key: HOSTNAME, value: `cat /etc/hostname`

If you feel fancy, it can handle escaped quotes and # inside the quotes, but I don't think I'll run into any. If you feel differently fancy, you can make it all named-capture expanded-style and pretty:

CONFIG_LINE = %r{
  (?<export> export ){0}
  (?<key> [\w-]+ ){0}
  (?<value> \S* ){0}
  (?<comment> \#.*$ ){0}

  ^\s*(\g<export>\s+)?\g<key>\s*=\s*\g<value>\s*(\g<comment>)?$
 }x

but I think nobody really writes regexen like that..

I've seen Regex for quoted string with escaping quotes, but I'm not good enough to adapt any of those solutions to optional quotes; I don't quite see how to do "expect an end quote, and therefore allow internal spaces, if I had a start quote."

Edit: the Tin Man gave a practical answer, so now I'm looking for the purist answer. Throw some state machines at me, or tell me why it can't be done.

With the addition of the `CONFIG_LINE` assignment you are going to have a very hard time parsing all the assignments from a file from a single regex pattern that is easily maintained. — the Tin Man, Dec 26 '11 at 18:52
That's a regex, not something that needs to be matched by the regex! — Jay Levitt, Dec 26 '11 at 19:38
"I think nobody really writes regexen like that". Oh? Scan through some of the answers to "[What is the best regular expression for validating email addresses?](http://stackoverflow.com/questions/201323/what-is-the-best-regular-expression-for-validating-email-addresses/1917982#1917982)" — the Tin Man, Dec 26 '11 at 20:31
See the 50-point challenge for a regex-only version: http://stackoverflow.com/questions/8658722/challenge-regex-only-tokenizer-for-shell-assignment-like-config-lines — Jay Levitt, Dec 30 '11 at 16:44

the Tin Man · Accepted Answer · 2011-12-26T20:24:23.633

It's probably possible to do in one regex pattern, but I am a believer in keeping the patterns simple. Regex can be insidious and hide lots of little errors. Keep it simple to avoid that, then tweak afterwards.

text = <<EOT
RAILS_ENV=production
listen_address = 127.0.0.1 # localhost only by default
PATH="/usr/local/bin"
EOT

text.scan(/^([^=]+)=(.+)/)
# => [["RAILS_ENV", "production"], ["listen_address ", " 127.0.0.1 # localhost only by default"], ["PATH", "\"/usr/local/bin\""]]

Trimming off the trailing comment is easy in a subsequent map:

text.scan(/^([^=]+)=(.+)/).map{ |n,v| [ n, v.sub(/#.+/, '') ] }
# => [["RAILS_ENV", "production"], ["listen_address ", " 127.0.0.1 "], ["PATH", "\"/usr/local/bin\""]]

If you want to normalize all your name/values so they have no extraneous spaces you can do that in the map also:

text.scan(/^([^=]+)=(.+)/).map{ |n,v| [ n.strip, v.sub(/#.+/, '').strip ] }
# => [["RAILS_ENV", "production"], ["listen_address", "127.0.0.1"], ["PATH", "\"/usr/local/bin\""]]

What the regex "/^([^=]+)=(.+)/" is doing is:

"^" is "At the beginning of a line", which is the start of the string or the character after a "\n". This is not the same as the start of the string, which would be \A. The difference is important, so if you don't understand the two, it is a good idea to learn when and why you'd want to use one over the other. That's one of those places a regex can be insidious.
"([^=]+)" is "Capture one or more characters that are not an equal-sign".
"=" is obviously the equal-sign we were looking for in the previous step.
"(.+)" is going to capture everything after the equal-sign, up to the end of the line.

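To see the difference between ^ and \A concretely, here is a minimal sketch. The sample text and the narrowed \w+ key pattern are assumptions made for illustration; they are not part of the code above.

```ruby
# Two lines; only the second one is an assignment.
text = "# settings\nRAILS_ENV=production\n"

# ^ anchors at the start of any line, so the second line can match.
line_anchored = text[/^(\w+)=(.+)/]

# \A anchors only at the very start of the string, which here is the
# comment line, so nothing matches.
string_anchored = text[/\A(\w+)=(.+)/]

p line_anchored   # "RAILS_ENV=production"
p string_anchored # nil
```
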
I purposely kept the above pattern simple. For production use I'd tighten up the patterns a little using a "non-greedy" quantifier, along with a trailing "$" anchor:

text.scan(/^([^=]+?)=(.+)$/).map{ |n,v| [ n.strip, v.sub(/#.+/, '').strip ] }
# => [["RAILS_ENV", "production"], ["listen_address", "127.0.0.1"], ["PATH", "\"/usr/local/bin\""]]

+? is a non-greedy (lazy) quantifier: it matches as few characters as possible, so the key stops at the first '='. That is already implied by the use of [^=], but +? makes my intent more obvious. I can get away without the ?, so it's more of a self-documentation thing for later maintenance. In your use-case it should be benign, but it is a worthy thing to keep in your Regex Bag o' Tricks.
$ means the end of a line, i.e., the place immediately preceding the EOL ("\n"). It is not the end of the string, which would be \z. It's implied here too, because . does not match a newline, but inserting it in the pattern makes it more obvious that's what I'm searching for.
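
A small sketch of how these two tweaks behave in Ruby. The two-line sample text and the \w+ key are made up for illustration. In Ruby, $ matches before every "\n", while \z matches only at the very end of the string.

```ruby
text = "A=1\nB=2\n"

# $ matches just before each "\n", so both lines are found.
per_line = text.scan(/^(\w+)=(.+)$/)

# \z matches only at the very end of the string. "B=2" is still followed
# by "\n", so nothing matches.
end_only = text.scan(/^(\w+)=(.+)\z/)

# [^=] cannot cross an "=", so the greedy and lazy forms capture the same key.
greedy = "a=b=c".match(/^([^=]+)=(.+)$/).captures
lazy   = "a=b=c".match(/^([^=]+?)=(.+)$/).captures

p per_line # [["A", "1"], ["B", "2"]]
p end_only # []
p greedy   # ["a", "b=c"]
p lazy     # ["a", "b=c"]
```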

EDIT to track the OP's added test:

text = <<EOT
RAILS_ENV=production
listen_address = 127.0.0.1 # localhost only by default
PATH="/usr/local/bin"
HOSTNAME=`cat /etc/hostname`
EOT

text.scan( /^ ( [^=]+? ) = ( .+ ) $/x ).map{ |n,v| [ n.strip, v.sub(/#.+/, '').strip ] }
# => [["RAILS_ENV", "production"], ["listen_address", "127.0.0.1"], ["PATH", "\"/usr/local/bin\""], ["HOSTNAME", "`cat /etc/hostname`"]]

If I were writing this for myself I'd generate a hash for convenience:

Hash[ text.scan( /^ ( [^=]+? ) = ( .+ ) $/x ).map{ |n,v| [ n.strip, v.sub(/#.+/, '').strip ] } ]
# => {"RAILS_ENV"=>"production", "listen_address"=>"127.0.0.1", "PATH"=>"\"/usr/local/bin\"", "HOSTNAME"=>"`cat /etc/hostname`"}
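
Two edge cases are worth checking before relying on the simple pattern. The inputs below are hypothetical and do not appear in the examples above, which all behave correctly. First, [^=] also matches "\n", so a line without an equal-sign (a comment line, say) is swallowed into the next key. Second, the comment-stripping sub also fires on a # inside quotes.

```ruby
text = "# database settings\nDB_HOST=localhost\n"

# [^=] matches line breaks too, so the comment line becomes part of the key.
loose = text.scan(/^([^=]+)=(.+)/)

# Excluding \r and \n from the key keeps each match on a single line.
strict = text.scan(/^([^=\r\n]+)=(.+)/)

# The comment-stripping step cannot tell a quoted '#' from a real comment.
stripped = '"#fff"'.sub(/#.+/, '')

p loose    # [["# database settings\nDB_HOST", "localhost"]]
p strict   # [["DB_HOST", "localhost"]]
p stripped # "\""
```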

I would recommend changing the: `/^([^=]+)=(.+)/` to: `/^([^=\r\n]+)=(.+)/`. (The `[^=]+` will span multiple lines.) — ridgerunner, Dec 26 '11 at 17:44
"The [^=]+ will span multiple lines." except it isn't in the code. Notice that in all the examples it's doing the right thing. — the Tin Man, Dec 26 '11 at 17:53
Thanks.. I forgot the important test case, which is quoted spaces, which is what's really tripping me up here - I don't know how to carry state around ("we're inside quotes!") in a regex, though I'm sure it has to do with lookbehind. — Jay Levitt, Dec 26 '11 at 18:13
It's possible to do, but again, makes the regex look even more like line-noise, which isn't good for long-term maintenance. I added the additional test and its output. — the Tin Man, Dec 26 '11 at 18:15
sure, but handling that's a requirement ☺ though it can of course be in mixed Ruby/RE. — Jay Levitt, Dec 26 '11 at 18:17
See the added code. No change to my code, just an additional line in the source. — the Tin Man, Dec 26 '11 at 18:20
you don't need the splat and flatten - `Hash[text.scan( /^ ( [^=]+? ) = ( .+ ) $/x ).map{ |n,v| [ n.strip, v.sub(/#.+/, '').strip ] }]` works just fine — Marek Příhoda, Dec 26 '11 at 19:21
yeah, it had been mine, too ;) (until mu's too short' pointed it out) — Marek Příhoda, Dec 26 '11 at 20:43
FYI, 50-point pure-regex challenge is on now: http://stackoverflow.com/questions/8658722/challenge-regex-only-tokenizer-for-shell-assignment-like-config-lines — Jay Levitt, Dec 30 '11 at 16:44

score 2 · Answer 2 · answered Dec 26 '11 at 22:14

You are not doing yourself a favor if you want to match all these at once. Different configuration files have a different format.

For instance, you know that in a shell file, variable names cannot start with a digit and may contain only letters, digits and underscores. What's more, values can be quoted with either single quotes or double quotes, and escaping works differently in each case... And this is not to mention arithmetic evaluation etc.

So, just for shell variables, you have to make do with several regexes:

^([A-Za-z_]\w*)=(.*) and capturing $1, this gives you the variable name;
for $2, you have these possibilities

^"[^"]*(\\"[^"]*)*"$ # values in double quotes

^'[^']*('\\''[^']*)*'$ # values in single quotes

\$[A-Za-z_]\w*$ # simple variable interpolation

And this does not even take backtick values (which can be nested!!) into account (if they are not nested, then it is quite simple).

These are just a few regexes, and they won't even handle all cases.
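
As a rough sketch of how these patterns could be combined in Ruby: split the line into name and value first, then test the value against each form. The regexes are the ones listed above; the constant names, the classify helper and the :other fallback are hypothetical and only there for illustration.

```ruby
# Patterns copied from the answer; the constant names are made up.
ASSIGNMENT    = /^([A-Za-z_]\w*)=(.*)/
DOUBLE_QUOTED = /^"[^"]*(\\"[^"]*)*"$/
SINGLE_QUOTED = /^'[^']*('\\''[^']*)*'$/
VARIABLE      = /\$[A-Za-z_]\w*$/

# Hypothetical helper: returns [name, kind], or nil if the line is not
# a shell-style assignment.
def classify(line)
  m = ASSIGNMENT.match(line)
  return nil unless m

  name, value = m[1], m[2]
  kind = case value
         when DOUBLE_QUOTED then :double_quoted
         when SINGLE_QUOTED then :single_quoted
         when VARIABLE      then :variable
         else :other # bare words, backticks, arithmetic, ...
         end
  [name, kind]
end

p classify('PATH="/usr/local/bin"')   # ["PATH", :double_quoted]
p classify(%q{NAME='it'\''s'})        # ["NAME", :single_quoted]
p classify('HOME_COPY=$HOME')         # ["HOME_COPY", :variable]
p classify('RAILS_ENV=production')    # ["RAILS_ENV", :other]
p classify('9lives=no')               # nil
```

Anything that falls through to :other (including backtick values) would still need its own handling, which is the point made above.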
