In what order does the Bash parser process escapes and split words/tokens within a command line?

Question

I am trying to understand, once and for all, the Bash parser’s order of operations.

This wiki page claims the following order:

Read line.

Process/remove quotes.

Split on semicolons.

Process 'special operators', which according to the article are:

Command groupings and brace expansions, e.g. {…}.

Process substitutions, e.g. cmd1 <(cmd2).

Redirections.

Pipelines.

Perform expansions, which are not all listed, but should include:

Brace expansion, e.g. {1..3}. For some reason the article tucks this into previous stage.

Tilde expansion, e.g. ~root.

Parameter & variable expansion, e.g. ${var##*/}.

Arithmetic expansion, e.g. $((1+12)).

Command substitution, e.g. $(date).

Word splitting, that applies to the results of the expansions; uses $IFS.

Pathname expansion, or globbing, e.g. ls ?d*.

Word splitting, that applies to the whole line; does not use $IFS.

Execution.

This is not a quote, but paraphrased contents of the linked article.

Furthermore there are Bash man pages, and this SO answer claiming to be based on those pages. According to the answer, stages of command parsing are as follows:

initial word splitting

brace expansion

tilde expansion

parameter, variable and arithmetic expansion

command substitution

secondary word splitting

path expansion (aka globbing)

quote removal

Emphasis mine.

I am assuming that by “initial word splitting” the author means splitting of the entire line, and by “secondary word splitting” they mean splitting of the results of the expansions. This would entail that there are at least two distinct tokenization processes during command parsing.

Considering the ordering contradictions between the two sources, in what order is the input command line actually de-quoted and split into words/tokens, relative to the other operations performed?

EDIT NOTE:

To give context for parts of the answers: an earlier version of this question had a sub-question:

Why does cmd='var=foo';$cmd produce bash: var=foo: command not found?

Have you tried reading the [Bash manual](http://www.gnu.org/software/bash/manual/html_node/index.html)? — Barmar, Jan 12 '19 at 17:02
"Word splitting" is a very specific term in shell scripting, and it happens exactly once. It does not generally refer to splitting up a string. Parsing a string according to the shell grammar is not considered word splitting, nor is a command like `awk '{ print $2 }'`. — that other guy, Jan 12 '19 at 17:22
@CharlesDuffy Edited into a single question. Would you say it’s specific enough? — CBlew, Jan 12 '19 at 18:07
@cblew: I edited my answer in an attempt to conform to your new question. On the whole, it is better to ask new questions rather than invalidating already provided answers by changing the question they are supposedly answering. — rici, Jan 12 '19 at 18:57
@CBlew, ...frankly, it's a fair bit less clear to me what you're asking now than it was before. Why a *specific* command is parsed in a given way is a narrowly-scoped question amenable to canonical answers that are easy to evaluate for correctness. "Is my understanding of the bash parser correct?", with a multi-paragraph description, is hard to call narrow. — Charles Duffy, Jan 12 '19 at 20:31
@CBlew, ...from https://stackoverflow.com/help/dont-ask, note also the specification: *If you can imagine an entire book that answers your question, you’re asking too much*. I'm not sure a *complete* answer to this is less than a full textbook chapter (or the POSIX shell command language specification, which already exists as a complete document and makes little sense to reproduce here in less-authoritative forms). — Charles Duffy, Jan 12 '19 at 20:34
@CharlesDuffy, thank you for your input. I apologize for changing the question after you posted your answer. As it stands, I feel that my question is at the upper limit of what can be called ‘narrow.’ I’ve done my best to wrap my head around the subject, and wrote up a short summary for anyone who might follow my footsteps. I did it, because I wish I had it earlier. Please, comment if you find yourself in disagreement. — CBlew, Jan 12 '19 at 22:44
Same goes to @rici (can’t tag more than one additional user). — CBlew, Jan 12 '19 at 22:45

rici · Answer 1 · 2019-01-12T21:07:51.353

POSIX sets out a precise procedure for shell interpretation. However, most shells -- including bash -- add their own syntax extensions. Also, the standard doesn't insist that its algorithm actually be used; just that the end result is the same. So there are some differences between the standard algorithm and descriptions concerning individual shells. Nonetheless, the broad outline is the same.

It is important to understand the differences between tokenisation and word-splitting. Tokenisation divides the input into syntactically significant tokens, which are then used by the shell grammar to syntactically analyse the input. Syntactic tokens include things like semicolons and parentheses ("operators" in the terminology of the standard). One particular type of token is a WORD.

Tokenisation is, as noted by the standard, basically the first step in parsing the input (but, as noted below, it depends on the identification of quoted characters.)
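As a rough illustration of how quoting feeds into tokenisation (a sketch only; the variable names are made up for the example), the same characters produce two commands or one, depending on whether the semicolon is quoted:

```shell
# Unquoted ';' is an operator token: the shell sees two commands.
two_commands=$(echo a;echo b)

# Quoted ';' is an ordinary character inside a single WORD:
# one command with one argument.
one_command=$(echo 'a;echo b')

printf '%s\n' "$two_commands" "$one_command"
```

The first capture holds two lines of output, the second holds the literal text a;echo b.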

WORDs may be subsequently interpreted by applying various expansions. The precise set of expansions applied to each word depends on the grammatical context; not all words are treated the same. This is documented in the narrative text of the standard. One transformation which is applied to some WORDs is word-splitting, which splits one WORD into a list of WORDs based on the presence of field-separator characters, by default whitespace (and configurable by changing the value of the IFS shell variable). Word-splitting does not change the syntactic token type; indeed, by the time it happens, syntactic analysis is complete.

Not all WORDs are subject to word-splitting. In particular, word-splitting is not performed unless there was some expansion, and then only if the expansion was not inside double quotes. (And even then, not in all syntactic contexts.)
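A small sketch of these conditions, assuming the default IFS; count_args is a hypothetical helper introduced only for this example:

```shell
# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

var='a b c'

unquoted=$(count_args $var)     # unquoted expansion: split on IFS
quoted=$(count_args "$var")     # expansion inside double quotes: not split
literal=$(count_args 'a b c')   # no expansion at all: nothing to split
copy=$var                       # assignment context: not split even though unquoted

echo "unquoted=$unquoted quoted=$quoted literal=$literal copy=$copy"
```

Only the first call sees three arguments; the assignment keeps the whole string intact.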

The algorithm for dividing the input into tokens must be equivalent to that in the standard. This algorithm requires that it be known which characters have been quoted; most historical implementations do that by internally flagging each input character with a "quoted" bit. Whether or not the quoting characters are removed during tokenisation is somewhat implementation-dependent; the standard puts the quote removal step at the end but an implementation could do it earlier if the end result is identical.

Note that = is not an operator character, so it does not cause var=foo to be split into multiple tokens. However, tokens which start with an identifier followed by = are treated specially by the shell parser; they are later treated as parameter assignments. But, as mentioned above, word-splitting does not change the syntactic nature of a WORD, so WORDs resulting from word-splitting which happen to look like parameter assignments are not treated as such by the shell parser.
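The case from the original sub-question can be reproduced with a sketch like the following (the names status and var_state are assumptions made for the example):

```shell
unset var
cmd='var=foo'

# Typed literally, var=foo would be tokenised as an assignment. Arriving via
# expansion it is an ordinary WORD, so the shell tries to run a command
# with that name.
if $cmd 2>/dev/null; then
  status=0
else
  status=$?               # 127 means "command not found"
fi

var_state=${var-unset}    # var was never assigned
echo "status=$status var=$var_state"
```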

score 2 · Answer 2 · answered Jan 12 '19 at 17:54

The very first step in shell parsing is applying the shell grammar rules, which are required to provide a superset of the syntax specified in the POSIX shell command language grammar.

It's only in this initial stage where assignments can be detected, and only under very specific circumstances:

The ASSIGNMENT_WORD token must be produced by the parser (note that the parser runs only once, and does not rerun after any expansions have taken place!)
The = character itself, and the valid variable name preceding it, must not be quoted.

The parser is never rerun on expansion results without an explicit invocation of eval (or passing the results to another shell as code, or taking some comparable explicit action), so the results of an expansion will never generate an assignment if the operation did not parse as an assignment prior to that expansion taking place.
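For contrast, a minimal sketch of the explicit re-parsing routes mentioned here (variable names are illustrative; the child shell is assumed to be available as sh):

```shell
unset var
cmd='var=foo'

# eval passes the expanded text through the parser a second time, so the word
# var=foo now appears where an assignment can be recognised.
eval "$cmd"
after_eval=$var

# Handing the text to another shell as code has the same effect in that shell.
from_child=$(sh -c "$cmd"'; echo "$var"')

echo "after_eval=$after_eval from_child=$from_child"
```

Both routes should be used with care, since they run whatever code the string contains.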

score 1 · Answer 3 · answered Jan 12 '19 at 22:32

I agree that my question was asking for a lot, and I deeply appreciate all valuable input. My gratitude to @rici and @CharlesDuffy.

Below is a rough outline of how Bash interprets and executes code.

Stage 1: Line feed

The shell reads its input line by line.

Stage 2: Tokenization

Line is chopped into tokens — words and operators, delimited by metacharacters. Quoting (\, '…', "…") is respected, aliases are substituted, comments are removed. Token boundaries are recorded internally.

Metacharacters are: <space>, <tab>, <newline>, |, &, ;, (, ), <, >.
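A brief sketch of two tokenization details, assuming tr is available; the variable names are made up for the example:

```shell
# Operators delimit tokens even with no whitespace around them.
piped=$(echo a|tr a A)

# '#' starts a comment only at the beginning of a word.
set -- a#b
inside_word=$1            # the whole word "a#b" survives
set -- a #b
after_space="$# $1"       # the comment is dropped, leaving one word

echo "$piped $inside_word $after_space"
```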

Stage 3: Command parsing

Line is parsed for pipelines, lists, and compound commands (loops, conditionals, groupings). This gives Bash the idea of the ordering in which it will carry out sub-commands. Each sub-command is then processed individually by its own parsing cycle.

Stage 4: Grammar

Assignments (those to the left of the command name) and redirections are removed and saved for later.

Stage 5: Expansions

Expansions are performed, in order:

Brace expansion, e.g. {1..3}.
Tilde expansion, e.g. ~root.
Parameter & variable expansion, e.g. ${var##*/}.
Arithmetic expansion, e.g. $((1+12)).
Command substitution, e.g. $(date).
Process substitution, where supported, e.g. cat <(ls).
Word splitting: applies to the results of unquoted expansions; uses the IFS variable for delimiters.
Filename expansion, or globbing, e.g. ls ?d*.
Quote removal: all unquoted \, ', and " that did not result from expansions are removed.
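Some consequences of this ordering can be observed directly. The following is a sketch assuming the default IFS; count_args and the variable names are hypothetical, and the brace syntax is Bash-specific:

```shell
# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# Brace expansion runs before parameter expansion, so a variable cannot
# supply the bounds of a range.
n=3
variable_range=$(echo {1..$n})       # the braces are left as literal text
literal_range=$(count_args {1..3})   # in Bash, a literal range yields 3 words

# Tilde expansion also precedes parameter expansion: a '~' that arrives
# through a variable stays literal.
t='~'
tilde_from_var=$(echo $t)

# Quote characters that result from an expansion are data, not syntax: they
# neither prevent word splitting nor get removed.
q='"a b"'
split_count=$(count_args $q)

echo "$variable_range | $literal_range | $tilde_from_var | $split_count"
```

The last case is why storing quoted arguments in a plain string variable does not behave like typing them on the command line.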

Stage 6: Redirections

Redirections are performed now, then removed. Previous redirections from pipelines may be overridden.

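If the line contains no command name, redirections are still performed (for example, > file creates or truncates file) but do not affect the current shell environment; otherwise they affect only said command.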
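A small sketch of both cases, assuming mktemp is available; the file and variable names are made up for the example:

```shell
tmp=$(mktemp -d)

# No command name: the redirection is still carried out (the file appears),
# but the shell's own stdout is left alone.
> "$tmp/empty"
if [ -f "$tmp/empty" ] && [ ! -s "$tmp/empty" ]; then created=yes; else created=no; fi

# With a command name: only that command's output is redirected.
echo hi > "$tmp/out"
captured=$(cat "$tmp/out")
later=$(echo again)       # a later command still writes to the normal stdout

echo "created=$created captured=$captured later=$later"
rm -r "$tmp"
```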

Stage 7: Assignments

Assignments are performed now, then removed. Their values (to the right of =) undergo:

tilde expansion,
parameter expansion,
command substitution,
arithmetic expansion,
quote removal.

If the line contains no command name, assignments affect the current shell environment; otherwise they exist only for said command.
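The difference between the two cases can be sketched as follows (variable names are illustrative; the child shell is assumed to be available as sh):

```shell
x=outer

# Assignment in front of a command name: placed in that command's
# environment only.
child=$(x=temp sh -c 'echo "$x"')
after_command=$x

# Assignment with no command name: changes the current shell.
x=changed
after_bare=$x

# The value is not word-split or globbed, even when the expansion is unquoted.
spaced='a   b *'
copy=$spaced

echo "$child $after_command $after_bare [$copy]"
```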

Stage 8: Command and arguments

At this point, if no command name results, the command exits.

Otherwise, the first word of the line becomes the command, the following words — arguments.

Stage 9: Execution

Now, to answer my question.

As follows from the above:

Tokenization occurs in stage 2; word splitting occurs in stage 5 only (the assignment values expanded in stage 7 are not split). The two are different concepts.
Quotes (and backslashes) come into play in stage 2, and are generally removed in stage 5. For assignments, they live until stage 7.
Assignments are recognized in stage 4, so they can’t come from variable expansion, which occurs in stage 5.

On the whole, I think it is better to point people at the POSIX standard, which is precise and reasonably clear (although it needs to be read as a whole). Summaries found on the internet, including some of the ones you cited originally, are probably well-meaning, but they tend to miss details or fail to put the entire process in context. For example, word-splitting does not apply to redirections or variable assignments. (Also, word splitting applies to the result of unquoted expansion, not the unquoted result.) — rici, Jan 12 '19 at 23:30
What can be useful are specific questions, such as "Why did (this) result (that) rather than (what I expected)?" Those sorts of questions focus on real issues experienced in practice, and if well-worded can produce better search results. That's essentially the justification for SO's format and question guidelines. — rici, Jan 12 '19 at 23:35