50

I am looking for a non-technical explanation of the difference between DFA vs NFA engines, based on their capabilities and limitations.

james.garriss
  • 12,959
  • 7
  • 83
  • 96
blunders
  • 3,619
  • 10
  • 43
  • 65
  • http://en.wikipedia.org/wiki/Deterministic_finite-state_machine – SilentGhost Oct 20 '10 at 13:40
  • 8
    @SilentGhost I know its not heavy on the math, but that article depends on one knowing all all the mathmatical symbolism they have not explained is. Like many wikipedia articles it was written by someone who knows the subject so well, the can't see it from a beginners perspective, and lets face it that's who's going to read the article the most. – Andrew S Mar 15 '14 at 02:05
  • 1
    [This is by far the best explanation I have seen of DFAs vs NFAs](https://swtch.com/~rsc/regexp/regexp1.html). TL;DR, DFAs and NFAs have the same capabilities and limitations but some implementations using NFAs have bad performance. – Timmmm Jan 20 '22 at 13:09

5 Answers5

83

Deterministic Finite Automatons (DFAs) and Nondeterministic Finite Automatons (NFAs) have exactly the same capabilities and limitations. The only difference is notational convenience.

A finite automaton is a processor that has states and reads input, each input character potentially setting it into another state. For example, a state might be "just read two Cs in a row" or "am starting a word". These are usually used for quick scans of text to find patterns, such as lexical scanning of source code to turn it into tokens.

A deterministic finite automaton is in one state at a time, which is implementable. A nondeterministic finite automaton can be in more than one state at a time: for example, in a language where identifiers can begin with a digit, there might be a state "reading a number" and another state "reading an identifier", and an NFA could be in both at the same time when reading something starting "123". Which state actually applies would depend on whether it encountered something not numeric before the end of the word.

Now, we can express "reading a number or identifier" as a state itself, and suddenly we don't need the NFA. If we express combinations of states in an NFA as states themselves, we've got a DFA with a lot more states than the NFA, but which does the same thing.

It's a matter of which is easier to read or write or deal with. DFAs are easier to understand per se, but NFAs are generally smaller.

Dan Bechard
  • 5,104
  • 3
  • 34
  • 51
David Thornley
  • 56,304
  • 9
  • 91
  • 158
  • 1
    Ok, a now-deleted answer confused NFA's with DFA's. I've seen people do that before, and apparently that's down to an otherwise useful book, or so this claims anyhow: http://fanf.livejournal.com/37166.html – Eamon Nerbonne Oct 20 '10 at 18:05
17

Here's a non-technical answer from Microsoft:

DFA engines run in linear time because they do not require backtracking (and thus they never test the same character twice). They can also guarantee matching the longest possible string. However, since a DFA engine contains only finite state, it cannot match a pattern with backreferences, and because it does not construct an explicit expansion, it cannot capture subexpressions.

Traditional NFA engines run so-called "greedy" match backtracking algorithms, testing all possible expansions of a regular expression in a specific order and accepting the first match. Because a traditional NFA constructs a specific expansion of the regular expression for a successful match, it can capture subexpression matches and matching backreferences. However, because a traditional NFA backtracks, it can visit exactly the same state multiple times if the state is arrived at over different paths. As a result, it can run exponentially slowly in the worst case. Because a traditional NFA accepts the first match it finds, it can also leave other (possibly longer) matches undiscovered.

POSIX NFA engines are like traditional NFA engines, except that they continue to backtrack until they can guarantee that they have found the longest match possible. As a result, a POSIX NFA engine is slower than a traditional NFA engine, and when using a POSIX NFA you cannot favor a shorter match over a longer one by changing the order of the backtracking search.

Traditional NFA engines are favored by programmers because they are more expressive than either DFA or POSIX NFA engines. Although in the worst case they can run slowly, you can steer them to find matches in linear or polynomial time using patterns that reduce ambiguities and limit backtracking.

[http://msdn.microsoft.com/en-us/library/0yzc2yb0.aspx]

james.garriss
  • 12,959
  • 7
  • 83
  • 96
  • 17
    The MSDN article is quite misleading; NFAs and DFAs are equally powerful. NFA algorithms do not require backtracking (which has worst case exponential behavior). The reason backtracking is needed is because "regular expressions" are much more powerful (e.g. backreferences) than regular languages and so they can't be modeled by the canonical NFA/DFAs. Example of well-implemented NFA algorithm that doesn't use backtracking: http://swtch.com/~rsc/regexp/regexp1.html – Rufflewind Jun 15 '13 at 01:12
  • On the topic of catastrophic backtracking: http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016 – Eamon Nerbonne Jul 20 '16 at 20:25
  • The MS article is in fact wrong if you stick with the standard definitions: An NFA does not backtrack at all. Certain backtracking automata are easier to implement if you start with an NFA, and maybe the article is referring to such an implementation. – toolforger Dec 01 '18 at 19:09
  • Here's an reference implementation of a non-backtracking NFA algorithm: https://github.com/iter-tools/regex. It's simplest to think of it as evolving the pattern. Consuming an input character eliminates some branches of the pattern that can no longer match, and alters (maybe) others that still can. If you started with the pattern `b|.a|abc`, applying input `a` would get you pattern `a|bc`. Applying input `b` would get you pattern `c`. At this point the pattern matches if the next character is `c`. – conartist6 Feb 02 '22 at 16:39
7

A simple, nontechnical explanation, paraphrased from Jeffrey Friedl's book Mastering Regular Expressions.

CAVEAT:

While this book is generally considered the "regex bible", there appears some controversy as to whether the distinction made here between DFA and NFA is actually correct. I'm not a computer scientist, and I don't understand most of the theory behind what really is a "regular" expression, deterministic or not. After the controversy started, I deleted this answer because of this, but since then it has been referenced in comments to other answers. I would be very interested in discussing this further - can it be that Friedl really is wrong? Or did I get Friedl wrong (but I reread that chapter yesterday evening, and it's just like I remembered...)?

Edit: It appears that Friedl and I are indeed wrong. Please check out Eamon's excellent comments below.


Original answer:

A DFA engine steps through the input string character by character and tries (and remembers) all possible ways the regex could match at this point. If it reaches the end of the string, it declares success.

Imagine the string AAB and the regex A*AB. We now step through our string letter by letter.

  1. A:

    • First branch: Can be matched by A*.
    • Second branch: Can be matched by ignoring the A* (zero repetitions are allowed) and using the second A in the regex.
  2. A:

    • First branch: Can be matched by expanding A*.
    • Second branch: Can't be matched by B. Second branch fails. But:
    • Third branch: Can be matched by not expanding A* and using the second A instead.
  3. B:

    • First branch: Can't be matched by expanding A* or by moving on in the regex to the next token A. First branch fails.
    • Third branch: Can be matched. Hooray!

A DFA engine never backtracks in the string.


An NFA engine steps through the regex token by token and tries all possible permutations on the string, backtracking if necessary. If it reaches the end of the regex, it declares success.

Imagine the same string and the same regex as before. We now step through our regex token by token:

  1. A*: Match AA. Remember the backtracking positions 0 (start of string) and 1.
  2. A: Doesn't match. But we have a backtracking position we can return to and try again. The regex engine steps back one character. Now A matches.
  3. B: Matches. End of regex reached (with one backtracking position to spare). Hooray!
Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561
  • **@Tim_Pietzcker:** Thanks for posting the NFA engine steps too. Deleted the other question... :-) – blunders Oct 20 '10 at 15:50
  • 3
    This answer is inaccurate - catastrophic backtracking is orthogonal to the whole NFA/DFA distinction. And what you describe as a DFA is actually an NFA (using the typical superposition of states) - DFA's are only ever in one state, hence "deterministic", and NFA's may be in multiple states, hence non-deterministic. – Eamon Nerbonne Oct 20 '10 at 16:18
  • 7
    In some sense, it's all just terminology. Having said that, DFA's are deterministic (and it's in the name) and NFA's are non-deterministic (again, as the name says). That has a pretty-straightforward reason: DFA are always in exactly one state, and when presented with a character, there's always one unique (deterministic) next state that character always corresponds to. So, you first explanation is a fine regex algorithm, but it's not a DFA - obviously, and as you describe, there can be multiple options and you never know which one is "best" until the string ends. – Eamon Nerbonne Oct 21 '10 at 10:18
  • 5
    Your second algorithm labeled **NFA engine** is indeed one possible NFA implementation. It resolves the same ambiguity are your first (also NFA) algorithm differently: namely by just picking one option and backtracking as needed. So, it is indeed an NFA, but it's not the *only possible* NFA, as your first method demonstrates: that deals with the same nondeterminism differently. I suppose you could call this a backtracking NFA engine, to distinguish the two. – Eamon Nerbonne Oct 21 '10 at 10:22
  • 3
    Finally, it's worth nothing that in *any* finite state automaton, as the name implies, the states are *finite* - more specifically, after embedding any relevant info into the state, that tuple still needs to have a finite number of options. And *that* means that strictly speaking, perl-compatible engines aren't really *any* type of FSA, neither DFA nor NFA: after all, you can include arbitrary-length backreferences, and there's an **infinite** number of arbitrary length strings. – Eamon Nerbonne Oct 21 '10 at 10:26
  • 3
    The distinction is critically important to performance because the infinite state space mean that you can't precompile the NFA nor efficiently execute it using the first algorithm. In the general case backtracking breaks when faces with regexes (and you come across these *in practice*) which cause catastrophic backtracking. – Eamon Nerbonne Oct 21 '10 at 10:32
  • 2
    So, there might be some non-NFA means of doing backreferences efficiently (I suspect that actually there is), but NFA can't be used to deal with backreferences. Strictly speaking they can't do it at all, and loosely speaking and permitting unbounded annotation they can't do it reliably. – Eamon Nerbonne Oct 21 '10 at 10:33
  • @Eamon: Thank you very much for this insightful information. It's a shame that upvoting your comments won't give you rep. – Tim Pietzcker Oct 21 '10 at 11:38
  • @Eamon Nerbonne: Sorry, but before I came to your explanation, everything was clear. You mean that both examples are NFAs. I sort of agree. I see why DFA can't handle all the advances (actually non-regular) regex stuff. What I'm missing is an example DFA for simple regexes. Or would you agree that the first example could be transformed to DFA by defining the DFA states as sets of the NFA states and just rephrasing the algorithm? – maaartinus Nov 14 '13 at 02:42
  • 1
    Yep, that's pretty much it. Wikipedia has [an example NFA](http://en.wikipedia.org/wiki/Nondeterministic_finite_automaton#Example) and [an example DFA](http://en.wikipedia.org/wiki/Deterministic_finite_automaton#Example). – Eamon Nerbonne Nov 14 '13 at 22:33
4

Both NFAs and DFAs are finite automata, as their names say.

Both can be represented as a starting state, a success (or "accept") state (or set of success states), and a state table listing transitions.

In the state table of a DFA, each <state₀, input> key will transit to one and only one state₁.

In the state table of an NFA, each <state₀, input> will transit to a set of states.

When you take a DFA, reset it to it's start state, give it a sequence of input symbols, and you will know exactly what end state it's in and whether it's a success state or not.

When you take an NFA, however, it will, for each input symbol, look up the set of possible result states, and (in theory) randomly, "nondeterministically," select one of them. If there exists a sequence of random selections which leads to one of the success states for that input string, then the NFA is said to succeed for that string. In other words, you are expected to pretend that it magically always selects the right one.

One early question in computing was whether NFAs were more powerful than DFAs, due to that magic, and the answer turned out to be no since any NFA could be translated into an equivalent DFA. Their capabilities and limitations are exactly precisely the same as one another.

For those wondering how real, non-magical, NFA engine can "magically" select the right successor state for a given symbol, this page describes the two common approaches.

BenGoldberg
  • 415
  • 3
  • 6
0

I find the explanation given in Regular Expressions, The Complete Tutorial by Jan Goyvaerts to be the most usable. See page 7 of this PDF:

https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf

Among other points made on page 7, There are two kinds of regular expression engines: text-directed engines, and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines, respectively. ...certain very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed engines.

RBV
  • 1,367
  • 1
  • 14
  • 33
  • I believe that reference was the basis of what is now [Regular-Expressions.info](https://www.regular-expressions.info) It's the same author and I recognize some of the phrasing. – Scratte Feb 04 '21 at 03:45