5

I face a challenge to match the input in the following format:

  • The input consists of key=value pairs. The key starts with slash. The value may be a number or a string in quotes.
  • The value may optionally contain escaped quotes, that is quote following by a quote (""). Such escaped quote should be considered a part of value. There is no need to check that escaped quotes are balanced (e.g. ends by another escaped quote).

The regular expression should match the given key=value part of the sequence and should not break for long inputs (e.g. value is 10000 characters).

First I came to this solution:

/(\w+)=(\d+|"(?:""|[^"])+"(?!"))

and it performs not bad, however it fails in Java6 with StackOverflowError for long inputs (cashes regexplanet for example). I tried to improve it a bit to run faster:

/(\w+)=(\d+|"(?:""|[^"]+)+"(?!"))

but then if input is not matching, it enters endless loop in backtracking trying to match it.

Then I came to this regex:

/(\w+)=(\d+|".+?(?<!")(?:"")*"(?!"))

which is performing slower, but it seems to solve the task.

Can anyone suggest a better / faster regex?

Sample input:

/mol_type="protein" /transl_table=11 /note="[CDS] (""multi
line)"  nn  /organism="""Some"" Sequence" nn  /organism="Some ""Sequence"""
/translation="MHPSSSRIPHIAVVGVSAIFPGSLDAHGFWRDILSGTDLITDVPSTHWLVE
DYYDPDPSAPDKTYAKRGAFLKDVPFDPLEWGVPPSIVPATDTTQLLALIVAKRVLEDAAQGQFE
SMSRERMSVILGVTSAQELLASMVSRIQRPVWAKALRDLGYPEDEVKRACDKIAGNYVPWQESSF
PGLLGNVVAGRIANRLDLGGTNCVTDAACASSLSAMSMAINELALGQSDLVIAGGCDTMNDAFMY
MCFSKTPALSKSGDCRPFSDKADGTLLGEGIAMVALKRLDDAERDGDRVYAVIRGIGSSSDGRSK
SVYAPVPEGQAKALRRTYAAAGYGPETVELMEAHGTGTKAGDAAEFEGLRAMFDESGREDRQWCA
LGSVKSQIGHTKAAAGAAGLFKAIMALHHKVLPPTIKVDKPNPKLDIEKTAFYLNTQARPWIRPG
DHPRRASVSSFGFGGSNFHVALEEYTGPAPKAWRVRALPAELFLLSADTPAALADRARALAKEAE
VPEILRFLARESVLSFDASRPARLGLCATDEADLRKKLEQVAAHLEARPEQALSAPLVHCASGEA
PGRVAFLFPGQGSQYVGMGADALMTFDPARAAWDAAAGVAIADAPLHEVVFPRPVFSDEDRAAQE
ARLRETRWAQPAIGATSLAHLALLAALGVRAEAFAGHSFGEITALHAAGALSAADLLRVARRRGE
LRTLGQVVDHLRASLPAAGPAASASPAAAASVPKASTAAVPAVASVAAPGAAEVERVVMAVVAET
TGYPAEMLGLQMELESDLGIDSIKRVEILSAVRDRTPGLSEVDASALAQLRTLGQVVDHLRASLP
AASAGPAVAAPAAKAPAVAAPTGVSGATPGAAEVERVVMAVVAETTGYPAEMLGLQMELESDLGI
DSIKRVEILSAVRDRTPGLAEVDASALAQLRTLGQVVDHLRASLGPAAVTAGAAPAEPAEEPAST
PLGRWTLVEEPAPAAGLAMPGLFDAGTLVITGHDAIGPALVAALAARGIAAEYAPAVPRGARGAV
FLGGLRELATADAALAVHREAFLAAQAIAAKPALFVTVQDTGGDFGLAGSDRAWVGGLPGLVKTA
ALEWPEASCRAIDLERAGRSDGELAEAIASELLSGGVELEIGLRADGRRTTPRSVRQDAQPGPLP
LGPSDVVVASGGARGVTAATLIALARASHARFALLGRTALEDEPAACRGADGEAALKAALVKAAT
SAGQRVTPAEIGRSVAKILANREVRATLDAIRAAGGEALYVPVDVNDARAVAAALDGVRGALGPV
TAIVHGAGVLADKLVAEKTVEQFERVFSTKVDGLRALLGATAGDPLKAIVLFSSIAARGGNKGQC
DYAMANEVLNKVAAAEAARRPGCRVKSLGWGPWQGGMVNAALEAHFAQLGVPLIPLAAGAKMLLD
ELCDASGDRGARGQGGAPPGAVELVLGAEPKALAAQGHGGRVALAVRADRATHPYLGDHAINGVP
VVPVVIALEWFARAARACRPDLVVTELRDVRVLRGIKLAAYESGGEVFRVDCREVSNGHGAVLAA
ELRGPQGALHYAATIQMQQPEGRVAPKGPAAPELGPWPAGGELYDGRTLFHGRDFQVIRRLDGVS
RDGIAGTVVGLREAGWVAQPWKTDPAALDGGLQLATLWTQHVLGGAALPMSVGALHTFAEGPSDG
PLRAVVRGQIVARDRTKADIAFVDDRGSLVAELRDVQYVLRPDTARGQA"
/note="primer of  Streptococcus pneumoniae

Expected output (from regexhero.net):

RegEx

dma_k
  • 10,431
  • 16
  • 76
  • 128
  • 2
    Have a look [here](http://stackoverflow.com/questions/17043454/using-regexes-how-to-efficiently-match-strings-between-double-quotes-with-embed)... – fge Apr 14 '14 at 12:02
  • what about `String.split` and friends (most likely not split since there can be escaped delimitters)? I'm pretty sure that most JSON parsers for example are not based on regular expressions. – zapl Apr 14 '14 at 12:04
  • @fge: Thanks for the link. Indeed it is on similar matter, but I don't need to check that escaped quotes are balanced, hence the template `E(S|E)*` is not optimal for me. – dma_k Apr 14 '14 at 13:57
  • @zapl: Splitting could be a good friend – you're welcome to post your idea as an answer. Beware that input string may contain some "junk" text in-between "key=value" parts (like `nn` in example), that was the main reason to approach the problem with regex. – dma_k Apr 14 '14 at 14:00

3 Answers3

3

In order to fail in a reasonable time you need, indeed, to avoid catastrophic backtracking. This can be done using atomic grouping (?>...):

/(\w+)=(\d+|"(?>(?>""|[^"]+)+)"(?!"))

# (?>(?>""|[^"]+)+)
(?>               # throw away the states created by (...)+
    (?>           # throw away the states created by [^"]+
        ""|[^"]+
    )+
)

Your issue when using (?:""|[^"]+)+ on a string that will never match, is linked to the fact that each time you match a new [^"] character the regex engine can choose to use the inner or outer + quantifier.

This leads to a lot of possibilities for backtracking, and before returning a failure the engine has to try them all.

We know that if we haven't found a match by the time the engine reaches the end, we never will: all we need to do is throw away the backtracking positions to avoid the issue, and that's what atomic grouping is for.

See a DEMO: 24 steps on failure, while preserving the speed on the successful cases (not a real benchmarking tool, but catastrophic backtracking would be pretty easy to spot)

Robin
  • 9,415
  • 3
  • 34
  • 45
2

How about this one:

/(\w+)=("(?:[^"]|"")*"|\d+)

(Note that the / is part of the regex here. Escape it as appropriate for your host language.)

If your regex engine supports it (Java does), make the * possessive:

/(\w+)=("(?:[^"]|"")*+"|\d+)

After some debugging the latter expression can be improved to:

/(\w+)=("(?:""|[^"]*+)*+"|\d++)

Note the double *+)*+ which allows matching contiguous text in one step while not being susceptible to catastrophic backtracking.

Tomalak
  • 332,285
  • 67
  • 532
  • 628
  • Thanks for the hint with possessive quantifiers – that is the key for solution. Your 2nd update is exactly the [solution proposed by Robin](http://stackoverflow.com/a/23059250/267197). – dma_k Apr 14 '14 at 14:10
  • @dma_k Yes, only that he did it with atomic groups. I wanted to keep my possessive quantifier approach. :) – Tomalak Apr 14 '14 at 14:20
  • As far as I know "[atomic grouping](http://perldoc.perl.org/perlre.html#Quantifiers)" is identical to "possessive quantifiers" concept (*See the independent subexpression (?>pattern) for more details; possessive quantifiers are just syntactic sugar for that construct*). If I could I would nominate two answers. – dma_k Apr 14 '14 at 14:30
  • 1
    They are conceptually equal. But you could use sth. like `\w++`, which is shorter than `(?>\w+)`. Also, to me possessive quantifiers are clearer to read, because visually they affect the "interesting" part - the quantifier. But that's a matter of preference (and of regex engine capabilities, of course). – Tomalak Apr 14 '14 at 14:58
2

Your initial regex was already quite good, but it was more complicated than necessary, leading to catastrophic backtracking.

You should use

/(\w+)=(\d+|"(?:""|[^"])*"(?!"))

See it live on regex101.com.

Explanation:

/                # Slash
(\w+)            # Indentifier --> Group 1
=                # Equals sign
(                # Group 2:
 \d+             # Either a number
|                # or
 "(?:""|[^"])*"  # a quoted string
 (?!")           # unless another quote follows
)                # End of group 2
Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561
  • Thanks for the suggestion. I see that it is a tiny variation of mine, however (*unfortunately*) it crashes Java engine. It works well for PCRE and .NET, but alas, not for Java. – dma_k Apr 14 '14 at 12:17
  • @dma_k: I don't quite see how it would crash Java, since there is nothing there that could fail catastrophically (at least as far as I can see). How did you compile the regex? In Java, the escaping rules are rather complex; in this case, something like `Pattern regex = Pattern.compile("/(\\w+)=(\\d+|\"(?:\"\"|[^\"])*\"(?!\"))");` should work. Or can you post a sample string that this regex fails/crashes on? – Tim Pietzcker Apr 14 '14 at 12:34
  • I am also surprised that Java engine cannot cope with it, but that is the situation. Here is the [screeshot of code](http://i59.fastpic.ru/big/2014/0414/53/23b4750c3acf3ed4b6c23fe0802d0f53.png) run on the approx the same input as in question at the moment when `StackOverflowError` was thrown. [regexplanet](http://www.regexplanet.com/advanced/java/index.html) crashes on it as well. – dma_k Apr 14 '14 at 14:22