0

I tried to split a line based on spaces not enclosed between double quotes.

My regex is

(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+

My Code

Pattern regex          = Pattern.compile("(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+");
Matcher regexMatcher   = regex.matcher(line);
List<String> rule      = new ArrayList<String>();

while(regexMatcher.find())
    rule.add(regexMatcher.group());

Input for which it is failed.

SecRule REQUEST_COOKIES|!REQUEST_COOKIES:/__utm/|REQUEST_COOKIES_NAMES|ARGS_NAMES|ARGS|XML:/* "(?i:\b(?:(?:s(?:t(?:d(?:dev(_pop|_samp)?)?|r(?:_to_date|cmp))|u(?:b(?:str(?:ing(_index)?)?|(?:dat|tim)e)|m)|e(?:c(?:_to_time|ond)|ssion_user)|ys(?:tem_user|date)|ha(1|2)?|oundex|chema|ig?n|pace|qrt)|i(?:s(null|_(free_lock|ipv4_compat|ipv4_mapped|ipv4|ipv6|not_null|not|null|used_lock))?|n(?:et6?_(aton|ntoa)|s(?:ert|tr)|terval)?|f(null)?)|u(?:n(?:compress(?:ed_length)?|ix_timestamp|hex)|tc_(date|time|timestamp)|p(?:datexml|per)|uid(_short)?|case|ser)|l(?:o(?:ca(?:l(timestamp)?|te)|g(2|10)?|ad_file|wer)|ast(_day|_insert_id)?|e(?:(?:as|f)t|ngth)|case|trim|pad|n)|t(?:ime(stamp|stampadd|stampdiff|diff|_format|_to_sec)?|o_(base64|days|seconds|n?char)|r(?:uncate|im)|an)|m(?:a(?:ke(?:_set|date)|ster_pos_wait|x)|i(?:(?:crosecon)?d|n(?:ute)?)|o(?:nth(name)?|d)|d5)|r(?:e(?:p(?:lace|eat)|lease_lock|verse)|o(?:w_count|und)|a(?:dians|nd)|ight|trim|pad)|f(?:i(?:eld(_in_set)?|nd_in_set)|rom_(base64|days|unixtime)|o(?:und_rows|rmat)|loor)|a(?:es_(?:de|en)crypt|s(?:cii(str)?|in)|dd(?:dat|tim)e|(?:co|b)s|tan2?|vg)|p(?:o(?:sition|w(er)?)|eriod_(add|diff)|rocedure_analyse|assword|i)|b(?:i(?:t_(?:length|count|x?or|and)|n(_to_num)?)|enchmark)|e(?:x(?:p(?:ort_set)?|tract(value)?)|nc(?:rypt|ode)|lt)|v(?:a(?:r(?:_(?:sam|po)p|iance)|lues)|ersion)|g(?:r(?:oup_conca|eates)t|et_(format|lock))|o(?:(?:ld_passwo)?rd|ct(et_length)?)|we(?:ek(day|ofyear)?|ight_string)|n(?:o(?:t_in|w)|ame_const|ullif)|(rawton?)?hex(toraw)?|qu(?:arter|ote)|(pg_)?sleep|year(week)?|d?count|xmltype|hour)\W*\(|\b(?:(?:s(?:elect\b(?:.{1,100}?\b(?:(?:length|count|top)\b.{1,100}?\bfrom|from\b.{1,100}?\bwhere)|.*?\b(?:d(?:ump\b.*\bfrom|ata_type)|(?:to_(?:numbe|cha)|inst)r))|p_(?:sqlexec|sp_replwritetovarbin|sp_help|addextendedproc|is_srvrolemember|prepare|sp_password|execute(?:sql)?|makewebtask|oacreate)|ql_(?:longvarchar|variant))|xp_(?:reg(?:re(?:movemultistring|ad)|delete(?:value|key)|enum(?:value|key)s|addmultistring|write)|terminate|xp_servicecontrol|xp_ntsec_enumdomains|xp_terminate_process|e(?:xecresultset|numdsn)|availablemedia|loginconfig|cmdshell|filelist|dirtree|makecab|ntsec)|u(?:nion\b.{1,100}?\bselect|tl_(?:file|http))|d(?:b(?:a_users|ms_java)|elete\b\W*?\bfrom)|group\b.*\bby\b.{1,100}?\bhaving|open(?:rowset|owa_util|query)|load\b\W*?\bdata\b.*\binfile|(?:n?varcha|tbcreato)r|autonomous_transaction)\b|i(?:n(?:to\b\W*?\b(?:dump|out)file|sert\b\W*?\binto|ner\b\W*?\bjoin)\b|(?:f(?:\b\W*?\(\W*?\bbenchmark|null\b)|snull\b)\W*?\()|print\b\W*?\@\@|cast\b\W*?\()|c(?:(?:ur(?:rent_(?:time(?:stamp)?|date|user)|(?:dat|tim)e)|h(?:ar(?:(?:acter)?_length|set)?|r)|iel(?:ing)?|ast|r32)\W*\(|o(?:(?:n(?:v(?:ert(?:_tz)?)?|cat(?:_ws)?|nection_id)|(?:mpres)?s|ercibility|alesce|t)\W*\(|llation\W*\(a))|d(?:(?:a(?:t(?:e(?:(_(add|format|sub))?|diff)|abase)|y(name|ofmonth|ofweek|ofyear)?)|e(?:(?:s_(de|en)cryp|faul)t|grees|code)|ump)\W*\(|bms_\w+\.\b)|(?:;\W*?\b(?:shutdown|drop)|\@\@version)\b|\butl_inaddr\b|\bsys_context\b|'(?:s(?:qloledb|a)|msdasql|dbo)'))" "phase:2,rev:'2',ver:'OWASP_CRS/2.2.9',maturity:'9',accuracy:'8',capture,t:none,t:urlDecodeUni,ctl:auditLogParts=+E,block,msg:'SQL Injection Attack',id:'950001',tag:'OWASP_CRS/WEB_ATTACK/SQL_INJECTION',tag:'WASCTC/WASC-19',tag:'OWASP_TOP_10/A1',tag:'OWASP_AppSensor/CIE1',tag:'PCI/6.5.2',logdata:'Matched Data: %{TX.0} found within %{MATCHED_VAR_NAME}: %{MATCHED_VAR}',severity:'2',setvar:'tx.msg=%{rule.msg}',setvar:tx.sql_injection_score=+%{tx.critical_anomaly_score},setvar:tx.anomaly_score=+%{tx.critical_anomaly_score},setvar:tx.%{rule.id}-OWASP_CRS/WEB_ATTACK/SQL_INJECTION-%{matched_var_name}=%{tx.0}

When i used this in java, some lines are separated successfully, but some lines are causing errors

Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4235)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)

Sample Input:

The "world \"is beautiful" but i "cannot see" it

Expected Output:

The
"world \" beautiful"
but
i
"cannot see"
it
Alan Moore
  • 73,866
  • 12
  • 100
  • 156
Krishna M
  • 1,135
  • 2
  • 16
  • 32
  • Please post a complete (but short) program, along with the input causing the error. – Bernhard Barker Feb 11 '14 at 12:21
  • 1
    @Dukeling: It is most likely caused by a long input. Java uses recursion in matching repetition. – nhahtdh Feb 11 '14 at 12:22
  • @nhahtdh I have updated the question. How to overcome the error. Whether there is any problem in my regex – Krishna M Feb 11 '14 at 12:25
  • @KrishnaM: There are actually 2 things to fix: 1) your current regex is not working correctly 2) the stack overflow error. Need some time to test. – nhahtdh Feb 11 '14 at 12:28
  • How about splitting on all spaces and putting pieces back together where piece 1 ends with " and piece 2 starts with " in a second step? That should be feasable without nasty regexs. – Fildor Feb 11 '14 at 12:29
  • Your regex has a fundamental flaw: both terms of the alternation can match the same text, and what is more you have a quantifier. In the event of a negative match, the regex engine will try each and every combination --> stack overflow – fge Feb 11 '14 at 12:33
  • @fge: Sir can u please explain briefly ? I couldn't get you clearly. – Krishna M Feb 11 '14 at 12:37
  • @fge: It is part of the problem, but the main thing is Java use recursion to match a repetition of a non-trivial *atom*. Backtracking also adds to the problem, though – nhahtdh Feb 11 '14 at 12:37
  • @KrishnaM: Can you add the expected result? – nhahtdh Feb 11 '14 at 12:37
  • @nhahtdh: Sir, I have added a sample input and expected output – Krishna M Feb 11 '14 at 12:41
  • @KrishnaM: I asked because your original input has this case that confuses me: Before `phase:2` is a ``\"``, and you are not inside a double quoted token here. I don't know what to do in this case. – nhahtdh Feb 11 '14 at 12:45
  • How about using this one, seems to split your input alright: http://stackoverflow.com/questions/5946471/splitting-at-space-if-not-between-quotes – StoopidDonut Feb 11 '14 at 12:54

4 Answers4

2

Your regex is broken:

(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+
  #### ####### ###
  |    |       ---------------- A dot
  |    ------------------------ Any character not "
  ----------------------------- A " (no need to put it in a character class)

At this point I stopped looking further, because I am sure this is not what you want.

By the way, I recommend writing the regex first and only then do the quoting (you could write yourself a tool that does this, it is purely mechanical: add one \ before every " and every \, then enclose in ""). Also, don't use character classes for single characters.

In fact, it appears what you're looking for are words, or strings. So, why don't you say just that.

You can use a top down approach:

REGEX = (WORD|STRING)
WORD  = \w+                  -- or \p{L} or something like that
STRING = "(SOMETHING)*"
SOMETHING = \\["\\]|[^\\"]    -- an escaped quote, an escaped backslash or 
                              -- something that is neither a backslash nor a quote

Now:

  1. Replace the uppercase words by their right hand sides
  2. Remove unnecessary (), if any
  3. Quote

You can test important sub-regexes separately, for example the STRING. Turned out I had several errors in my first version, and this even when writing unquoted! To write/discuss such a regex in the form java demands from the start is virtually impossible.

Ingo
  • 36,037
  • 5
  • 53
  • 100
2

Why does StackOverflowError occur?

In reference implementation of Pattern class (which comes with Oracle's JRE, OpenJDK, and a number of other JVMs), greedy and lazy quantifiers are implemented with recursion1 when the repeated pattern is non-trivial. Therefore, you will run into StackOverflowError when the input string is long enough.

1 Recursion is a quick but not scalable solution to allow backtracking in the pattern. Better implementation uses a data structure to store backtracking points (which basically converts recursive solution to iterative solution with stack).

Solution

The following regex should work:

"(?:\"(?:[^\"\\\\]++|\\\\.)*+\"|[^ \"]++)++"

Well, the regex is quite confusing with 2 layers of escaping: escaping in Java string literal and escaping in regex syntax.

The raw regex when you print the string out. My explanation will be based on the raw regex.

(?:"(?:[^"\\]++|\\.)*+"|[^ "]++)++

Explanation

Since you only care about what the whole regex matches, all the capturing groups (pattern) has been turned into non-capturing group (?:pattern) for efficiency.

The first alternative "(?:[^"\\]++|\\.)*+" matches a quoted string.

The second alternative [^ "]++ matches a sequence of character that doesn't contain space and double quote ".

(?:
   "             # Double quote
   (?:
      [^"\\]++   # A sequence of characters that are not " and \
      |          # OR
      \\.        # Escape sequence: \ followed by any character (except line terminators)
   )*+           # Match 0 or more of the sequences above (allows empty string)
   "             # Double quote
   |
   [^ "]++
)++

Since the regex is written so that there is no needs for backtracking, all quantifiers are made possessive. Since Pattern class implements possessive quantifier with a loop, instead of recursion as the case with greedy/lazy quantifiers, StackOverflowError will not occur.

I remove the need for backtracking by writing the regex so that it matches the correct string on first try:

  1. Since [^"\\] excludes the \, there is no way we can "steal" a \ from an escaping sequence, or "steal" a " and mess up the closing quote, we can safely advance ahead without backtracking. This explains the possessive quantifier here [^"\\]++. There is no need to assign a quantifier here, but I do this to reduce the work on the branching.

  2. Since both [^"\\]++ and \\. can't "steal" a " and mess up the closing quote, we can advance ahead without backtracking. This explains the possessive quantifier here (?:[^"\\]++|\\.)*+

  3. [^ "] can't start a quoted string, and it also can't match a space (delimiter). This is why we can use possessive quantifier on it.

  4. Since "(?:[^"\\]++|\\.)*+" and [^ "]++ can't mess up the match of each other, we can make the outer most quantifier possessive.

Example of a regex that doesn't match things correctly on first try and only get the correct result after backtracking would be ^([bcd]+:[ab]+)+$ against inputs such as b:ab:a. The first iteration will match b:ab, which cause the 2nd iteration to fail, then it backtracks and retry with the first iteration being b:a and then successfully match the whole string.

nhahtdh
  • 55,989
  • 15
  • 126
  • 162
0

Thanks for all your responses. Atlast i found my mistake. The actual reason is for stackoverflow is not my regex. My regex was correct. I used eclipse for coding. The actual reason for stackoverflow is my stack size. Intially my stacksize was 1Mb. I increased my stack size for the program in my eclipse and there was no error.

Java stack overflow error - how to increase the stack size in Eclipse?

UPDATE:

There is no need to change the stack size. As mentioned by nhahtdh, I have changed the regex to regex with possessive quantifier and there was no stackoverflow error.

My Regex is now ("([^\\"]|\\.)++"|[^\s]++)

To learn more about Possessive quantifier follow this link.

Community
  • 1
  • 1
Krishna M
  • 1,135
  • 2
  • 16
  • 32
  • Increasing stack size is at most a patchy solution. When you have a longer string, you will hit the limit of your stack. The better solution is always to reduce the stack usage (as Jon Skeet suggested in the cited post), in this case, via the use of possessive quantifier instead of plain greedy quantifier. – nhahtdh Feb 18 '14 at 20:13
  • I can get your point. But the reason my regex failed was due to the input string length. There were more than 2000 characters enclosed between a double quote. In that case I found no other solution other than increasing the stack size. **Is there any other way for matching 2000 characters enclosed in the double quote without increasing stack size ?** – Krishna M Feb 19 '14 at 00:52
  • Instead of using greedy quantifier, rewrite the regex to use possessive quantifier (it is not a direct switch in general cases, though). I explained in my answer `Since Pattern class implements possessive quantifier with a loop, instead of recursion as the case with greedy/lazy quantifiers, StackOverflowError will not occur.` – nhahtdh Feb 19 '14 at 04:23
  • And another thing. Your regex is incorrect. Ingo's post has pointed out why it is incorrect. Your regex does not parse escaped double quote `\"` correctly. – nhahtdh Feb 19 '14 at 06:54
  • @nhahtdh : I have changed the regex now and there is no stackoverflow error. Thanks for pointing my mistake. And there is no \" in my expression. I later learned that \ is used to denote that the rule continues in next line also. I have updated that in my problem statement. – Krishna M Feb 21 '14 at 07:08
  • You can drop the outer most `()`. If you pass the string `"(\"([^\\\\"]|\\\\.)++\"|[^\\s]++)"` to `Pattern.compile`, then it somewhat makes sense (just that I am not sure if you want to match an unpaired double quote). – nhahtdh Feb 21 '14 at 07:35
  • Whether extra () will affect the performance ? I will change it anyways – Krishna M Feb 21 '14 at 08:40
  • It will introduce some redundancy, so there will be performance hit, but I don't think it is noticeable. – nhahtdh Feb 21 '14 at 08:55
  • @nhahtdh: As I have updated the answer, is it possible to remove negative vote in my answer ? – Krishna M Feb 21 '14 at 08:58
-2

The first thing to try is to increase the stack size.

If that does not work, you might have hit a bug. You could try a different JVM and set up the JVM to use something other than OpenJDK for the class library, and tinker with the regex to see exactly what is triggering it.

Community
  • 1
  • 1
Miserable Variable
  • 28,432
  • 15
  • 72
  • 133
  • Different JVM won't fix the problem, as long as they are using the same code base for Java Class Library. Increasing the stack size should be the last resort. – nhahtdh Feb 18 '14 at 20:15
  • @nhahtdh can you explain why you are so sure that a bug in JVM cannot cause such problems? – Miserable Variable Feb 18 '14 at 21:53
  • You can say that it is a (long-running and a design) bug in Pattern class (which if part of Java Class Library) and I really doubt it is going to be fixed any time soon. Since Pattern uses recursion to match greedy and lazy quantifiers, it will run into StackOverflowError for large enough input. However, possessive quantifiers implements matching with a loop (since it does not allow backtracking), there will be no problem. As long as all the JVM uses the same implementation of Pattern class, you won't run away from the problem. – nhahtdh Feb 19 '14 at 04:26
  • Hmmm, you are referring to specific case of `Pattern` and related classes while I was referring to the general problem of `StackOverflowException` in standard classes. I don't think that merits a downvote, but thanks for you explanation. – Miserable Variable Feb 19 '14 at 05:02
  • I was wrong when I say all JVM uses the same implementation of Pattern class. There **are** some implementation of JVM that comes with their own implementation of Java Class Library (instead of Sun's/Oracle's). Just that there is no guarantee that they follow the reference implementation (some features may be absent). On a quick inspection, GNU Classpath seems to have good support and it uses data structure to backtrack instead of stack. – nhahtdh Feb 19 '14 at 06:41
  • @nhahtdh Once again, I think you are mistaking the currently available implementations for what are the theoretical possibilities. What I am saying is that when you get a `StackOverflowException` in a class in `JDK` you should try a different version of JDK. If you do not agree, I doubt I can convince you. – Miserable Variable Feb 19 '14 at 10:34
  • @nhahtdh initially you said this bug exists in *all* JVMs, which I cannot imagine how you can claim since it is theoretically not possible to know of all of them. Now you are saying it is Oracle JVMs. I think I have understood your point in both cases, why do you think I haven't? – Miserable Variable Feb 19 '14 at 15:21
  • 1
    This is bad advice anyway. The **first** thing one should try after getting a StackOverflowException is write less inefficient code. It's not the regex *engine* that's blowing the stack here, it's the regex *author*. – Alan Moore Mar 07 '14 at 06:22