1

I am writing a string parser that extracts all string literals from a text file. The strings can be inside single or double quotes. Pretty simple, right? Well, not really. I wrote a regex that matches strings the way I want, but it throws a StackOverflowError on big strings (I am aware that Java's regex engine doesn't cope well with big strings). This is the regex pattern: (['"])(?:(?!\1|\\).|\\.)*\1

This works well for all the inputs I need, but as soon as there's a big string it throws a StackOverflowError. I have read similar questions, such as this one, which suggests using StringUtils.substringsBetween, but that fails on strings like '""' and "\\\"".

So my question is: what should I do to solve this issue? I can provide more context if needed; just comment.

Edit: After testing the answer

Code:

public static void main(String[] args) {

    final String regex = "'([^']*)'|\"(.*)\"";
    final String string = "local b = { [\"\\\\\"] = \"\\\\\\\\\", [\"\\\"\"] = \"\\\\\\\"\", [\"\\b\"] = \"\\\\b\", [\"\\f\"] = \"\\\\f\", [\"\\n\"] = \"\\\\n\", [\"\\r\"] = \"\\\\r\", [\"\\t\"] = \"\\\\t\" }\n" +
            "local c = { [\"\\\\/\"] = \"/\" }";

    final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
    final Matcher matcher = pattern.matcher(string);

    while (matcher.find()) {
        System.out.println("Full match: " + matcher.group(0));
        for (int i = 1; i <= matcher.groupCount(); i++) {
            System.out.println("Group " + i + ": " + matcher.group(i));
        }
    }
}

Output:

Full match: "\\"] = "\\\\", ["\""] = "\\\"", ["\b"] = "\\b", ["\f"] = "\\f", ["\n"] = "\\n", ["\r"] = "\\r", ["\t"] = "\\t"
Group 1: null
Group 2: \\"] = "\\\\", ["\""] = "\\\"", ["\b"] = "\\b", ["\f"] = "\\f", ["\n"] = "\\n", ["\r"] = "\\r", ["\t"] = "\\t
Full match: "\\/"] = "/"
Group 1: null
Group 2: \\/"] = "/

It's not handling the escaped quotes correctly.

SamHoque

2 Answers

0

For the overflow itself, you would probably need to allocate more resources (for example, a larger thread stack). You'd likely want to design small benchmark tests to find out how much is needed in practice for your task.

Another option would be to find some other strategy, or maybe another language, to solve your problem. For instance, you could classify your strings into two categories, those wrapped in ' and those wrapped in ", and look for a better solution for each.
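As one possible "other strategy", here is a minimal sketch of a hand-written scanner that walks the input once and uses no regex, so there is no regex recursion involved. The class name QuotedStringScanner and the method scan are made up for illustration. It assumes that a backslash always escapes the next character inside a literal, and, unlike the original pattern, it lets a literal span several lines.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedStringScanner {

    // Hypothetical helper: returns every quoted literal (quotes included).
    // Assumption: inside a literal, a backslash escapes the next character.
    public static List<String> scan(String input) {
        List<String> literals = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char quote = input.charAt(i);
            if (quote != '\'' && quote != '"') {
                i++;                       // not an opening quote, move on
                continue;
            }
            int start = i;
            i++;                           // step past the opening quote
            boolean closed = false;
            while (i < input.length()) {
                char c = input.charAt(i);
                if (c == '\\') {
                    i += 2;                // skip the backslash and the escaped char
                } else if (c == quote) {
                    i++;                   // step past the closing quote
                    closed = true;
                    break;
                } else {
                    i++;
                }
            }
            if (closed) {
                literals.add(input.substring(start, i));
            } else {
                i = start + 1;             // unterminated: retry after this quote
            }
        }
        return literals;
    }

    public static void main(String[] args) {
        // Input text is: local c = { ["\\/"] = "/" }
        for (String s : scan("local c = { [\"\\\\/\"] = \"/\" }")) {
            System.out.println(s);
        }
    }
}
```

Whether this fits depends on the grammar of the file being parsed; the escape rule above is a guess based on the sample inputs in the question.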

Otherwise, you might want to try a simpler expression that avoids back-references, such as:

'([^']*)'|"(.*)"

which would probably fail for some other inputs that you might have and that we don't know of.

Or maybe present your question in a slightly more technical way, so that experienced users might be able to provide better answers, such as this answer.

Test

import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class RegularExpression{

    public static void main(String[] args){

        final String regex = "'([^']*)'|\"(.*)\"";
        final String string = "'\"\"'\n"
             + "\"\\\\\\\"\"";

        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);

        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }

    }
}

Output

Full match: '""'
Group 1: ""
Group 2: null
Full match: "\\\""
Group 1: null
Group 2: \\\"

If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also see in this link how it matches against some sample inputs.


RegEx Circuit

jex.im visualizes regular expressions (image not reproduced here).

Emma
  • Hmm, this regex does break my parser. I'll see which string isn't working with it and report back. But the StackOverflowError is gone and strings are matching (even big ones). – SamHoque Oct 11 '19 at 02:49
  • Okay, so the problem with the regex is that it's treating `\\"` as an escaped quote, but it's an escaped backslash, not an escaped quote. I think that's the only problem right now with my inputs; the others seem to work fine. – SamHoque Oct 11 '19 at 03:00
  • Thanks for your efforts :) – SamHoque Oct 11 '19 at 23:03
0

I would try it without capturing the quote type and without the lookahead and backreference, to improve performance. See this question about escaped characters in quoted strings; it contains a nice answer that uses an unrolled pattern. Try something like:

'[^\\']*(?:\\.[^\\']*)*'|"[^\\"]*(?:\\.[^\\"]*)*"

As a Java String:

String regex = "'[^\\\\']*(?:\\\\.[^\\\\']*)*'|\"[^\\\\\"]*(?:\\\\.[^\\\\\"]*)*\"";

The left side handles single-quoted strings, the right side double-quoted ones. If one kind is more common than the other in your source, put that one on the left side of the pipe.

See this demo at regex101 (if you need to capture what's inside the quotes, use groups).
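For the capturing variant, a minimal sketch could look like the following. It is the same unrolled pattern with one group added around the body of each alternative; the class name UnrolledQuoteRegex and the helper contents are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnrolledQuoteRegex {

    // The unrolled pattern from above, with a capturing group around the
    // body of each alternative (group 1: single-quoted, group 2: double-quoted).
    private static final Pattern QUOTED = Pattern.compile(
            "'([^\\\\']*(?:\\\\.[^\\\\']*)*)'"
          + "|\"([^\\\\\"]*(?:\\\\.[^\\\\\"]*)*)\"");

    // Hypothetical helper: returns the text between the quotes of each literal.
    public static List<String> contents(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            // Exactly one of the two groups participates in each match.
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Input text is: local c = { ["\\/"] = "/" }
        String input = "local c = { [\"\\\\/\"] = \"/\" }";
        for (String s : contents(input)) {
            System.out.println(s);
        }
        // prints:
        // \\/
        // /
    }
}
```

Note that `.` does not match a line terminator unless Pattern.DOTALL is set, so a backslash followed by a raw newline inside a literal would not match with this sketch.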

bobble bubble