2

I am trying to make a regular expression for HTML tags. The regex I've created so far is <(/?)(\w+?)(\s(.*?))*?((/>)|>), when I tested it online it worked perfectly; but when I tested it using Java regex it sometimes throws StackOverFlowError and sometimes it doesn't.

I'am using this code for testing :

public static void parseHtml(String urlString){
    new Thread(new Runnable() {
        @Override
        public void run() {
            int count = 0;
            int count2 = 0;
            String htmlScript = downloadWebPage(urlString);
            Matcher matcher = Pattern.compile("<(/?)(\\w+?)(\\s(.*?))*?((/>)|>)",
                                              Pattern.DOTALL).matcher(htmlScript);
            while(matcher.find()) {
                System.out.println(matcher.group());
            }
        }
    }).start();
}

So, my question is : Why does Java's regex engine throws StackOverFlowError sometimes and sometimes it doesn't?

Note: I used the same test input (The same URL), and it threw the error, and tested it again later and it worked nicely.

logi-kal
  • 7,107
  • 6
  • 31
  • 43
Tarek Mohamed
  • 81
  • 1
  • 8
  • 2
    Did you read [this post](https://stackoverflow.com/questions/214741/what-is-a-stackoverflowerror)? Errors about memory space, like `StackOverflowError` or `OutOfMemoryException`, are generally unpredictable, and depend on cleanup processes. – MC Emperor Jul 01 '17 at 16:56
  • with what tag/file are you getting StackOverFlowError? could you post your test file? – elmigue017 Jul 01 '17 at 17:00
  • Thank you for your comment and no, but I have read many posts about regex and how does it work, and I 've tried to make the regex as greedy as possible to minimize the recursion processes, also I don't use any additional recursion calls beside the regex's engine, Am I missing something here ? – Tarek Mohamed Jul 01 '17 at 17:02
  • @MiguelTorres I am passing a web address to the method ex : [Link](https://stackoverflow.com/questions/13684468/java-runnable-run-method-returning-a-value) – Tarek Mohamed Jul 01 '17 at 17:05
  • @MCEmperor so is there a way to minimize the possibility of this error ? – Tarek Mohamed Jul 01 '17 at 17:11
  • @TarekMohamed Perhaps. But can you post the stacktrace when the exception occurs? It might lead you to the problem line. – MC Emperor Jul 01 '17 at 17:14
  • @MCEmperor it happens randomly, when I tested it right now using [This link](https://stackoverflow.com/questions/13684468/java-runnable-run-method-returning-a-value), it worked nicely. – Tarek Mohamed Jul 01 '17 at 17:21
  • Is the web page content always the same, or do you get different results? Did you try to print the htmlScript length to verify if you always get the same result from the web page? – Christoph Bimminger Jul 01 '17 at 17:26
  • @ChristophBimminger I copied the script to [This site](http://regexr.com/) and compared the results of my regex with `<` and it returned the same number of matches -1 ( the ` `) – Tarek Mohamed Jul 01 '17 at 17:32
  • This `(\s(.*?))*?` part looks very inefficient since `.` also matches in part what `\s` matches. You might want to replace it with something like `(\s+[^\s=]+=(?:"[^"]*"|'[^']*'|\S+))*?`. However, tags and any other mark up languages should not be parsed with regex. There are parsers (or need to be written) for that. – Wiktor Stribiżew Jul 01 '17 at 17:42
  • @WiktorStribiżew thank you for your comment, but in `<(/?)(\w+?)(\s(.*?))*?((/>)|>)` when tested on [This link](http://regexr.com/) using the html from [This link](https://stackoverflow.com/questions/13684468/java-runnable-run-method-returning-a-value), it works fine, but when I removed the `\s` from the `(\s(.*?))*?` part it gave me a timeout error (Took longer than 250 ms to execute) – Tarek Mohamed Jul 01 '17 at 17:51
  • What I mean is *do not* use `<(/?)(\w+?)(\s(.*?))*?((/>)|>)`. It is a very poor pattern thaty will lead to issues. – Wiktor Stribiżew Jul 01 '17 at 17:56

2 Answers2

0

I think Java notoriously doesn't like alternations in certain circumstances
where there are potential backtrack problems.

So, this part (\s(.*?))*? creates an undo burden on the backtrack
mechanism.

 (                             # (3 start)
      \s
      ( .*? )                       # (4)
 )*?                           # (3 end)

Where the end result is a nested optional quantifiers.
It can be reduced to ([\S\s]*?) without having the nesting issues.

Also, this part ((/>)|>) can be reduced to (/?>) eliminating the need
for another stack frame via the alternation.

On a whole, you don't really need the capture groups.


If you just need to pars the tags, which is the initial level of html
parsing, then using regex is ok.

If you want to do more than parse individual tags, a DOM parser is needed.

I find this regex will parse all individual html/xml tags.

https://regex101.com/r/YXhCxe/1

"<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\\s+(?>\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+)?\\s*>)[\\S\\s]*?</\\1\\s*(?=>))|(?:/?[\\w:]+\\s*/?)|(?:[\\w:]+\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>"

Expanded

 <
 (?:
      (?:
           (?:
                # Invisible content; end tag req'd
                (                             # (1 start)
                     script
                  |  style
                  |  object
                  |  embed
                  |  applet
                  |  noframes
                  |  noscript
                  |  noembed 
                )                             # (1 end)
                (?:
                     \s+ 
                     (?>
                          " [\S\s]*? "
                       |  ' [\S\s]*? '
                       |  (?:
                               (?! /> )
                               [^>] 
                          )?
                     )+
                )?
                \s* >
           )

           [\S\s]*? </ \1 \s* 
           (?= > )
      )

   |  (?: /? [\w:]+ \s* /? )
   |  (?:
           [\w:]+ 
           \s+ 
           (?:
                " [\S\s]*? " 
             |  ' [\S\s]*? ' 
             |  [^>]? 
           )+
           \s* /?
      )
   |  \? [\S\s]*? \?
   |  (?:
           !
           (?:
                (?: DOCTYPE [\S\s]*? )
             |  (?: \[CDATA\[ [\S\s]*? \]\] )
             |  (?: -- [\S\s]*? -- )
             |  (?: ATTLIST [\S\s]*? )
             |  (?: ENTITY [\S\s]*? )
             |  (?: ELEMENT [\S\s]*? )
           )
      )
 )
 >
  • 1
    Thanks for that answer, it was really helpful, I made some modifications on my regex based on your answer and this is what I got so far `<(/?)(\w+?)([\S\s]*?)(/?>)`, I am using the Parentheses for testing groups. – Tarek Mohamed Jul 01 '17 at 22:45
-1

In base of your input I tested and it works fine, I can't reproduce the error, here is the implementation:

public static void main(String[] args) {
    new Thread(new Runnable() {
        @Override
        public void run() {
            String htmlScript = downloadWebPage("https://stackoverflow.com/questions/13684468/java-runnable-run-method-returning-a-value");
            Matcher matcher = Pattern.compile("<(/?)(\\w+?)(\\s(.*?))*?((/>)|>)",
                                              Pattern.DOTALL).matcher(htmlScript);
            while(matcher.find()) {
                System.out.println(matcher.group());
            }
        }
    }).start();

}

private static String downloadWebPage(String urlString) {
    StringBuilder sb = new StringBuilder();
    try {
        URL u = new URL(urlString);
        BufferedReader in = new BufferedReader(new InputStreamReader(u.openStream()));

        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            sb.append(inputLine);
        }
        in.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return sb.toString();
}

Here is the output: https://pastebin.com/s9DbBVBJ

elmigue017
  • 343
  • 2
  • 12