The following regex removes lists and menus from webpage text (any run of five or more short, line-break-terminated lines):
String s = fileContent.replaceAll("(([A-Za-z&—:\\-\\/\\d ])*(\\n|\\r|\\r\\n)){5,}","");
It has worked without problems on tens of thousands of files. Today, it threw a StackOverflowError:
java.lang.StackOverflowError
    at java.util.regex.Pattern$5.isSatisfiedBy(Pattern.java:5251)
    at java.util.regex.Pattern$5.isSatisfiedBy(Pattern.java:5251)
    at java.util.regex.Pattern$5.isSatisfiedBy(Pattern.java:5251)
    at java.util.regex.Pattern$5.isSatisfiedBy(Pattern.java:5251)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3776)
    at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4435)
    at java.util.regex.Pattern$GroupCurly.match(Pattern.java:4405)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4658)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4785)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4717)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4717)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4568)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3798)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4658)
    at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4485)
    at java.util.regex.Pattern$GroupCurly.match(Pattern.java:4405)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4658)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4785)
The file it was processing contains hundreds of consecutive \r\n sequences. Other than that, I can't see anything unusual about it. Can someone explain which aspect of the expression and/or of Java's internal regex matching caused the error?
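For reference, the frames in the trace cycle through Pattern$Loop.match, which suggests the matcher recurses once per repetition of the outer group, so a long enough run of matching lines could exhaust the thread stack. Below is a minimal sketch of one possible workaround, not a confirmed fix for this exact file: the groups are made non-capturing and the quantifiers possessive, which in the OpenJDK implementation is matched with a loop rather than recursion. The class name `ListStripper` and the method `stripLists` are hypothetical. Note one deliberate difference from the original: `\r\n` is tried first in the alternation, so a CRLF counts as one line break instead of two.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the name and structure are illustrative only.
final class ListStripper {

    // Same character class as the original (\u2014 is the em dash it contains).
    // Assumptions: possessive quantifiers (*+ and {5,}+) and non-capturing groups
    // replace the greedy capturing ones; \r\n is listed before \n and \r.
    private static final Pattern LIST_BLOCK = Pattern.compile(
            "(?:[A-Za-z&\u2014:\\-/\\d ]*+(?:\\r\\n|\\n|\\r)){5,}+");

    // Removes every run of five or more consecutive "list-like" lines.
    static String stripLists(String text) {
        return LIST_BLOCK.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Build input resembling the failing file: hundreds of consecutive \r\n.
        StringBuilder sb = new StringBuilder("Intro, with a comma.\r\n");
        for (int i = 0; i < 500; i++) {
            sb.append("\r\n");
        }
        sb.append("Outro, with a comma.");

        // The whole run of line breaks is removed as a single match.
        System.out.println(stripLists(sb.toString()));
    }
}
```

Because neither the character class nor anything after the outer group can take back what a possessive quantifier has consumed, the matched text should be the same as with the greedy version apart from the CRLF counting noted above. Raising the thread stack size with the JVM's -Xss option is another commonly suggested stopgap, though it only moves the limit.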