I'm looking for a way to check whether a multiline string (extracted from a PDF) contains a certain letter combination that is not preceded by a specific prefix. Specifically, I'm trying to find strings that contain ARC but not NON-ARC.

I found this great example, "Regular expression for a string that does not start with a sequence", but it does not seem to work for my problem. With my pattern ^(?!NON\\-)ARC.* I get the expected result in a single-line test, but with real multiline input the pattern fails to match where it should (a false negative). Here is what I did:

import java.util.regex.Pattern;
import org.junit.Test; // assuming JUnit 4; with JUnit 5 use org.junit.jupiter.api.Test

@Test
public void testRegexLookAhead() {
    String strTestSimplePos = "ARC 0.1-1";
    String strTestSimpleNeg = "NON-ARC 3.4-1";

    String strTestRealPos = "HEADLINE\r\n" + "Subheader Author\r\n" + "ARC 0.1-1\r\n" + "20190211";
    String strTestRealNeg = "HEADLINE\r\n" + "Subheader Author\r\n" + "NON-ARC 0.1-1\r\n" + "20190211";

    // based on https://stackoverflow.com/questions/899422/regular-expression-for-a-string-that-does-not-start-with-a-sequence
    String regexNoNON = "^(?!NON\\-)ARC.*";

    // compiled without flags, so ^ only matches at the very start of the input
    Pattern noNONPatter = Pattern.compile(regexNoNON);

    System.out.println(noNONPatter.matcher(strTestSimplePos).find()); // true, OK
    System.out.println(noNONPatter.matcher(strTestSimpleNeg).find()); // false, OK
    System.out.println(noNONPatter.matcher(strTestRealPos).find()); // false but should be true -> does not work as intended
    System.out.println(noNONPatter.matcher(strTestRealNeg).find()); // false, OK
}

Would be glad if anyone can point out what went wrong...

Edit: This was marked as a duplicate of "How to use java regex to match a line". However, I wasn't trying to use a regex to match a line at all; I just needed a way to find a specific sequence (with a negative look-ahead) in multiline text input. One approach to solving the other question is also the solution to this one (compile the pattern with java.util.regex.Pattern.MULTILINE), but the questions are at best related.

ptstone

2 Answers

Try this Regex:

HEADLINE(?:(?!HEADLINE)[\s\S])*(?<!NON-)ARC(?:(?!HEADLINE)[\s\S])*

Explanation:

  • HEADLINE - matches the word HEADLINE
  • (?:(?!HEADLINE)[\s\S])* - matches 0+ characters (including line breaks), one at a time, as long as the word HEADLINE does not start at that position
  • (?<!NON-)ARC - matches the word ARC if it is not immediately preceded by NON-
  • (?:(?!HEADLINE)[\s\S])* - again matches 0+ characters (including line breaks), stopping before the next occurrence of the word HEADLINE
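
The explanation above can be tried out with a small Java sketch. It assumes, as this answer does, that every block of text starts with the literal word HEADLINE; the class name TemperedHeadlineDemo, the method blocksWithArc, and the sample input are made up for illustration and are not from the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemperedHeadlineDemo {
    // The regex from this answer, with backslashes doubled for a Java string literal.
    static final Pattern BLOCK_WITH_ARC = Pattern.compile(
            "HEADLINE(?:(?!HEADLINE)[\\s\\S])*(?<!NON-)ARC(?:(?!HEADLINE)[\\s\\S])*");

    // Returns every HEADLINE block that contains an ARC not immediately preceded by NON-.
    static List<String> blocksWithArc(String text) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCK_WITH_ARC.matcher(text);
        while (m.find()) {
            blocks.add(m.group());
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Hypothetical input: two blocks, only the first one has a plain ARC.
        String text = "HEADLINE\r\nSubheader Author\r\nARC 0.1-1\r\n20190211\r\n"
                + "HEADLINE\r\nOther Author\r\nNON-ARC 3.4-1\r\n20190212";
        System.out.println(blocksWithArc(text).size()); // 1
    }
}
```

Note that a block containing both NON-ARC and a separate plain ARC would still match, because the look-behind only checks the characters directly in front of each ARC.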
Gurmanjot Singh
    Thank you very much, I tried your demo. However, it's not exactly the solution I was looking for: the headline could be any headline, and the number of lines of text is inconsistent (pulled from a PDF rectangle area). It's really about finding all headers containing ARC (but not NON-ARC). I think the solution by ernest_k is what I was looking for... – ptstone Feb 11 '19 at 06:11

Your input strings span multiple lines and your pattern uses the caret (^), so you need to add the multiline flag:

Pattern.compile(regexNoNON, java.util.regex.Pattern.MULTILINE);

About MULTILINE:

Enables multiline mode.

In multiline mode the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions only match at the beginning and the end of the entire input sequence.
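
To see the effect of the flag, here is a minimal sketch that reuses the pattern and the two "real" test strings from the question. The class name MultilineDemo and the helper containsArcLine are made up for illustration.

```java
import java.util.regex.Pattern;

class MultilineDemo {
    // Same pattern as in the question; MULTILINE lets ^ match after each line terminator.
    static final Pattern NO_NON = Pattern.compile("^(?!NON\\-)ARC.*", Pattern.MULTILINE);

    // True if any line of the text starts with ARC.
    static boolean containsArcLine(String text) {
        return NO_NON.matcher(text).find();
    }

    public static void main(String[] args) {
        String pos = "HEADLINE\r\n" + "Subheader Author\r\n" + "ARC 0.1-1\r\n" + "20190211";
        String neg = "HEADLINE\r\n" + "Subheader Author\r\n" + "NON-ARC 0.1-1\r\n" + "20190211";
        System.out.println(containsArcLine(pos)); // true
        System.out.println(containsArcLine(neg)); // false
    }
}
```

With the flag set, the look-ahead is arguably redundant for this pattern: a line that begins with NON-ARC cannot also begin with ARC, so the anchor alone already excludes it.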

ernest_k
    Thank you, I had not used multiline patterns so far and was unaware of this overload of Pattern.compile. This solved my problem. – ptstone Feb 11 '19 at 06:12