I have a code snippet to convert an input stream into a String. I then use java.util.regex.Matcher to find something inside the string.

The following works for me:

StringBuilder sb = new StringBuilder();
InputStream ins; // the InputStream data
BufferedReader br = new BufferedReader(new InputStreamReader(ins));
br.lines().forEach(sb::append);
br.close();

String data = sb.toString();
Pattern pattern = Pattern.compile(".*My_PATTERN:(.*)");
Matcher matcher = pattern.matcher(data);
if (matcher.find()) {
   String searchedStr = matcher.group(1); // I find a match here
}

But if I try to replace BufferedReader with Apache IOUtils, I do not find any matches with the same string.

InputStream ins; // the InputStream data
String data = IOUtils.toString(ins, StandardCharsets.UTF_8);

Pattern pattern = Pattern.compile(".*My_PATTERN:(.*)");
Matcher matcher = pattern.matcher(data);
if (matcher.find()) {
   String searchedStr = matcher.group(1); // I cannot find a match here
}

I have tried with other "StandardCharsets" apart from UTF-8 but none have worked.

I am unable to understand what is different here that would cause IOUtils to not match. Can someone kindly help me out here?

mang4521
  • Please include an example with a string literal. – erip Nov 09 '22 at 12:22
  • no idea what `IOUtils` does, but the first snippet is *removing* newlines; `IOUtils` probably not – user16320675 Nov 09 '22 at 12:54
  • @user16320675 what inside the first snippet is responsible for removal of new lines? – mang4521 Nov 09 '22 at 13:11
  • @erip it can be the content of any webpage [too big to share here]. For the same content, the first snippet can find the search string whereas the second one doesn't. – mang4521 Nov 09 '22 at 13:12
  • Yes, but we can't debug "any webpage" so we need an example that you've observed or can manufacture. – erip Nov 09 '22 at 13:20
  • Line breaks will be the problem. Remove the line breaks or try this pattern: "(?m).*?My_PATTERN:(.*)" – szeak Nov 09 '22 at 13:25
  • @erip I have attached the string for which the search is failing. As quoted above, the newline character might have something to do with this. – mang4521 Nov 09 '22 at 13:25
  • @szeak I tried this pattern: ```pattern = Pattern.compile("(?m).*?HTTPSTATUS:(.*)");```. The line breaks ("\n") were not removed and find() did not work with the second solution. Is there a way to remove line breaks? – mang4521 Nov 09 '22 at 13:30
  • `.lines()` will *retrieve* each line, using newline or line break as separator, EXCLUDING it from the returned line (try it yourself `new BufferedReader(new StringReader("PATTERN:\nnext line")).lines().collect(Collectors.joining())`) – user16320675 Nov 09 '22 at 13:38
  • @mang4521 try with "(?sm)..." or remove line breaks before matching: data = data.replaceAll("\\r?\\n", ""); – szeak Nov 09 '22 at 14:04

1 Answer

The first snippet removes line breaks; the second doesn't.
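The difference can be checked in isolation. The following is a minimal sketch, not the asker's actual data: `SAMPLE` is a hypothetical input, and `readWhole` uses `InputStream.readAllBytes` (Java 9+) only as a stand-in for `IOUtils.toString`, on the assumption that both hand back the decoded text unchanged.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class LineBreakDemo {
    // Hypothetical sample input; not taken from the original question.
    static final String SAMPLE = "header\nMy_PATTERN:first\nsecond\n";

    static InputStream sampleStream() {
        return new ByteArrayInputStream(SAMPLE.getBytes(StandardCharsets.UTF_8));
    }

    // Mirrors the first snippet: lines() yields each line WITHOUT its terminator,
    // so joining the lines produces one long line.
    static String readWithLines() {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(sampleStream(), StandardCharsets.UTF_8))) {
            return br.lines().collect(Collectors.joining());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stand-in (assumption) for IOUtils.toString: decodes every byte, line breaks included.
    static String readWhole() {
        try (InputStream in = sampleStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readWithLines().contains("\n")); // false
        System.out.println(readWhole().contains("\n"));     // true
    }
}
```

With the line breaks gone, `.` can run through the whole text; with them kept, it stops at the end of each line.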

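So the pattern has to match across line breaks. By default `.` does not match line terminators; the DOTALL flag (s) changes that, while MULTILINE (m) only affects `^` and `$`:

  1. In the pattern (starting with flags s=dotall, m=multiline)
Pattern pattern = Pattern.compile("(?sm).*My_PATTERN:(.*)");
  2. In the pattern, v2 (a character class that matches any character)
Pattern pattern = Pattern.compile("[\\s\\S]*My_PATTERN:([\\s\\S]*)");
  3. With flags
Pattern pattern = Pattern.compile(".*My_PATTERN:(.*)", Pattern.MULTILINE | Pattern.DOTALL);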

All three include line breaks in the group's value.
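As a rough check of the variants above, the sketch below runs them against a small hypothetical string (`data` and the helper `firstGroup` are illustrative names, not part of the question). On this sample the unflagged pattern still finds a match, but its group ends at the line break, whereas the DOTALL-style variants capture everything after the marker.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotallDemo {
    // The original pattern: '.' stops at line terminators.
    static final Pattern PLAIN = Pattern.compile(".*My_PATTERN:(.*)");
    // Variant 1: inline flags.
    static final Pattern INLINE = Pattern.compile("(?sm).*My_PATTERN:(.*)");
    // Variant 2: a character class that matches any character.
    static final Pattern CHAR_CLASS = Pattern.compile("[\\s\\S]*My_PATTERN:([\\s\\S]*)");
    // Variant 3: flags passed to compile().
    static final Pattern FLAGGED =
            Pattern.compile(".*My_PATTERN:(.*)", Pattern.MULTILINE | Pattern.DOTALL);

    // Hypothetical helper: group(1) of the first match, or null if find() fails.
    static String firstGroup(Pattern pattern, String data) {
        Matcher matcher = pattern.matcher(data);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String data = "header\nMy_PATTERN:first\nsecond"; // hypothetical input

        System.out.println("first".equals(firstGroup(PLAIN, data)));              // true
        System.out.println("first\nsecond".equals(firstGroup(INLINE, data)));     // true
        System.out.println("first\nsecond".equals(firstGroup(CHAR_CLASS, data))); // true
        System.out.println("first\nsecond".equals(firstGroup(FLAGGED, data)));    // true
    }
}
```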
Or remove the line breaks before matching, e.g.:

data = data.replaceAll("\\r?\\n", "");
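A small sketch of this approach follows, using hypothetical input strings. `stripCrLf` is the replacement shown above; `stripAny` is an assumed alternative, not from the original answer, that uses the `\R` linebreak matcher (Java 8+) so that a lone `\r` is removed as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StripLineBreaksDemo {
    // The replacement from the answer: removes "\n" and "\r\n".
    static String stripCrLf(String data) {
        return data.replaceAll("\\r?\\n", "");
    }

    // Assumed alternative: \R matches any linebreak sequence, including a lone "\r".
    static String stripAny(String data) {
        return data.replaceAll("\\R", "");
    }

    public static void main(String[] args) {
        String data = "header\r\nMy_PATTERN:first\r\nsecond"; // hypothetical input
        String flat = stripCrLf(data);

        Matcher matcher = Pattern.compile(".*My_PATTERN:(.*)").matcher(flat);
        if (matcher.find()) {
            System.out.println(matcher.group(1)); // firstsecond
        }

        // A lone carriage return survives the first variant but not the second.
        System.out.println(stripCrLf("a\rb").length()); // 3
        System.out.println(stripAny("a\rb").length());  // 2
    }
}
```

Note that stripping line breaks joins adjacent lines without a separator, which is the same thing the `BufferedReader.lines()` version does.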

See: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#compile(java.lang.String,%20int)

https://docs.oracle.com/javase/tutorial/essential/regex/pattern.html

szeak
  • Thank you. This works as expected. I do have a follow-up question: each of the patterns defined above [*1.*, *2.*, etc.] can be applied individually to exclude line breaks, correct? – mang4521 Nov 09 '22 at 14:14
  • @mang4521 The two flag solution should work in the same way. In theory, [\s\S] should give the same result. – szeak Nov 09 '22 at 14:22
  • @mang4521 Using the flags does not remove or exclude line breaks. It includes them in matching: the dot '.' then matches line-break characters as well. – szeak Nov 09 '22 at 14:42