I have used both this Java-specific regex tester and Regex101's tester, and both find all 4 matches of a line starting with # in the sample string below. The string data comes from a UTF-8 file.

#1

#2

#3

#4

But only #2, #3, and #4 match when I run the Java code below on Android. Edit: I have found that putting an empty line above #1 gets it matched, which explains why all the others match, since they all have empty lines above them.

Java code:

Pattern pattern = Pattern.compile("^#.*", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
Matcher matcher = pattern.matcher(text);

while (matcher.find()) {
    for (int i = 0; i <= matcher.groupCount(); i++) {
        String foundWord = matcher.group(i);
    }
}

It's as if matcher.find() is completely skipping the first line.

  • You can use `^[\t ]*#.*` to account for whitespace that could be present before the `#` – MonkeyZeus Aug 20 '19 at 17:49
  • @MonkeyZeus I tried it, but the result was the same as with the other pattern –  Aug 20 '19 at 18:15
  • @MonkeyZeus It was due to a white space before #. –  Aug 20 '19 at 18:25
  • What "other" pattern is it the same as? My pattern is different from yours because it accounts for leading whitespace, and this would solve your Java issue. – MonkeyZeus Aug 20 '19 at 19:08
  • @MonkeyZeus Never mind my previous comments. Your suggested pattern is working and in use now, but it isn't fixing the issue. I actually found the real problem causing line #1 not to be matched. See my edited question –  Aug 20 '19 at 22:14
  • I don't fully understand your issue. If you have bad data, then you need to either fix up the data beforehand or build a regex that can work around it. If you want useful help, then provide a useful example. You should add an actual sample of the data, with the tainted extras, and state what you expect to extract from it. I really don't feel like going on a wild goose chase. – MonkeyZeus Aug 20 '19 at 22:39
  • My guess is that the `text` comes from a text file written in UTF-8, and that the file starts with a BOM. Since Java does not strip the BOM and the character is invisible, the first line actually starts with the BOM character, and hence `^#` will not match there. To fix, search for [`java remove bom`](https://stackoverflow.com/search?q=java+remove+bom), but the easiest fix is to save the `.sql` file without a BOM, e.g. by using Notepad++ to remove it, see https://stackoverflow.com/a/28664104/5221149. – Andreas Aug 20 '19 at 23:03
  • @Andreas Yes, you're completely right. My text does come from a UTF-8 file and is loaded into a `String` variable. I will look into this "BOM", which I had never encountered before –  Aug 21 '19 at 06:35
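
As an illustration of the BOM explanation in the comments above, here is a minimal sketch that reproduces the symptom and then strips the mark in code instead of re-saving the file. The class name `BomDemo` and the helpers `stripBom` and `findHashLines` are hypothetical names chosen for this example, and the input string is hand-built rather than read from the asker's file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BomDemo {
    // The UTF-8 bytes EF BB BF decode to the single character U+FEFF.
    private static final char BOM = '\uFEFF';

    // Hypothetical helper: drops one leading BOM if the text starts with it.
    static String stripBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    // Collects every line that starts with '#', using the pattern from the question.
    static List<String> findHashLines(String text) {
        Pattern pattern = Pattern.compile("^#.*", Pattern.MULTILINE);
        Matcher matcher = pattern.matcher(text);
        List<String> lines = new ArrayList<>();
        while (matcher.find()) {
            lines.add(matcher.group());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed input: the sample from the question with a BOM in front.
        String text = BOM + "#1\n\n#2\n\n#3\n\n#4\n";

        System.out.println(findHashLines(text));           // prints [#2, #3, #4]
        System.out.println(findHashLines(stripBom(text))); // prints [#1, #2, #3, #4]
    }
}
```

This also fits the observation in the edited question: with an empty line above #1, the BOM sits alone on the first line, so `^` can match at the start of the `#1` line.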

1 Answer

Maybe you have a space or another character before #1. Your code produces the desired output:

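import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // Note the trailing space after #1; it is part of the first match.
        String text = "#1 \n" +
                "#2\n" +
                "#3\n" +
                "#4\n";

        Pattern pattern = Pattern.compile("^#.*", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        Matcher matcher = pattern.matcher(text);

        // Print every line that starts with '#'.
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}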

Output

#1 
#2
#3
#4
Butiri Dan
  • It was due to a space before `#1`, as you said. The text is just an example. The real content is larger and comes from a JSON string via an API. So I guess I have to clean out the whitespace before each new line or use a pattern that ignores it –  Aug 20 '19 at 18:19
  • @Muddz Great! You can try `"^[\\s#].*"`; based on the [Pattern doc](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html), **\s** matches a whitespace character: `[ \t\n\x0B\f\r]` – Butiri Dan Aug 20 '19 at 18:26
  • I just found out that a possible white space before a `#` line was not what caused the problem. Please see my edited question. –  Aug 20 '19 at 22:16
  • I'm sorry. Maybe it is a null character or a special one like `\0` `\r` `\n` `\r\n` `\f`. You can try `"^[\0#].*"` – Butiri Dan Aug 20 '19 at 23:24
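
Rather than guessing which invisible character is in front of the first `#`, one option is to print the first few code points of the string. The following is a rough sketch; the class name `FirstChars` and the helper `describePrefix` are hypothetical, and the sample input assumes a leading BOM purely for demonstration.

```java
public class FirstChars {
    // Hypothetical helper: renders the first `count` code points as U+XXXX values.
    static String describePrefix(String text, int count) {
        StringBuilder sb = new StringBuilder();
        int index = 0;
        int seen = 0;
        while (index < text.length() && seen < count) {
            int codePoint = text.codePointAt(index);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", codePoint));
            // Advance by one or two chars, depending on the code point.
            index += Character.charCount(codePoint);
            seen++;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed input: a string that starts with a BOM, then "#1".
        System.out.println(describePrefix("\uFEFF#1\n", 3)); // prints U+FEFF U+0023 U+0031
    }
}
```

If the first value is anything other than `U+0023` (the `#` character), that value identifies the character to strip or to allow for in the pattern.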