0

How can I change the following code so it will not care about case?

public static String tagValue(String inHTML, String tag)
        throws DataNotFoundException {
    String value = null;
    Matcher m = null;
    int count = 0;
    try {
        String searchFor = "<" + tag + ">(.*?)</" + tag + ">";
        Pattern pattern = Pattern.compile(searchFor);
        m = pattern.matcher(inHTML);

        while (m.find()) {
            count++;
            return inHTML.substring(m.start(), m.end());
            // System.out.println(inHTML.substring(m.start(), m.end()));
        }
    } catch (Exception e) {
        throw new DataNotFoundException("Can't Find " + tag + "Tag.");
    }

    if (count == 0) {
        throw new DataNotFoundException("Can't Find " + tag + "Tag.");
    }

    return inHTML.substring(m.start(), m.end());
}
mKorbel
  • http://stackoverflow.com/questions/1102077/how-to-change-this-regular-expression-to-be-case-insenstive-looking-for-src-tag – Mat Sep 06 '11 at 17:52

3 Answers

6

Give the Pattern.CASE_INSENSITIVE flag to Pattern.compile:

String searchFor = "<" + tag + ">(.*?)</" + tag + ">";
Pattern pattern = Pattern.compile(searchFor, Pattern.CASE_INSENSITIVE);
m = pattern.matcher(inHTML);

(Oh, and consider parsing XML/HTML instead of using a regular expression to match a nonregular language.)
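For reference, here is a minimal, self-contained sketch of the method with the flag applied. `DataNotFoundException` below is a hypothetical stand-in for the asker's own class, and wrapping `tag` in `Pattern.quote` is an extra precaution that is not in the original code; it assumes tag names should be matched literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the asker's exception class.
class DataNotFoundException extends Exception {
    DataNotFoundException(String message) {
        super(message);
    }
}

class TagValueDemo {
    // Returns the first <tag>...</tag> span (tags included), ignoring ASCII case.
    static String tagValue(String inHTML, String tag) throws DataNotFoundException {
        // Added precaution: keep any regex metacharacters in 'tag' literal.
        String quoted = Pattern.quote(tag);
        Pattern pattern = Pattern.compile(
                "<" + quoted + ">(.*?)</" + quoted + ">",
                Pattern.CASE_INSENSITIVE);
        Matcher m = pattern.matcher(inHTML);
        if (m.find()) {
            // Same text as inHTML.substring(m.start(), m.end()).
            return m.group();
        }
        throw new DataNotFoundException("Can't find " + tag + " tag.");
    }

    public static void main(String[] args) throws DataNotFoundException {
        // Prints: <TITLE>Hello</title>
        System.out.println(tagValue("<html><TITLE>Hello</title></html>", "title"));
    }
}
```

A single `if (m.find())` replaces the loop and counter, since the original returned on the first match anyway.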

phihag
  • It won’t matter for this ASCII‐only example, but if you don’t add `UNICODE_CASE`, then you will mess up on things like Greek sigmas, since there are two lowercase versions and one uppercase version, and a case‐insensitive match of any one of them is required to match any and all three of them. – tchrist Sep 06 '11 at 18:15
1

First, read "Using regular expressions to parse HTML: why not?"

To answer your question though, in general, you can just put (?i) at the beginning of the regular expression:

String searchFor = "(?i)" + "<" + tag + ">(.*?)</" + tag + ">";

The Pattern Javadoc explains:

Case-insensitive matching can also be enabled via the embedded flag expression (?i).
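As a rough illustration of the embedded flag, the sketch below builds the same tag pattern with `(?i)` in front. The class and method names are made up for the example. It also shows why the flag goes at the beginning: an embedded flag only affects the part of the pattern that follows it.

```java
import java.util.regex.Pattern;

class EmbeddedFlagDemo {
    // Hypothetical helper: true if html contains <tag>...</tag> in any ASCII case.
    static boolean containsTag(String html, String tag) {
        String searchFor = "(?i)" + "<" + tag + ">(.*?)</" + tag + ">";
        return Pattern.compile(searchFor).matcher(html).find();
    }

    public static void main(String[] args) {
        System.out.println(containsTag("<B>bold</b>", "b"));      // true
        // The flag applies only from where it appears onward:
        System.out.println(Pattern.matches("a(?i)b", "aB"));      // true
        System.out.println(Pattern.matches("a(?i)b", "AB"));      // false
    }
}
```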

Since you're using Pattern.compile you can also just pass the CASE_INSENSITIVE flag:

String searchFor = "<" + tag + ">(.*?)</" + tag + ">";

Pattern pattern = Pattern.compile(searchFor, Pattern.CASE_INSENSITIVE);

You should also know what case-insensitive means in Java regular expressions:

By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this flag.

It looks like you're matching tags, so you only want US-ASCII.
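If you ever do need non-ASCII matching, the difference between the two flag combinations can be seen with a small sketch like this one. The helper name is invented for the example, and the Greek sigma is written as a `\u` escape to keep the source ASCII-only.

```java
import java.util.regex.Pattern;

class UnicodeCaseDemo {
    // Hypothetical helper: full-string match of input against regex with the given flags.
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lowerSigma = "\u03c3"; // GREEK SMALL LETTER SIGMA
        String upperSigma = "\u03a3"; // GREEK CAPITAL LETTER SIGMA
        int asciiOnly = Pattern.CASE_INSENSITIVE;
        int unicodeAware = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

        System.out.println(matches("b", "B", asciiOnly));                  // true
        System.out.println(matches(lowerSigma, upperSigma, asciiOnly));    // false
        System.out.println(matches(lowerSigma, upperSigma, unicodeAware)); // true
    }
}
```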

Mike Samuel
  • Thanks for catching the `UNICODE_CASE` flag; people always forget this. In hindsight it would perhaps have been better if the flags had been respectively named `ASCII_CASE_INSENSITIVE` and `UNICODE_SIMPLE_CASE_INSENSITIVE`, thereby leaving open the door for an eventual `UNICODE_FULL_CASE_INSENSITIVE` for when the engine is upgraded to do full casefolding instead of the merely simple casefolding it does now. (Note that Java’s `String` methods *do* do full casemapping, unlike its `Character` methods which do only simple casemapping.) – tchrist Sep 06 '11 at 18:19
  • @tchrist, Would the difference between full and simple be things like Turkish i's that are in the supplemental (I forget the terminology from the unicode spec) case mappings? – Mike Samuel Sep 06 '11 at 19:36
  • No, that’s locale-specific casing. Full casing is when you can get back an output string that differs in length (by code point count) from that of the input. – tchrist Sep 06 '11 at 21:09
  • @tchrist, Understood. For example, ÆON vs Aeon where the first uses an ae ligature. – Mike Samuel Sep 06 '11 at 22:53
  • Actually, the casefold of `Æ` is just `æ` (one code point), although the casefold of `ﬃ` (the single ligature code point U+FB03) is `ffi` (three code points). – tchrist Sep 07 '11 at 06:37
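The simple versus full distinction discussed in these comments can be seen with Java's own `Character` and `String` methods. This is only a small sketch; the class and method names are invented, and the sharp s (U+00DF) is chosen as one well-known character whose uppercase form is longer than the input.

```java
import java.util.Locale;

class CaseMappingDemo {
    // Simple mapping: one char in, one char out.
    static char simpleUpper(char c) {
        return Character.toUpperCase(c);
    }

    // Full mapping: the result may be longer than the input.
    static String fullUpper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        char sharpS = '\u00df'; // LATIN SMALL LETTER SHARP S

        // Character cannot expand to two chars, so the sharp s comes back unchanged.
        System.out.println(simpleUpper(sharpS) == sharpS);           // true

        // String performs full casemapping and expands it to "SS".
        System.out.println(fullUpper(String.valueOf(sharpS)));       // SS
    }
}
```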
1

You can also compile the pattern with the case-insensitive flag:

Pattern pattern = Pattern.compile(searchFor, Pattern.CASE_INSENSITIVE);
John B