I have a String like

    String str = "305556710S  or 100596269C OR CN111111111";

I want to match the tokens in this string that consist only of digits, or of digits followed by English letters, and then prefix each match with two "?" characters ("??"). I wrote a pattern like this:

    Pattern pattern = Pattern.compile("^[0-9]{1,10}[A-Z]{0,1}", Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(str);
    while (matcher.find()) {
        int start = matcher.start();
        int end = matcher.end();
        String matchStr = matcher.group();
        System.err.println(matchStr);
    }

But it only matches the first token, "305556710S". If I change the pattern to

    Pattern pattern = Pattern.compile("[0-9]{1,10}[A-Z]{0,1}", Pattern.CASE_INSENSITIVE);

it matches "305556710S", "100596269C", and "111111111". But "111111111" is preceded by the English letters "CN", which is not what I want. I only want to match "305556710S" and "100596269C" and add "??" before each match. Can somebody help me?

jaco0646
Ming
  • Does [this](https://stackoverflow.com/questions/1324676/what-is-a-word-boundary-in-regex) answer your question? – Sweeper Feb 25 '20 at 07:15
  • No, I only want to match tokens that start with a digit and end with an English letter, not tokens that start with English letters. – Ming Feb 25 '20 at 07:20

2 Answers

1

I think you need to use word boundaries \b. Try this changed pattern:

    "\\b[0-9]{1,10}[A-Z]{0,1}\\b"

This prints out:

305556710S
100596269C

Why it works:

  1. The pattern now matches only character sequences that sit between a pair of word boundaries. Your earlier pattern could match a sequence starting in the middle of a word, which is why the 11111... in CN1111... matched.
  2. A word boundary also matches at the end of the input when the last character is a word character. So a candidate word at the end of the line is still picked up.

If more than one English letter can appear at the end, remove the maximum occurrence count (1 in this case):

    "\\b[0-9]{1,10}[A-Z]{0,}\\b"
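
The question also asks for "??" in front of each match, which the loop above does not do yet. Here is a minimal sketch of one way to do it with the word-boundary pattern. The class name `WordBoundaryDemo` and the helper names `findTokens` and `prefixTokens` are made up for this example; only `Pattern`, `Matcher`, and `replaceAll` come from `java.util.regex`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Pattern from this answer: 1 to 10 digits, optional trailing letters,
    // with a word boundary on both sides.
    private static final Pattern TOKEN =
            Pattern.compile("\\b[0-9]{1,10}[A-Z]{0,}\\b", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: collects every match in order of appearance.
    public static List<String> findTokens(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    // Hypothetical helper: puts "??" in front of every match.
    // In a replacement string "$0" is the whole match; "?" has no special meaning there.
    public static String prefixTokens(String input) {
        return TOKEN.matcher(input).replaceAll("??$0");
    }

    public static void main(String[] args) {
        String str = "305556710S  or 100596269C OR CN111111111";
        System.out.println(findTokens(str));   // [305556710S, 100596269C]
        System.out.println(prefixTokens(str)); // ??305556710S  or ??100596269C OR CN111111111
    }
}
```

Using `replaceAll` with `$0` avoids rebuilding the string by hand from `start()` and `end()`.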
Sree Kumar
1

First, you should avoid ^ in this particular regexp. As you noticed, it cannot return more than one result, because "^" means "match at the beginning of the string".

Using \b can be a solution, but you may get unwanted matches. For example:

305556710S or -100596269C OR CN111111111

With that input, the regexp "\\b[0-9]{1,10}[A-Z]{0,}\\b" will still match 100596269C, because the hyphen is not a word character, so there is a word boundary between - and 1.
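
To see this for yourself, here is a small sketch that runs the word-boundary pattern against the hyphenated input. The class name `HyphenBoundaryDemo` and the helper `matchesWithWordBoundary` are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HyphenBoundaryDemo {
    // The word-boundary pattern discussed above.
    private static final Pattern WORD_BOUNDARY =
            Pattern.compile("\\b[0-9]{1,10}[A-Z]{0,}\\b", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns every match of the \b pattern.
    public static List<String> matchesWithWordBoundary(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = WORD_BOUNDARY.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // '-' is not a word character, so \b matches between '-' and '1'
        // and the token after the hyphen is still reported.
        String input = "305556710S or -100596269C OR CN111111111";
        System.out.println(matchesWithWordBoundary(input)); // [305556710S, 100596269C]
    }
}
```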

The following regexp matches exactly what you want: a run of digits, optionally followed by English letters, that is either at the beginning of the string or preceded by a space, and is either followed by a space or at the end of the string.

    (?<=^| )[0-9]{1,10}[A-Z]*(?= |$)

Explanations:

  1. (?<=^| ) is a lookbehind. It makes sure that there is either ^ (string start) or a space immediately before the current position. Note that lookbehinds don't add the characters they check to the result: the space won't be part of the match.
  2. [0-9]{1,10}[A-Z]* matches digits (at least one, up to ten), then zero or more letters.
  3. (?= |$) is a lookahead. It makes sure that either a space or $ (end of string) follows the match. As with lookbehinds, the characters aren't added to the result and the position doesn't advance: the space checked here can, for example, also be checked by the lookbehind of the next match.

Examples: with the string from the question (which has two spaces after 305556710S), the matches are: at index 0 [305556710S], at index 15 [100596269C]. 100596269C123 does not match.
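
Here is a rough sketch of how the lookaround pattern could be used in Java, including the "??" prefix asked for in the question. The class name `LookaroundDemo` and the helpers `findTokens`, `findStarts`, and `prefixTokens` are made up for this example, and `Pattern.CASE_INSENSITIVE` is carried over from the question's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundDemo {
    // Digits plus optional letters, preceded by string start or a space
    // and followed by a space or the end of the string.
    private static final Pattern TOKEN =
            Pattern.compile("(?<=^| )[0-9]{1,10}[A-Z]*(?= |$)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: the matched substrings, in order.
    public static List<String> findTokens(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    // Hypothetical helper: the start index of each match.
    public static List<Integer> findStarts(String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            starts.add(matcher.start());
        }
        return starts;
    }

    // Hypothetical helper: "$0" in the replacement stands for the whole match.
    public static String prefixTokens(String input) {
        return TOKEN.matcher(input).replaceAll("??$0");
    }

    public static void main(String[] args) {
        String str = "305556710S  or 100596269C OR CN111111111";
        System.out.println(findTokens(str));   // [305556710S, 100596269C]
        System.out.println(findStarts(str));   // [0, 15]
        System.out.println(prefixTokens(str)); // ??305556710S  or ??100596269C OR CN111111111

        // The hyphenated token is skipped because '-' is neither a space nor the string start.
        System.out.println(findTokens("305556710S or -100596269C OR CN111111111")); // [305556710S]
        System.out.println(findTokens("100596269C123")); // []
    }
}
```

The lookarounds only check the neighbouring characters, so the surrounding spaces are left untouched by `replaceAll`.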

David Amar
  • Wow, I did run into the problem you described. I think your answer is more comprehensive. Thank you very much for your answers. – Ming Feb 25 '20 at 07:52