
I need to extract phone numbers from the input below based on the keywords LandLine and Mobile. A number can appear either before or after a keyword, so the pattern has to read in both directions. My current pattern does not extract all 3 numbers. Please assist.

Words: (LandLine|Mobile)

    String line = "i'm Joe my LandLine number is 987654321, another number 123456789 is my Mobile and wife Mobile number is 776655881";
            
    String pattern = "(Mobile|LandLine)([^\\d]*)(\\d{9})|"  //Forward read
                    +"(\\d{9})([^\\d]*)(Mobile|LandLine)";  //Backward read
    
    Pattern r = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);

    Matcher matcher = r.matcher(line);
    while(matcher.find()) {
        System.out.println(line.substring(matcher.start(), matcher.end()));
        
    }

Code output:

    LandLine number is 987654321
    123456789 is my Mobile and wife Mobile

Expected output:

    LandLine number is 987654321
    123456789 is my Mobile
    Mobile number is 776655881

Sadu

1 Answer


The pattern "(LandLine|Mobile)\\D*\\d{9}|\\d{9}.*?(LandLine|Mobile)" seems to fit the bill:

import java.util.Arrays;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

class Main {
    public static void main(String[] args) {
        var line = "i'm Joe my LandLine number is 987654321, another number 123456789 is my Mobile and wife Mobile number is 776655881";
        var pattern = "(LandLine|Mobile)\\D*\\d{9}|\\d{9}.*?(LandLine|Mobile)";
        var res = Pattern
            .compile(pattern)
            .matcher(line)
            .results()
            .map(MatchResult::group)
            .toArray(String[]::new);
        System.out.println(Arrays.toString(res));
    }
}

Output:

[LandLine number is 987654321, 123456789 is my Mobile, Mobile number is 776655881]

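The key change is in the second (backward) branch: the greedy [^\\d]* becomes the lazy .*?, so the match stops at the first LandLine or Mobile after the number instead of stretching to the last one. The other changes are cosmetic: \\D is shorthand for [^\\d], and the unused capture groups are dropped.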
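To make the greedy versus lazy difference concrete, here is a minimal sketch that runs only the backward branch on a shortened piece of the question's input. The class name `GreedyVsLazy` and the helper `matches` are made up for illustration, and both patterns use `\\D` so that the quantifier is the only thing that differs (the answer itself uses `.*?`).

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class GreedyVsLazy {
    // Hypothetical helper: collects every full match of regex in text, in order.
    static List<String> matches(String regex, String text) {
        return Pattern.compile(regex)
            .matcher(text)
            .results()                      // available since Java 9
            .map(MatchResult::group)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Shortened sample taken from the question's input.
        var text = "123456789 is my Mobile and wife Mobile number is 776655881";

        // Greedy: \D* consumes every non-digit it can, then backtracks only as
        // far as needed, so the match ends at the LAST keyword before the next digit.
        System.out.println(matches("\\d{9}\\D*(LandLine|Mobile)", text));
        // prints [123456789 is my Mobile and wife Mobile]

        // Lazy: \D*? expands one character at a time, so the match ends at the
        // FIRST keyword after the number.
        System.out.println(matches("\\d{9}\\D*?(LandLine|Mobile)", text));
        // prints [123456789 is my Mobile]
    }
}
```

With the greedy version, the second Mobile is swallowed by the first match, which is why the forward branch never gets a chance to pair it with 776655881.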

ggorlen
  • `@ggorlen` Thank you very much. It works as expected. Could you please explain in detail how the `.*?` makes it work? – Sadu Sep 28 '20 at 03:55
  • No problem. See [this](https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions) which explains it better than I could. – ggorlen Sep 28 '20 at 03:57
  • `@ggorlen` I modified the input and the logic breaks. New input: `"i'm Joe my LandLine number is 987654321, another number 123456789 is my Mobile. I stay in zipcode 888888888 and wife Mobile is 776655881."` Actual output: `LandLine number is 987654321`, `123456789 is my Mobile`, `888888888 and wife Mobile`. Expected output: `LandLine number is 987654321`, `123456789 is my Mobile`, `Mobile is 776655881`. The zip code gets picked up because it is also 9 digits long. – Sadu Sep 28 '20 at 04:11
  • Sounds like you have a complex language processing task. Regex isn't a very good tool for this sort of thing because if the domain isn't well-defined, there's a strong risk of running into an endless stream of edge cases like this. There's no real solution to this problem using regex, which simply isn't smart enough to know that `888888888` is not associated with `Mobile` and that `123456789` should be. You could hardcode a way around that one, but it'll fail elsewhere on unstructured text. Use a natural language processing library and analyze the sentences to determine their meaning. – ggorlen Sep 28 '20 at 04:34
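
For illustration only, the kind of hardcoded workaround the last comment warns about might look like the sketch below. The negative lookbehind `(?<!zipcode )` is an assumption added here, not something proposed in the thread, and the class name `ZipcodeWorkaround` is hypothetical. It handles the modified input from the comments and nothing more general.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ZipcodeWorkaround {
    // The answer's pattern plus a hardcoded negative lookbehind (an assumption
    // for this sketch) that rejects a 9-digit number directly preceded by the
    // literal text "zipcode ".
    static final Pattern PATTERN = Pattern.compile(
        "(LandLine|Mobile)\\D*\\d{9}|(?<!zipcode )\\d{9}.*?(LandLine|Mobile)");

    // Returns every full match in the text, in order of appearance.
    static List<String> extract(String text) {
        return PATTERN.matcher(text)
            .results()
            .map(MatchResult::group)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        var line = "i'm Joe my LandLine number is 987654321, another number 123456789 is my Mobile. "
            + "I stay in zipcode 888888888 and wife Mobile is 776655881.";
        System.out.println(extract(line));
        // prints [LandLine number is 987654321, 123456789 is my Mobile, Mobile is 776655881]
    }
}
```

Rewording the sentence to something like "postal code 888888888" defeats the lookbehind immediately, which is exactly the brittleness described above.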