I need help matching some number patterns using regex. I have thousands of 10-digit numbers and need to extract the ones that follow each of the patterns below.

Note: I don't need spaces in between.

Pattern 1: 3527 432 432, i.e. ABCD XYZ XYZ

Pattern 2: 3527 89 89 89, i.e. ABCD XY XY XY

Pattern 3: 35 35 35 8745, i.e. XY XY XY ABCD

Pattern 4: 5432 888 999, i.e. ABCD XXX YYY

Pattern 5: 5432 8888 99, i.e. ABCD XXXX YY

Pattern 6: 5432 33 44 22, i.e. ABCD XX YY ZZ

Any help is much appreciated.

I am a complete beginner in regex and know only the basics.

For patterns like 5432 8888 99, I am using a regex like \d\d\d\d8888\d\d and then manually finding the matching numbers in the list by changing 8888 to other digits, like 1111.

InSync
rahul
  • Read about [backreferences](https://stackoverflow.com/questions/21880127/have-trouble-understanding-capturing-groups-and-back-references). Your first pattern would be something like `\d{4}(\d{3})\1` or `\d{4}\s(\d{3})\s\1` (it's unclear if you need spaces). – markalex Jun 29 '23 at 16:36
  • I don't need spaces. And thank you Mark, surely I'll go through given link. – rahul Jun 29 '23 at 16:46
  • Do you need to do it 6 separate times, with list of results for every pattern, or in one run with all numbers that match at least one of the patterns put into same list? – markalex Jun 29 '23 at 18:28
  • Hello Mark, I need to do it 6 separate times and extract the matching results. – rahul Jun 29 '23 at 18:40
  • `\d{4}(\d{3})\1` this is working pretty well. Can you assist me with others as well? I know I am asking for too much but I need this and it will be a learning for me as well. Thanks again. – rahul Jun 29 '23 at 18:49

2 Answers

0

To check for same symbols, you can use backreferences.

In short, regex like (.)\1 will match two same symbols in a row, and (.)(.)\1\2 will match occurrences like abab.
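As a rough illustration of those two expressions, here is a minimal Java sketch. The class and method names (`BackrefDemo`, `hasDoubledChar`, `hasRepeatedPair`) are made up for this example; only `Pattern` and `Matcher` come from the standard library.

```java
import java.util.regex.Pattern;

public class BackrefDemo {
    // (.)\1 : any character, followed by the same character again.
    static boolean hasDoubledChar(String s) {
        return Pattern.compile("(.)\\1").matcher(s).find();
    }

    // (.)(.)\1\2 : two characters, followed by the same two in the same order.
    static boolean hasRepeatedPair(String s) {
        return Pattern.compile("(.)(.)\\1\\2").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasDoubledChar("hello"));  // contains "ll"
        System.out.println(hasDoubledChar("world"));  // no doubled character
        System.out.println(hasRepeatedPair("xabab")); // contains "abab"
        System.out.println(hasRepeatedPair("abba"));  // pair is mirrored, not repeated
    }
}
```

Note that `find()` looks for the pattern anywhere in the input, so these checks succeed as soon as any part of the string fits.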

Here are expressions for your cases:

  1. \d{4}(\d{3})\1
  2. \d{4}(\d{2})\1{2}
  3. (\d{2})\1{2}\d{4}
  4. \d{4}(\d)\1{2}(\d)\2{2}
  5. \d{4}(\d)\1{3}(\d)\2
  6. \d{4}(\d)\1(\d)\2(\d)\3

Explanation for the first one:

  • \d{4} matches any four digits,
  • (\d{3}) matches any three digits, and captures them into group #1,
  • \1 matches the exact content of group #1.

I hope that, based on this explanation and the general description of how backreferences work, the other expressions are clear.

Demo for the first one here.
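Since the numbers are to be extracted once per pattern, one way to apply the six expressions in Java is sketched below. It assumes every number is already a separate 10-digit string without spaces; `PatternSorter` and `extract` are hypothetical names chosen for this example, and the last entry in the sample list is an invented number that fits none of the patterns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PatternSorter {
    // The six expressions listed above, in order (index 0 holds pattern 1).
    static final Pattern[] PATTERNS = {
        Pattern.compile("\\d{4}(\\d{3})\\1"),
        Pattern.compile("\\d{4}(\\d{2})\\1{2}"),
        Pattern.compile("(\\d{2})\\1{2}\\d{4}"),
        Pattern.compile("\\d{4}(\\d)\\1{2}(\\d)\\2{2}"),
        Pattern.compile("\\d{4}(\\d)\\1{3}(\\d)\\2"),
        Pattern.compile("\\d{4}(\\d)\\1(\\d)\\2(\\d)\\3"),
    };

    // Returns the numbers that fit the given pattern (1 to 6).
    static List<String> extract(int patternNo, List<String> numbers) {
        Pattern p = PATTERNS[patternNo - 1];
        List<String> hits = new ArrayList<>();
        for (String n : numbers) {
            // matches() requires the whole string to fit, not just a part of it.
            if (p.matcher(n).matches()) hits.add(n);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> numbers = Arrays.asList(
            "3527432432", "3527898989", "3535358745",
            "5432888999", "5432888899", "5432334422", "1234567890");
        for (int i = 1; i <= PATTERNS.length; i++) {
            System.out.println("Pattern " + i + ": " + extract(i, numbers));
        }
    }
}
```

The expressions do not require the letters to stand for different digits, so the result lists can overlap: 5432888899 fits pattern 5 (XXXX YY) and also pattern 6 (XX YY ZZ with X equal to Y). If that is unwanted, the lists would need an extra filtering step.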

markalex
-2

The simplest approach would be to write a pattern for each of the formats,
and then join them with the | (alternation) character.

Patterns 1 and 4.

\d{4} \d{3} \d{3}

Patterns 2 and 6.

\d{4} \d{2} \d{2} \d{2}

Pattern 3.

\d{2} \d{2} \d{2} \d{4}

Pattern 5.

\d{4} \d{4} \d{2}

The final pattern would be,

\d{4} \d{3} \d{3}|\d{4} \d{2} \d{2} \d{2}|\d{2} \d{2} \d{2} \d{4}|\d{4} \d{4} \d{2}

Or, simplified to,

\d{4}(?: \d{3}){2}|\d{4}(?: \d{2}){3}|(?:\d{2} ){3}\d{4}|(?:\d{4} ){2}\d{2}

You could then use the Pattern and Matcher classes to obtain each value,
and the String#replace method to remove the spaces.

This presumes the values appear within a text and are each delimited by some other character.

I wouldn't rely on this pattern if the values are sequential and not delimited.

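import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sample input: the six example numbers, delimited by commas.
String string = "3527 432 432, 3527 89 89 89, 35 35 35 8745, 5432 888 999, 5432 8888 99, 5432 33 44 22";
Pattern pattern = Pattern.compile("\\d{4}(?: \\d{3}){2}|\\d{4}(?: \\d{2}){3}|(?:\\d{2} ){3}\\d{4}|(?:\\d{4} ){2}\\d{2}");
Matcher matcher = pattern.matcher(string);
// Print each match next to the same value with its spaces removed.
while (matcher.find())
    System.out.printf("%-20s = %s%n", matcher.group(), matcher.group().replace(" ", ""));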

Output

3527 432 432         = 3527432432
3527 89 89 89        = 3527898989
35 35 35 8745        = 3535358745
5432 888 999         = 5432888999
5432 8888 99         = 5432888899
5432 33 44 22        = 5432334422

Here is a link to the Wikipedia article on regular expressions.
Wikipedia – Regular expression.

Reilas
  • Pattern `\d{4} \d{3} \d{3}` will match not only `1234 567 567`, but also `1234 567 890`. Based on description in question, it doesn't look like that's what OP is looking for. – markalex Jun 29 '23 at 17:55
  • Hi Reilas, I did not understand `append them with the | character` statement. Correct me if I am wrong but as Mark has said `\d{4} \d{3} \d{3}` is giving me all results which I don't need. – rahul Jun 29 '23 at 18:47
  • @rahul, the `|` character is a regex syntax. Here is the _[Wikipedia article](https://en.wikipedia.org/wiki/Regular_expression#POSIX_extended)_. _'`|`, The choice (also known as alternation or set union) operator matches either the expression before or the expression after the operator. For example, abc|def matches "abc" or "def" ...'_. – Reilas Jun 29 '23 at 20:40
  • @rahul, now I see, you're saying _ABCD_ in terms of each digit. Sure, I'll re-write it. – Reilas Jun 30 '23 at 07:17