
How can I write a regex that matches a combination of digits and letters in any order? For example, I want to read some kind of invoice number, and I have examples like these:

example 1: 32ah3af
example 2: 32ahPP-A2ah3af
example 3: 3A63af-3HHx3B-APe5y5-9OPiis
example 4: 3A63af 3HHx3B APe5y5 9OPiis

So each 'block' is 3 to 7 characters long (letters or digits) in any order, and letters can be lowercase or uppercase. Each 'block' can start with a letter or a digit. There can be one block or at most 4 blocks, separated by ' ' or -.

I know that I can match the separators with \s or \-, but I have no idea how to match these kinds of blocks, which may or may not be followed by a separator.

I tried with something like this:

([0-9]?[A-z]?){3,7}

But it does not work

taga
  • [`\w{3,7}`](https://regex101.com/r/L26gna/1/) seems good enough? – Hao Wu Mar 30 '21 at 00:48
  • What is this data part of? It's going to be very difficult to write something that doesn't match practically everything. A pattern that matches your invoice numbers will also match "wishy-washy" and "some kind invoice number". – Tim Roberts Mar 30 '21 at 01:01

2 Answers


You could use

^[A-Za-z0-9]{3,7}(?:[ -][A-Za-z0-9]{3,7}){0,3}\b

The pattern matches:

  • ^ Start of string
  • [A-Za-z0-9]{3,7} Match 3-7 characters, each an upper- or lowercase letter a-z or a digit 0-9
  • (?: Non capture group
    • [ -][A-Za-z0-9]{3,7} Match either a space or - followed by 3-7 characters, each an upper- or lowercase letter a-z or a digit 0-9
  • ){0,3} Close the non capture group and repeat it 0-3 times, for a maximum of 4 blocks
  • \b A word boundary to prevent a partial match


Note that [A-z] matches more than [A-Za-z]: it also matches the characters [, \, ], ^, _ and ` that sit between Z and a in ASCII.
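
For illustration, here is a minimal Python sketch that runs the pattern above against the four examples from the question. The names `INVOICE_RE` and `extract_invoice` are hypothetical helpers chosen for this sketch, and it assumes the invoice number sits at the start of the string.

```python
import re

# Pattern from the answer: 1 to 4 blocks of 3-7 alphanumerics,
# separated by a single space or hyphen, ending at a word boundary.
INVOICE_RE = re.compile(r"^[A-Za-z0-9]{3,7}(?:[ -][A-Za-z0-9]{3,7}){0,3}\b")

samples = [
    "32ah3af",
    "32ahPP-A2ah3af",
    "3A63af-3HHx3B-APe5y5-9OPiis",
    "3A63af 3HHx3B APe5y5 9OPiis",
]

def extract_invoice(text):
    """Return the matched invoice number, or None if text does not start with one."""
    m = INVOICE_RE.match(text)
    return m.group(0) if m else None

for s in samples:
    print(repr(s), "->", repr(extract_invoice(s)))

# [A-z] also accepts the punctuation between 'Z' and 'a' in ASCII.
print(bool(re.fullmatch(r"[A-z]", "_")))     # True
print(bool(re.fullmatch(r"[A-Za-z]", "_")))  # False
```

Each of the four samples should come back unchanged, while a string shorter than 3 characters such as "ab" gives None.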

The fourth bird
  • It works, but there is one case that does not, for example: 325df j344j 25hi245bib245245hj. In this case it collects 325df j344j 25hi2, but it should collect only 325df j344j. Can you add a fix for that, please? I can also have this case: 325df j344j someotherlongword, where I need only the first 2 blocks. – taga Mar 30 '21 at 09:17
  • @taga Do you mean like this? `^[A-Za-z0-9]{3,7}(?:[ -][A-Za-z0-9]{3,7}\b){0,3}` https://regex101.com/r/GNDrku/1 – The fourth bird Mar 30 '21 at 09:18
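
A quick Python check of the variant from the comment above, using the two inputs taga mentioned. This is only a sketch; `INVOICE_PREFIX_RE` and `leading_blocks` are hypothetical names introduced here.

```python
import re

# Variant from the comment: the word boundary sits inside the group,
# so every optional block must end at a word boundary.
INVOICE_PREFIX_RE = re.compile(
    r"^[A-Za-z0-9]{3,7}(?:[ -][A-Za-z0-9]{3,7}\b){0,3}"
)

def leading_blocks(text):
    """Return the invoice-like prefix of text, or None if there is none."""
    m = INVOICE_PREFIX_RE.match(text)
    return m.group(0) if m else None

print(leading_blocks("325df j344j 25hi245bib245245hj"))  # 325df j344j
print(leading_blocks("325df j344j someotherlongword"))   # 325df j344j
```

Note that this variant has no \b after the first block, so an over-long first word such as 325df245bib would still yield its first 7 characters. If that matters for your data, placing a \b right after the first {3,7} is one way to reject it.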

As long as you only want to capture or search for the invoice IDs, the suggestion from Hao Wu is valid:

 r'\w{3,7}'

If you do not need the rest of the text, this should be enough.
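
As a rough sketch of what that looks like in Python with `re.findall` (the variable names are made up for this example). Keep in mind that \w also matches the underscore, and that this pattern does not check separators or the number of blocks.

```python
import re

BLOCK_RE = re.compile(r"\w{3,7}")

# Every run of 3-7 word characters is returned as its own block.
blocks = BLOCK_RE.findall("3A63af 3HHx3B APe5y5 9OPiis")
print(blocks)  # ['3A63af', '3HHx3B', 'APe5y5', '9OPiis']

# Longer words are not rejected, they are cut into chunks of up to 7.
chunks = BLOCK_RE.findall("someotherlongword")
print(chunks)  # ['someoth', 'erlongw', 'ord']
```

The second call shows the trade-off Tim Roberts pointed out in the comments: ordinary words match just as well as invoice blocks.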

You can capture the whole line more precisely, including the "example N:" prefix from your samples:

r'example (\d+): ((\w{3,7}[\- ]?)+)'

Please note how the capturing groups are represented: group 1 is the example number, group 2 is the whole invoice number, and group 3 is the last block that was matched.
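
A small Python sketch of how those groups come out, assuming the input line has the same "example N: ..." shape as in the question (`LINE_RE` and `line` are illustrative names only):

```python
import re

LINE_RE = re.compile(r"example (\d+): ((\w{3,7}[\- ]?)+)")

line = "example 3: 3A63af-3HHx3B-APe5y5-9OPiis"
m = LINE_RE.search(line)

number = m.group(1)      # the example number
invoice = m.group(2)     # the whole invoice number
last_block = m.group(3)  # a repeated group keeps only its last iteration

print(number)      # 3
print(invoice)     # 3A63af-3HHx3B-APe5y5-9OPiis
print(last_block)  # 9OPiis
```

Because the separator is optional and the repetition is unbounded, this pattern does not enforce the 4-block limit or the 7-character limit per block.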

sophros