Writing a proper regular expression that handles multiple spaces

Question

I'm struggling to write a regular expression for PHP 7.4 that extracts the required information from a string.

Here are the sample strings:

Numer właściciela: NOWAKOWSKA                                              01-234 Warsaw
Numer właściciela: NOWAK_S6_2
Numer właściciela: KOWALSKA_S6_                                            01-234 Warsaw
Numer właściciela: NOWACKI S6_                                             01-234 Warsaw

What I want to extract is, respectively:

NOWAKOWSKA
NOWAK_S6_2
KOWALSKA_S6_
NOWACKI S6_

So far I have been using %^Numer właściciela:[[:space:]](?<owner_id>.+)$%imu, which worked fine for the example in row #2. However, the other cases (#1, #3, #4) turned up during the roll-out phase, and our text extraction is not accurate enough for them.

The problem is the spaces: the source text may contain a single space inside the value, and that space must be included in the result. Runs of repeated spaces, however, must not be included.

I tried playing around with conditionals and negative lookaheads to exclude multiple spaces, but without success.

Would really appreciate any help here.

Your strings seem to have fixed positions for its content. Why regex? Just extract the parts from index to index. — trincot, Feb 16 '23 at 10:20
@WiktorStribiżew it's almost perfect, thank you! However, it fails with the last expected result (the one that unfortunately has a space inside the expected result) — mkrowiarz, Feb 16 '23 at 10:29

Wiktor Stribiżew · Accepted Answer · 2023-02-16T10:47:40.060

In a general case, when you want to match sequences of chars separated with a single whitespace, you can use

/^Numer właściciela:\h*(?<owner_id>\S+(?:\h\S+)*)/imu

\h is preferred to \s here since you are extracting data from lines within a longer text, not from standalone strings: \s also matches line breaks, so a match could run past the end of a line.
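A minimal PHP sketch of how this pattern could be applied to a multi-line text with preg_match_all. The helper name extractOwnerIds and the sample text (with the padding shortened to an arbitrary run of spaces) are assumptions made for illustration, not part of the question.

```php
<?php
// Hypothetical helper: returns every owner id found in a multi-line text.
function extractOwnerIds(string $text): array
{
    // Chunks separated by exactly one horizontal whitespace stay together;
    // a run of two or more spaces ends the capture.
    $pattern = '/^Numer właściciela:\h*(?<owner_id>\S+(?:\h\S+)*)/imu';

    // preg_match_all returns 0 for no match and false on error;
    // both cases yield an empty list here.
    if (!preg_match_all($pattern, $text, $matches)) {
        return [];
    }
    return $matches['owner_id'];
}

// Illustrative sample; the padding width of 40 is an assumption.
$pad = str_repeat(' ', 40);
$sample = implode("\n", [
    "Numer właściciela: NOWAKOWSKA{$pad}01-234 Warsaw",
    "Numer właściciela: NOWAK_S6_2",
    "Numer właściciela: KOWALSKA_S6_{$pad}01-234 Warsaw",
    "Numer właściciela: NOWACKI S6_{$pad}01-234 Warsaw",
]);

print_r(extractOwnerIds($sample));
```

Because the m flag makes ^ match at every line start, a single call collects the ids from all lines of the text.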

If the strings you extract are all short, you may also use

/^Numer właściciela:\h*(?<owner_id>.*?)(?:\h{2}|$)/imu

This should be even more efficient, but only if the extracted strings are as short as those in the question: in strings of arbitrary length, .*? is usually as expensive as .*.
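A short PHP sketch of the lazy variant applied to single lines. The helper name extractOwnerIdLazy and the test lines are made up for illustration.

```php
<?php
// Hypothetical helper: extracts the owner id from one line,
// or returns null when the line does not match.
function extractOwnerIdLazy(string $line): ?string
{
    // Lazily consume chars until two consecutive horizontal whitespaces
    // or the end of the line, whichever comes first.
    $pattern = '/^Numer właściciela:\h*(?<owner_id>.*?)(?:\h{2}|$)/imu';

    return preg_match($pattern, $line, $m) === 1 ? $m['owner_id'] : null;
}

$padded = 'Numer właściciela: NOWACKI S6_' . str_repeat(' ', 40) . '01-234 Warsaw';

var_dump(extractOwnerIdLazy($padded));
var_dump(extractOwnerIdLazy('Numer właściciela: NOWAK_S6_2'));
var_dump(extractOwnerIdLazy('Some unrelated line'));
```

One difference worth keeping in mind: if a value is followed by a single trailing space and then the end of the line, this variant includes that space in the capture, whereas the \S+(?:\h\S+)* variant does not. Calling trim() on the result would be one way to even that out.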

Pattern details:

^ - start of a line (due to m flag)
Numer właściciela: - a literal string (replace the space with \h to match any horizontal whitespace)
\h* - zero or more horizontal whitespace chars
(?<owner_id>\S+(?:\h\S+)*) - Group "owner_id": one or more non-whitespace chars followed by zero or more sequences of a single horizontal whitespace and one or more non-whitespace chars.
(?<owner_id>.*?)(?:\h{2}|$) - Group "owner_id" that captures any zero or more chars other than line break chars, as few as possible, followed by either two horizontal whitespace chars or the end of a line.
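The breakdown above can also be kept next to the pattern itself. Below is a sketch of the first pattern in free-spacing mode (the extra x flag), which is an addition for illustration and not part of the original answer. In that mode, literal spaces in the pattern are ignored, so \h takes the place of the space in the label.

```php
<?php
// The first pattern in free-spacing mode (x flag), so each part carries a comment.
// A nowdoc keeps the backslashes and quotes exactly as written.
$annotated = <<<'REGEX'
/
^                       # start of a line (m flag)
Numer \h właściciela:   # the label, with \h in place of the literal space
\h*                     # zero or more horizontal whitespace chars
(?<owner_id>            # named group "owner_id"
    \S+                 #   one or more non-whitespace chars
    (?: \h \S+ )*       #   zero or more repeats of: one horizontal whitespace, then non-whitespace chars
)
/imux
REGEX;

preg_match($annotated, 'Numer właściciela: NOWACKI S6_     01-234 Warsaw', $m);
echo $m['owner_id'], PHP_EOL;
```

The comments deliberately avoid the / character, since PHP would treat an unescaped one as the closing delimiter.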

Thank you so much! That does the trick on the examples above, i'll try running it against all our test data sets. — mkrowiarz, Feb 16 '23 at 10:35
@mkrowiarz If the strings you extract are all short, you may also use `/^Numer właściciela:\h*(?<owner_id>.*?)(?:\h{2}|$)/imu`, it should be even more efficient. But only if they are that short as in the question. The `.*?` is usually as expensive as `.*` in strings of arbitrary length. — Wiktor Stribiżew, Feb 16 '23 at 10:43

Gilles Quénot · Answer 2 · 2023-02-16T10:48:42.580

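This regex uses a lookahead to stop the capture before a run of 20 or more whitespace chars, or at the end of the line: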

/^Numer właściciela:\s+(?<owner_id>.*?)(?=\s{20,}|$)/imu

edited Feb 16 '23 at 10:48

answered Feb 16 '23 at 10:40
