Regex to match different groups depending on string content

Question

Requirement

I have a Postgres column containing values in two forms: personal and corporate names. Personal names contain a comma while corporate names do not.

_owner_titlecase
-------------------------
McCartney, James Paul
Lennon, John Winston Ono
Harrison, George
Starkey, Richard
The Beatles

I have to produce a query which abbreviates the personal names only, like this:

regexp_replace
-------------------------
McCartney, J P
Lennon, J W O
Harrison, G
Starkey, R
The Beatles

Background

After some performance tests, I've realised I can't use CASE to treat the two row types differently (as in CASE WHEN _owner_titlecase ~ ',' regexp_replace...). So I'm hoping there's a way to write a single regex which can treat the two types differently.

I previously asked about how to handle the initials part of personal names, and am now using (^\w+)|\Y\w regex as follows:

, regexp_replace(_owner_titlecase
    , '(^|;\s+)(\w+)|\Y\w'
    , '\1', 'g')

Now I've widened the scope to look at the corporate names, of course The Beatles is being abbreviated to The B.

The \Y is a Postgres regex character class which I learned about here which matches only at a point that is not the beginning or end of a word. While the special Postgres classes look like they could be of use in this situation, it would actually be useful to stick with regex features which are universal, so I can test them in places like regex101.com. Currently my only Postgres test platforms are somewhat awkward and provide no debugging help.

The whole story is that we have a CartoDB map upon which we want to superimpose a layer containing property owner names. Some properties are close together and the list of owner names can be long, hence the need to abbreviate.

I believe you may use `^([^,]*)$|(^|;\s+)(\w+)|\Y\w` pattern and replace with `'\1\2\3'` — Wiktor Stribiżew, Feb 02 '17 at 09:48
Thank you again @Wictor. This works for most personal names. It doesn't quite get `O'Brien, John Andrew` right. — OutstandingBill, Feb 02 '17 at 10:00
I maybe don't understand the problem correctly, but couldn't you just filter with a `where`-statement before like `WHERE name LIKE '%,%'` — Fallenhero, Feb 02 '17 at 10:00
@Fallenhero, that wouldn't work because it would stop the corporate names from appearing in the results, and hence on the map. — OutstandingBill, Feb 02 '17 at 10:02

Wiktor Stribiżew · Accepted Answer · 2017-02-02T10:26:23.020

1

I suggest that you should use

regexp_replace(_owner_titlecase,
     '^([^,]*)$|(^|;\s+)([\w\u0027]+)|\Y\w',
     '\1\2\3', 'g')

The point is that you only need to remove any word char that is preceded with a word char, and keep everything else. So, any exception (text you need to keep) can be added as a captured alternative branch before the pattern you need to remove.

The ^([^,]*)$ part only matches and captures a string that consists of 0+ chars other than ,, and with \1 you restore it in the replacement result.

edited Feb 02 '17 at 10:26

answered Feb 02 '17 at 10:03

Wiktor Stribiżew

607,720
39
448
563

The `O'Brien, Andrew John` problem (which comes back as `O'B, A J`) is easily fixed by extending the pattern slightly to `^([^,]*)$|(^|;\s+)([\w\u0027]+)|\Y[\w\u0027]`. I'm not sure which side of the `|` is responsible though. – OutstandingBill Feb 02 '17 at 10:19
1

I think the `\u0027` in the last branch is not necessary. See [`^([^,]*)$|(^|;\s+)([\w\u0027]+)|\Y\w` regex demo](https://regex101.com/r/CbydIO/1). It looks like that is enough. Names cannot have `O'` prefix. – Wiktor Stribiżew Feb 02 '17 at 10:22

Regex to match different groups depending on string content

1 Answers1