1

Requirement

I have a Postgres column containing values in two forms: personal and corporate names. Personal names contain a comma while corporate names do not.

_owner_titlecase
-------------------------
McCartney, James Paul
Lennon, John Winston Ono
Harrison, George
Starkey, Richard
The Beatles

I have to produce a query which abbreviates the personal names only, like this:

regexp_replace
-------------------------
McCartney, J P
Lennon, J W O
Harrison, G
Starkey, R
The Beatles

Background

After some performance tests, I've realised I can't use CASE to treat the two row types differently (as in CASE WHEN _owner_titlecase ~ ',' regexp_replace...). So I'm hoping there's a way to write a single regex which can treat the two types differently.

I previously asked about how to handle the initials part of personal names, and am now using (^\w+)|\Y\w regex as follows:

, regexp_replace(_owner_titlecase
    , '(^|;\s+)(\w+)|\Y\w'
    , '\1', 'g')

Now I've widened the scope to look at the corporate names, of course The Beatles is being abbreviated to The B.

The \Y is a Postgres regex character class which I learned about here which matches only at a point that is not the beginning or end of a word. While the special Postgres classes look like they could be of use in this situation, it would actually be useful to stick with regex features which are universal, so I can test them in places like regex101.com. Currently my only Postgres test platforms are somewhat awkward and provide no debugging help.

The whole story is that we have a CartoDB map upon which we want to superimpose a layer containing property owner names. Some properties are close together and the list of owner names can be long, hence the need to abbreviate.

Community
  • 1
  • 1
OutstandingBill
  • 2,614
  • 26
  • 38

1 Answers1

1

I suggest that you should use

regexp_replace(_owner_titlecase,
     '^([^,]*)$|(^|;\s+)([\w\u0027]+)|\Y\w',
     '\1\2\3', 'g')

The point is that you only need to remove any word char that is preceded with a word char, and keep everything else. So, any exception (text you need to keep) can be added as a captured alternative branch before the pattern you need to remove.

The ^([^,]*)$ part only matches and captures a string that consists of 0+ chars other than ,, and with \1 you restore it in the replacement result.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • The `O'Brien, Andrew John` problem (which comes back as `O'B, A J`) is easily fixed by extending the pattern slightly to `^([^,]*)$|(^|;\s+)([\w\u0027]+)|\Y[\w\u0027]`. I'm not sure which side of the `|` is responsible though. – OutstandingBill Feb 02 '17 at 10:19
  • 1
    I think the `\u0027` in the last branch is not necessary. See [`^([^,]*)$|(^|;\s+)([\w\u0027]+)|\Y\w` regex demo](https://regex101.com/r/CbydIO/1). It looks like that is enough. Names cannot have `O'` prefix. – Wiktor Stribiżew Feb 02 '17 at 10:22