2

I need to extract groups of words from a column in the database. Users saved the termini of bus lines in a pretty nasty way, and I need to extract them.
For example :

'Bétheny La Couturelle - Croix Cordier - Tinqueux Champ Paveau'  
- {Bétheny La Couturelle}  
- {Croix Cordier}    
- {Tinqueux Champ Paveau}  

I've tried this pattern:

,'([a-zA-Zéèàîùê]+(\s|\-)?)+', 'g');

ex :

select regexp_matches('Bétheny La Couturelle - Croix Cordier - Tinqueux Champ Paveau','([a-zA-Zéèàîùê]+(\s|\-)?)+','g')

The 'g' flag is there to capture every match, but it doesn't work.
All I obtained was:

- {e , }  
- {r , }  
- {u,NULL}  

How may I succeed ?
Thanks in advance.

GMB
  • 1
    That is because of capturing groups. You should use non-capturing ones. However, you should never use `(a+b?)+` or ``(?:a+b?)+`` patterns as they tend to lead to catastrophic backtracking. Always make sure (if possible) that patterns matching at the same locations are not following one another in immediate succession. – Wiktor Stribiżew Feb 05 '20 at 09:22

1 Answer

1

You may use

SELECT regexp_matches('Bétheny La Couturelle - Croix Cordier - Tinqueux Champ Paveau','[a-zA-Zéèàîùê]+(?:[\s-][a-zA-Zéèàîùê]+)*','g')

See the online demo.

Or, if the delimiter is always <spaces><-><spaces> you may use a splitting approach:

SELECT regexp_split_to_table('Bétheny La Couturelle - Croix Cordier - Tinqueux Champ Paveau', '\s+-\s+')

See another demo.

Pattern details

  • [a-zA-Zéèàîùê]+ - 1 or more letters in the character class
  • (?:[\s-][a-zA-Zéèàîùê]+)* - 0 or more sequences of
    • [\s-] - a whitespace or - (note it is equivalent to [[:space:]-])
    • [a-zA-Zéèàîùê]+ - 1 or more letters in the character class.
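
Because the pattern above contains no capturing groups, `regexp_matches` returns each whole match as a one-element text array. As a minimal sketch, assuming plain text values are wanted rather than arrays, the function can be called in the `FROM` clause and the first element selected (the aliases `m` and `stop` are only illustrative names):

```sql
-- Sketch: unwrap each one-element array returned by regexp_matches.
-- "m" and "stop" are hypothetical aliases, not part of the original answer.
SELECT m[1] AS stop
FROM regexp_matches(
         'Bétheny La Couturelle - Croix Cordier - Tinqueux Champ Paveau',
         '[a-zA-Zéèàîùê]+(?:[\s-][a-zA-Zéèàîùê]+)*',
         'g'
     ) AS m;
```

Each row should then hold one stop name as `text`, which tends to be easier to join or insert than a `text[]` value.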

In the splitting code, \s+-\s+ matches 1+ whitespaces, - and again 1+ whitespaces.
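
The question is about a database column rather than a literal string. A rough sketch of the splitting approach applied per row could look like the following; the `bus_lines` relation and its `line_id` and `terminus` columns are made up for illustration, and `WITH ORDINALITY` (PostgreSQL 9.4 and later) is an optional addition that keeps each stop's position within the line:

```sql
-- Sketch only: bus_lines, line_id and terminus are hypothetical names
-- standing in for the real table and column.
WITH bus_lines(line_id, terminus) AS (
    VALUES
        (1, 'Bétheny La Couturelle - Croix Cordier - Tinqueux Champ Paveau'),
        (2, 'Trois-Puits - Trépail')
)
SELECT b.line_id,
       t.position,   -- 1-based position of the stop within the line
       t.stop
FROM bus_lines AS b
CROSS JOIN LATERAL
     regexp_split_to_table(b.terminus, '\s+-\s+')
     WITH ORDINALITY AS t(stop, position)
ORDER BY b.line_id, t.position;
```

Since the separator requires whitespace on both sides of the hyphen, a name such as 'Trois-Puits' is left in one piece.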


Wiktor Stribiżew
  • Thank you. It works. Problem solved, I think. Just to clarify: does '?:' mean that I don't want to capture the space or '-'? But in the case of 'Trois-Puits - Trépail' I get 'Trois-Puits' and 'Trépail' (which is what I intend), so I'm a bit confused. PS: thanks for the link to the demo; I desperately tried to find an online demo for PostgreSQL rather than Java or PHP. – Brindavoine Feb 05 '20 at 09:59
  • @Brindavoine `(?:...)` is a [non-capturing group](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions). As per PostgreSQL docs, if the pattern contains parenthesized subexpressions, `regexp_matches` *returns a text array whose n'th element is the substring matching the n'th parenthesized subexpression of the pattern (not counting "non-capturing" parentheses)*. – Wiktor Stribiżew Feb 05 '20 at 10:04
  • 1
    @Brindavoine `[a-zA-Zéèàîùê]+` is a bracket expression that matches 1 or more occurrences of ASCII letters (`a-zA-Z` part does that) or some extended letters (from the `éèàîùê` set). The fact that `[\s-]` only matches 1 whitespace/hyphen and `[a-zA-Zéèàîùê]+` matches 1 or more letters makes it possible to avoid matching `space`-`hyphen`-`space` strings. – Wiktor Stribiżew Feb 05 '20 at 10:08
  • Just seen two functions in PostgreSQL for splitting. Would it be an option here? – PJProudhon Feb 05 '20 at 10:09
  • 2
    If it is always `<->`, try `regexp_split_to_table` with `'\s+-\s+'` – Wiktor Stribiżew Feb 05 '20 at 10:10