Splitting a string by a regular expression

Question

I have a string with words that sometimes have after them ,, ;, ., :. It is possible to have one or more spaces after these punctuation characters. I need to extract just the existent words.

For example, from:

'I use Python, SQL, C++. I need: apples and oranges'

I need to get:

I, use, Python, SQL, C++, I, need, apples, and, oranges

The split_part function can be used to split on a character. Is there a way to use this function (or a similar one) to get this splitting. I am thinking on a regular expression instead o the character but I am not sure which one to use.

score 4 · Accepted Answer · answered Jun 14 '21 at 19:16

4

Yes, a regex can easily do this :)

# SELECT regexp_split_to_table(
    'I use Python, SQL, C++. I need: apples and oranges',
    '[ .,:;]+');
┌───────────────────────┐
│ regexp_split_to_table │
├───────────────────────┤
│ I                     │
│ use                   │
│ Python                │
│ SQL                   │
│ C++                   │
│ I                     │
│ need                  │
│ apples                │
│ and                   │
│ oranges               │
└───────────────────────┘
(10 rows)

answered Jun 14 '21 at 19:16

Wolph

78,177
11
137
148

2

I'd recommend `'[,;.:]? +'` or `'[,;.:]?\s+'` as pattern to respect multiple spaces or white spaces in general. – sticky bit Jun 14 '21 at 19:21
@stickybit - why do you say that about respecting white space? – Vérace Nov 27 '21 at 16:24
@Vérace-getVACCINATEDNOW: Why not? It's more flexible, if one wants that. – sticky bit Nov 27 '21 at 16:36
@stickybit - I'd only go for that if I were **specifically** asked - still, it's a nice little one to remember *au cas où*. Regular expressions aren't my strongest suit - though I have been looking at them recently! As I said elsewhere, a Swiss-Army knife for strings, but abstruse, complex and obscure which, along with subtle differences in different environments' implementations, make them a propeller-head's wet-dream! – Vérace Nov 27 '21 at 16:48

score 1 · Answer 2 · answered Jun 14 '21 at 23:19

`ts_parse()`

You could use ts_parse() with the 'default' text search parser:

test=> SELECT token
test-> FROM   ts_parse ('default', 'I use Python, SQL, C++. I need: apples and oranges')
test-> WHERE  tokid <> 12;
  token  
---------
 I
 use
 Python
 SQL
 C
 I
 need
 apples
 and
 oranges
(10 rows)

Unfortunately, the + sign is also trimmed, being categorized as "blank" or "Space symbols" (tokid = 12). That's how the 'default' text search parser operates. You could write your own, but it's not trivial, must be done in the C language and as superuser.

Apart from the +, it would be perfect, and much faster than regexp processing.

To see a complete list of token types known to the default parser:

SELECT * FROM ts_token_type('default');

To get a detailed debug view:

SELECT *
FROM   ts_debug('I use Python, SQL, C++. I need: apples and oranges foo@sdf.at 234 0699/19477759 -132 -34.547')

Regular expression

While sticking to regular expressions, you can use regexp_split_to_table() like Wolph suggested. I'd suggest this pattern for your request:

SELECT regexp_split_to_table('...', '[\s,;.:]+');

Takes any string consisting only of whitespace characters and your listed punctuation marks as delimiter.

\s is a class shorthand for [[:space:]] (whitespace characters) and can be included in a character class like demonstrated. Equivalent: '[[:space:],;.:]+'. It covers most non-printing characters. But UNICODE sucks in this regard, and there are always more of them. See:

Trim trailing spaces with PostgreSQL

Splitting a string by a regular expression

2 Answers2

ts_parse()

Regular expression

`ts_parse()`