ts_parse()
You could use ts_parse()
with the 'default' text search parser:
test=> SELECT token
test-> FROM ts_parse ('default', 'I use Python, SQL, C++. I need: apples and oranges')
test-> WHERE tokid <> 12;
token
---------
I
use
Python
SQL
C
I
need
apples
and
oranges
(10 rows)
Unfortunately, the +
sign is also trimmed, being categorized as "blank" or "Space symbols" (tokid = 12
). That's how the 'default'
text search parser operates. You could write your own, but it's not trivial, must be done in the C language and as superuser.
Apart from the +
, it would be perfect, and much faster than regexp processing.
To see a complete list of token types known to the default parser:
SELECT * FROM ts_token_type('default');
To get a detailed debug view:
SELECT *
FROM ts_debug('I use Python, SQL, C++. I need: apples and oranges foo@sdf.at 234 0699/19477759 -132 -34.547')
Regular expression
While sticking to regular expressions, you can use regexp_split_to_table()
like Wolph suggested. I'd suggest this pattern for your request:
SELECT regexp_split_to_table('...', '[\s,;.:]+');
Takes any string consisting only of whitespace characters and your listed punctuation marks as delimiter.
\s
is a class shorthand for [[:space:]]
(whitespace characters) and can be included in a character class like demonstrated. Equivalent: '[[:space:],;.:]+'
. It covers most non-printing characters. But UNICODE sucks in this regard, and there are always more of them. See: