1

I have a string with words that sometimes have after them ,, ;, ., :. It is possible to have one or more spaces after these punctuation characters. I need to extract just the existent words.

For example, from:

'I use Python, SQL, C++. I need: apples and oranges'

I need to get:

I, use, Python, SQL, C++, I, need, apples, and, oranges

The split_part function can be used to split on a character. Is there a way to use this function (or a similar one) to get this splitting. I am thinking on a regular expression instead o the character but I am not sure which one to use.

Jose Cabrera Zuniga
  • 2,348
  • 3
  • 31
  • 56

2 Answers2

4

Yes, a regex can easily do this :)

# SELECT regexp_split_to_table(
    'I use Python, SQL, C++. I need: apples and oranges',
    '[ .,:;]+');
┌───────────────────────┐
│ regexp_split_to_table │
├───────────────────────┤
│ I                     │
│ use                   │
│ Python                │
│ SQL                   │
│ C++                   │
│ I                     │
│ need                  │
│ apples                │
│ and                   │
│ oranges               │
└───────────────────────┘
(10 rows)
Wolph
  • 78,177
  • 11
  • 137
  • 148
  • 2
    I'd recommend `'[,;.:]? +'` or `'[,;.:]?\s+'` as pattern to respect multiple spaces or white spaces in general. – sticky bit Jun 14 '21 at 19:21
  • @stickybit - why do you say that about respecting white space? – Vérace Nov 27 '21 at 16:24
  • @Vérace-getVACCINATEDNOW: Why not? It's more flexible, if one wants that. – sticky bit Nov 27 '21 at 16:36
  • @stickybit - I'd only go for that if I were **specifically** asked - still, it's a nice little one to remember *au cas où*. Regular expressions aren't my strongest suit - though I have been looking at them recently! As I said elsewhere, a Swiss-Army knife for strings, but abstruse, complex and obscure which, along with subtle differences in different environments' implementations, make them a propeller-head's wet-dream! – Vérace Nov 27 '21 at 16:48
1

ts_parse()

You could use ts_parse() with the 'default' text search parser:

test=> SELECT token
test-> FROM   ts_parse ('default', 'I use Python, SQL, C++. I need: apples and oranges')
test-> WHERE  tokid <> 12;
  token  
---------
 I
 use
 Python
 SQL
 C
 I
 need
 apples
 and
 oranges
(10 rows)

Unfortunately, the + sign is also trimmed, being categorized as "blank" or "Space symbols" (tokid = 12). That's how the 'default' text search parser operates. You could write your own, but it's not trivial, must be done in the C language and as superuser.

Apart from the +, it would be perfect, and much faster than regexp processing.

To see a complete list of token types known to the default parser:

SELECT * FROM ts_token_type('default');

To get a detailed debug view:

SELECT *
FROM   ts_debug('I use Python, SQL, C++. I need: apples and oranges foo@sdf.at 234 0699/19477759 -132 -34.547')

Regular expression

While sticking to regular expressions, you can use regexp_split_to_table() like Wolph suggested. I'd suggest this pattern for your request:

SELECT regexp_split_to_table('...', '[\s,;.:]+');

Takes any string consisting only of whitespace characters and your listed punctuation marks as delimiter.

\s is a class shorthand for [[:space:]] (whitespace characters) and can be included in a character class like demonstrated. Equivalent: '[[:space:],;.:]+'. It covers most non-printing characters. But UNICODE sucks in this regard, and there are always more of them. See:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228