1
WITH t(x,y) AS (
    
    VALUES 
    (1,'[2]'),
    (2,'[1, 2]'),
    (3,'[2, 1]'),
    (4,'[3, 2, 5]'),
    (5,'[3, 2, 5, 2, 4]'),
    (6,'[3, 2, 2, 0, 4]')

)

--- my wrong answer below

SELECT

REGEXP_EXTRACT(y, '(\d+,\s)?(2)(,\s\d+)?') AS _1st,
REGEXP_EXTRACT(y,'(.*?(2)){1}.*?(\d+,\s(2)(,\s\d+)?)',3) AS _2nd,
REGEXP_EXTRACT(y,'(.*?(2)){2}.*?(\d+,\s(2)(,\s\d+)?)',3) AS _3rd

FROM t

Expected ans:
| x |        y        |   1st   |   2nd   |   nth   |
| - | --------------- | ------- | ------- | ------- |
| 1 | [2]             | 2       |         |         |
| 2 | [1, 2]          | 1, 2    |         |         |
| 3 | [2, 1]          | 2, 1    |         |         |
| 4 | [3, 2, 5]       | 3, 2, 5 |         |         |
| 5 | [3, 2, 5, 2, 4] | 3, 2, 5 | 5, 2, 4 |         |
| 6 | [3, 2, 2, 0, 4] | 3, 2, 2 | 2, 2, 0 |         |

Need help on the Regex for REGEXP_EXTRACT function in Presto to get the nth occurrence of number '2' and include the figures before and after it (if any)

Additional info:

  • The figures in column y are not necessary single digit.
  • Orders of the numbers are important
  • 1st, 2nd, 3rd refers to the nth occurrence of the number that I am seeking
  • Will be looking for a list of numbers, not just 2. Using 2 for illustration purpose.
IcySkullz
  • 33
  • 6

1 Answers1

0

Must it be a regular-expression?

If you see the text (VARCHAR) [1,2,3] as array-representation (JSON or internal data-type Array), you have more functions available to solve your task.

See related functions supported by Presto:

I would recommend to cast it as array of integers: CAST('[1,23,456]' AS ARRAY(INTEGER))

Finding the n-th occurrence

From Array functions, array_position(x, element, instance) → bigint to find the n-th occurrence:

If instance > 0, returns the position of the instance-th occurrence of the element in array x.

If instance < 0, returns the position of the instance-to-last occurrence of the element in array x.

If no matching element instance is found, 0 is returned.

Example:

SELECT CAST('[1,2,23,2,456]' AS ARRAY(INTEGER));
SELECT array_position(2, CAST('[1,2,23,2,456]' AS ARRAY(INTEGER)), 1); -- found in position 2

Now use the found position to build your slice (relatively from that).

Slicing and extracting sub-arrays

  1. either parse it as JSON to a JSON-array. Then use a JSON-path to slice (extract a sub-array) as desired: Array slice operator in JSON-path: [start, stop, step]

  2. or cast it as Array and then use slice(x, start, length) → array

Subsets array x starting from index start (or starting from the end if start is negative) with a length of length.

Examples:

SELECT json_extract(json_parse('[1,2,3]'), '$[-2, -1]');  -- the last two elements

SELECT slice(CAST('[1,23,456]' AS ARRAY(INTEGER)), -2, 2); -- [23, 456]
hc_dev
  • 8,389
  • 1
  • 26
  • 38
  • Thanks for commenting, may I know how does the solution find the nth occurrence of 2 and return the figures before and after 2 ( _,2 ,_). I am looking for a few numbers, not just 2, was intending of using | for Regex – IcySkullz Feb 11 '22 at 14:16
  • The figures in column y are not necessary single digit. – IcySkullz Feb 11 '22 at 14:21
  • @IcySkullz, the number of digits does not matter, as long as the elements are numbers (integers) and inside an array. The basic steps are (1) parse or cast text to array, (2) find occurrence, (3) build-slice based on the position, (4) extract slice from array – hc_dev Feb 11 '22 at 14:35
  • Thanks, its clearer now. I encountered this error "Cannot cast varchar(10) to array(integer)" when doing CAST('[1,23,456]' AS ARRAY(INTEGER)) – IcySkullz Feb 11 '22 at 17:38
  • @IcySkullz Did you research this error-message? I found help in [related question/answer](https://stackoverflow.com/questions/56167704/unable-to-convert-varchar-to-array-in-presto-athena). – hc_dev Feb 11 '22 at 18:39
  • yup saw that when I was googling the error. However, I don't really understand the answer as I am still trying to understand the JSON function – IcySkullz Feb 12 '22 at 07:29
  • cast(json_parse('[1,23,456]') AS array(INT)) works but stuck at array_position function. "Unexpected parameters (integer, array(integer), integer) for function array_position. Expected: array_position(array(t), t)" Seems like system only accept 2 inputs for the function – IcySkullz Feb 14 '22 at 01:50