2

I am querying data in Hive and extracting a code from a column. I recently discovered that due to data entry/business process issues, users have been overloading the field and entering two separate job codes when there should only be one.

Sample data from the column:

NOV2 WAA UW FOO DISPLAY_W2100008/ SOMETHING DISPLAY W2100106

I've been using REGEXP_EXTRACT(column,'([A-Z]\\d{7})',1) as id will correctly extract the first code W2100008, but I am unable to extract the second code W21001061.

I want to use REGEXP_EXTRACT twice and alias id_1 and id_2 so we can analyze the second codes referenced. Is there a way to reference the second time the pattern is matched?

REGEXP_EXTRACT(column,'_([A-Z]\\d{7})',0) returns the first match

REGEXP_EXTRACT(column,'([A-Z]\\d{7})',1) returns the first match

REGEXP_EXTRACT(column,'([A-Z]\\d{7})',2) returns an error

The extracted value will be used to join to another column, so the result needs to return a single value, not an array.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
Leigh
  • 21
  • 1
  • 4
  • @WiktorStribiżew Better solution is possible - extract all of them into array in Hive – leftjoin Jul 10 '18 at 18:57
  • 1
    Exact match: https://stackoverflow.com/questions/42125647/handling-multiple-matches-in-regex-in-hive – leftjoin Jul 10 '18 at 18:59
  • Another better idea is to use `REGEXP_EXTRACT(column,'[A-Z]\\d{7}.*?([A-Z]\\d{7})',1) as id2` – Wiktor Stribiżew Jul 10 '18 at 20:57
  • @leftjoin - I reviewed the link you referenced before I posted this question and I still do not see how it would solve my problem. I've updated the question to explain that the value returned will be used in a join and hence cannot be an array. – Leigh Jul 11 '18 at 19:28
  • 1
    @wiktorStribizew Thank you for your `id2` example. It worked perfectly. – Leigh Jul 11 '18 at 19:28
  • @wiktorStribizew I don't understand how this is a duplicate of matching strings in Java when I can't use Java for this task and the code referenced wouldn't run in Hive. Can you please explain why you think that this question is answered there? – Leigh Jul 11 '18 at 19:33
  • Please read the marked topic and if it didn't answer your question, edit accordingly. – revo Jul 11 '18 at 21:25
  • @LeighF. Yes, if t does not help, let know. – Wiktor Stribiżew Jul 12 '18 at 16:18

1 Answers1

0

Replace all '.*?([A-Z]\\d{7})' with delimiter(space) + ([A-Z]\\d{7}). Remove first space using trim, split by ' ' to get array:

hive> select split(trim(regexp_replace('NOV2 WAA UW FOO DISPLAY_W2100008/SOMETHING DISPLAY W2100106','.*?([A-Z]\\d{7})',' $1')),' ');
OK
["W2100008","W2100106"]

Get first element:

hive> select split(trim(regexp_replace('NOV2 WAA UW FOO DISPLAY_W2100008/ SOMETHING DISPLAY W2100106','.*?([A-Z]\\d{7})',' $1')),' ')[0];
OK
W2100008
Time taken: 0.065 seconds, Fetched: 1 row(s)

And second element is

split(trim(regexp_replace('NOV2 WAA UW FOO DISPLAY_W2100008/ SOMETHING DISPLAY W2100106','.*?([A-Z]\\d{7})',' $1')),' ')[1]

better use subquery to parse array one time.

select display_array[0] as id_1 , display_array[1] as id_2
from
(
select split(trim(regexp_replace('NOV2 WAA UW FOO DISPLAY_W2100008/ SOMETHING DISPLAY W2100106','.*?([A-Z]\\d{7})',' $1')),' ') as display_array
)s;

Use explode() if you want each element per row.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • OP only needs to get the second occurrence of the pattern, not all occurrences. Replacing, splitting, trimming and retrieving via index is too much overhead, isn't it? – Wiktor Stribiżew Jul 11 '18 at 19:50
  • @WiktorStribiżew "I want to use REGEXP_EXTRACT twice and alias id_1 and id_2 " – leftjoin Jul 11 '18 at 19:51
  • @WiktorStribiżew I agree with you. If only two elements, there is no need to get array – leftjoin Jul 11 '18 at 19:53
  • Well, I understood that literally, that OP needs a separate REGEXP_EXTRACT call to get the value separately. And I [got confirmation from OP](https://stackoverflow.com/questions/51271742/hive-regexp-extractextract-the-second-occurrence-of-a-pattern/51293140?noredirect=1#comment89563426_51271742) that my suggestion works as expected. – Wiktor Stribiżew Jul 11 '18 at 19:53