I am querying data in Hive and extracting a code from a column. I recently discovered that due to data entry/business process issues, users have been overloading the field and entering two separate job codes when there should only be one.
Sample data from the column:
NOV2 WAA UW FOO DISPLAY_W2100008/ SOMETHING DISPLAY W2100106
I've been using REGEXP_EXTRACT(column,'([A-Z]\\d{7})',1) as id
will correctly extract the first code W2100008
, but I am unable to extract the second code W21001061
.
I want to use REGEXP_EXTRACT twice and alias id_1 and id_2 so we can analyze the second codes referenced. Is there a way to reference the second time the pattern is matched?
REGEXP_EXTRACT(column,'_([A-Z]\\d{7})',0)
returns the first match
REGEXP_EXTRACT(column,'([A-Z]\\d{7})',1)
returns the first match
REGEXP_EXTRACT(column,'([A-Z]\\d{7})',2)
returns an error
The extracted value will be used to join to another column, so the result needs to return a single value, not an array.