17

I was wondering if someone could help me understand how to use Hive's regexp_replace function to capture groups in the regex and use those groups in the replacement string.

I have an example problem I'm working through below that involves date-munging. In this example, my goal is to take a string date that is not compatible with SimpleDateFormat parsing and make a small adjustment to get it to be compatible. The date string (shown below) needs "GMT" prepended to the offset sign (+/-) in the string.

So, Given input:

  '2015-01-01 02:03:04 +0:00' 
  -or-
  '2015-01-01 02:03:04 -1:00' 

I want output:

  '2015-01-01 02:03:04 GMT+0:00'
  -or-
  '2015-01-01 02:03:04 GMT-1:00'

Here is a simple example of a statement that I 'thought' would work, but I get strange output.

Hive query:

select regexp_replace('2015-01-01 02:03:04 +0:00', ' ([+-])', ' GMT\1');

Actual result:

2015-01-01 02:03:04 GMT10:00

Note that the "\1" should output the matched group, but instead replaces the matched group with the number "1".

Can someone help me understand the right way to reference/output matched groups in the replacement string?

Thanks!

jatal
  • 790
  • 1
  • 10
  • 19

3 Answers3

22

Hive's supported notation (at least for 0.14, and I think I recall it being this way for 0.13.x as well) for regex backreferences seems to be $1 for capture group 1, $2 for capture group 2, etc. It looks like it is based upon (and may even be implemented by) the replaceAll method from the Matcher class. This is the germane portion of that documentation:

Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.

So I think what you want is this:

select regexp_replace('2015-01-01 02:03:04 +0:00', ' ([+-])', ' GMT$1');

For example:

hive> select regexp_replace('2015-01-01 02:03:04 +0:00', ' ([+-])', ' GMT$1');
OK
2015-01-01 02:03:04 GMT+0:00
Time taken: 0.072 seconds, Fetched: 1 row(s) 
hive> select regexp_replace('2015-01-01 02:03:04 -1:00', ' ([+-])', ' GMT$1');
OK
2015-01-01 02:03:04 GMT-1:00
Time taken: 0.144 seconds, Fetched: 1 row(s)
rchang
  • 5,150
  • 1
  • 15
  • 25
1

Nither '\1' nor '$1' worked for me while trying to back-reference the captured groups in REGEXP_REPLACE. However this worked: https://www.logicbig.com/tutorials/core-java-tutorial/java-regular-expressions/group-ref-in-replacement.html

An example: (replace hash with hyphen)

hive> select REGEXP_REPLACE('foo#bar','(?<tag1>foo)#(?<tag2>bar)', '${tag1}-${tag2}');
OK
foo-bar
Time taken: 0.085 seconds, Fetched: 1 row(s)

Hope this is helpful.
VAS11
  • 31
  • 4
0

Regex with capturing group is working in my AWS Athena SQL query (thanks @rchang) :

SELECT
  regexp_replace('toto-prod-titi-prod-tata', '-prod-(.*?)-prod-', '-uat-$1-uatr-')

gives me as expected :

toto-uat-titi-uatr-tata

Ismael EL ATIFI
  • 1,939
  • 20
  • 16