The regex below works in Hive but not in Spark, where it throws the error "dangling metacharacter * at index 3":

select regexp_extract('a|b||c','^(\\|*(?:(?!\\|\\|\\w(?!\\|\\|)).)*)');

I also tried escaping the * as \\*, but it still throws the same "dangling metacharacter * at index 3" error.

Wiktor Stribiżew
Priya
  • Try `'^(.*?)(?=\\|\\|\\w(?!\\|\\|)|$)'` – Wiktor Stribiżew Feb 26 '21 at 16:40
  • Hi @WiktorStribiżew, this post is related to [another post](https://stackoverflow.com/questions/66380891/does-regexp-extract-work-for-multiple-patterns-spark-sql) of the same OP where I suggested a solution using Tempered Greedy Token with Negative Lookahead nested in the tempering pattern. The regex passed the test cases in regex101.com but it doesn't work when used in Spark sql. Your suggested modification here cannot match the original requirements. You can take a look there and suggest a way the OP can use it in Spark sql. Counter-proposal to my answer is also welcome! – SeaBean Feb 26 '21 at 17:23
  • Ok, I'd suggest a `regexp_replace` approach here: `regexp_replace(col, '^(.*)[|]{2}.*$', '$1')`. See [this regex demo](https://regex101.com/r/EolcHu/2). – Wiktor Stribiżew Feb 26 '21 at 17:48
  • Great! That's a much simpler approach based on the underlying language syntax rather than relying solely on regex. Excellent! This syntax gets rid of the \ characters, which I think may be the reason the previous regex doesn't work when brought back to Spark sql. – SeaBean Feb 26 '21 at 18:28
  • @Wiktor This worked in Spark sql. Appreciate your help. Can you please explain it a bit? I did not understand the substitution part. – Priya Feb 26 '21 at 19:48
  • @SeaBean Let's keep the answers in separate threads. I tagged this one appropriately. – Wiktor Stribiżew Feb 26 '21 at 20:26

2 Answers

You can use

regexp_replace(col, '^(.*)[|]{2}.*$', '$1')

See the [regex demo](https://regex101.com/r/EolcHu/2).

Regex details:

  • ^ - start of string
  • (.*) - Capturing group 1 (this group value is referred to with the $1 replacement backreference in the replacement pattern): any zero or more chars other than line break chars, as many as possible, so the group runs up to the last || in the string
  • [|]{2} - double pipe (|| string)
  • .* - the rest of the line
  • $ - end of string.
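
Since Spark's regexp functions are built on Java's regex engine, the same pattern and `$1` replacement can be tried in plain Java. This is only a sketch for checking the behavior outside Spark: the class name `PipeTrimDemo` and the method `keepBeforeLastDoublePipe` are made up for illustration and are not part of Spark.

```java
// Hypothetical helper class for illustration; not a Spark API.
class PipeTrimDemo {

    // Same pattern and replacement as in the regexp_replace call above.
    // Group 1 is greedy, so it extends to the last "||" in the input.
    static String keepBeforeLastDoublePipe(String input) {
        return input.replaceAll("^(.*)[|]{2}.*$", "$1");
    }

    public static void main(String[] args) {
        // The input from the question: everything from "||" onward is dropped.
        System.out.println(keepBeforeLastDoublePipe("a|b||c"));   // a|b
        // With several "||", only the part after the last one is dropped.
        System.out.println(keepBeforeLastDoublePipe("a||b||c"));  // a||b
        // No "||" means no match, so the input comes back unchanged.
        System.out.println(keepBeforeLastDoublePipe("abc"));      // abc
    }
}
```

The whole string is matched and replaced with the contents of group 1, which is why the call works as an extraction even though it is a replace.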
Wiktor Stribiżew

This worked for me:

regexp_replace("***", "\\\*", "a")
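
To see what the regex engine itself expects, here is a small Java sketch (Spark's regexp functions use Java regex underneath). It shows that an escaped `\*` matches literal asterisks, and that a `*` with nothing to repeat produces the "Dangling meta character" error from the question. The class `EscapeStarDemo` and its methods are hypothetical names. The idea that the question's backslashes were stripped before the pattern reached the engine is an assumption, not something the post confirms, and the number of backslashes needed in Spark SQL depends on how the query string is written, so the Java literals below are not meant to be pasted into SQL.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class for illustration; not a Spark API.
class EscapeStarDemo {

    // In Java source, "\\*" is the two-character regex \* (a literal asterisk).
    static String replaceStars(String input) {
        return input.replaceAll("\\*", "a");
    }

    // Returns the engine's error description, or null if the regex compiles.
    static String compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        System.out.println(replaceStars("***"));        // aaa

        // Assumption: this is what the start of the question's pattern looks
        // like if every backslash is lost, leaving "*" right after "|".
        System.out.println(compileError("^(|*"));       // Dangling meta character '*'

        // With the backslash intact, \|* means "zero or more pipes" and compiles.
        System.out.println(compileError("^(\\|*)"));    // null
    }
}
```

If the pattern that reaches the engine still contains the backslash, the error goes away, which is consistent with the character class `[|]` in the other answer working without any escaping.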
lemon
Rajesh