
I am new to Hive and the Hadoop framework. I am trying to write a Hive query that splits a column delimited by the pipe character '|', then groups each pair of adjacent values and puts each pair on its own row.

For example, I have this table:

id mapper

1  a|0.1|b|0.2
2  c|0.2|d|0.3|e|0.6
3  f|0.6

I can split the column with split(mapper, "\\|"), which gives me an array:

id mapper

1  [a,0.1,b,0.2]
2  [c,0.2,d,0.3,e,0.6]
3  [f,0.6]

I then tried a lateral view to explode the mapper array into separate rows, but that puts every value on its own row, whereas I want one row per pair.

Expected:

id mapper

1  [a,0.1]
1  [b,0.2]
2  [c,0.2]
2  [d,0.3]
2  [e,0.6]
3  [f,0.6]

Actual:

id mapper

1  a
1  0.1
1  b
1  0.2 
etc .......

How can I achieve this?

kars89

1 Answer

I would suggest splitting the string into pairs first with split(mapper, '(?<=\\d)\\|(?=\\w)'), e.g.

split('c|0.2|d|0.3|e|0.6', '(?<=\\d)\\|(?=\\w)')

results in

["c|0.2","d|0.3","e|0.6"]

Then explode the resulting array and split each element by |.
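The two steps above can be checked outside Hive with a small Java sketch, since Hive's split() takes Java regular expressions. The class and method names here (PairSplitDemo, splitPairs) are made up for illustration, and the loop only imitates what explode followed by a second split would produce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PairSplitDemo {
    // Same regex as in the Hive call, written as a Java string literal:
    // a '|' preceded by a digit and followed by a word character.
    static final String PAIR_BOUNDARY = "(?<=\\d)\\|(?=\\w)";

    // Hypothetical helper: one [key, value] array per pair,
    // imitating explode() followed by split(pair, '\\|').
    static List<String[]> splitPairs(String mapper) {
        List<String[]> rows = new ArrayList<>();
        for (String pair : mapper.split(PAIR_BOUNDARY)) {
            rows.add(pair.split("\\|"));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Prints [c, 0.2], [d, 0.3] and [e, 0.6] on separate lines.
        for (String[] row : splitPairs("c|0.2|d|0.3|e|0.6")) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

This pattern only pairs correctly when the keys do not end in a digit, because the lookbehind treats any digit before a | as the end of a value.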

Update:

If the keys can be digits as well, and your float numbers have only one digit after the decimal point, then the regex should be extended to split(mapper, '(?<=\\.\\d)\\|(?=\\w|\\d)').
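As a rough check of that extended pattern, the Java sketch below runs it on an all-numeric input with one decimal digit, and on a value with many decimal digits to show where the one-digit assumption stops holding. The class and method names are hypothetical.

```java
import java.util.Arrays;

class DecimalPairSplitDemo {
    // Extended regex from the update: a '|' preceded by ".<digit>"
    // and followed by a word character or digit.
    static final String PAIR_BOUNDARY = "(?<=\\.\\d)\\|(?=\\w|\\d)";

    // Hypothetical helper that applies the pattern to one mapper string.
    static String[] splitPairs(String mapper) {
        return mapper.split(PAIR_BOUNDARY);
    }

    public static void main(String[] args) {
        // One digit after the decimal point: splits into three pairs.
        System.out.println(Arrays.toString(splitPairs("10|0.2|20|0.3|30|0.6")));
        // More than one digit after the decimal point: the character before
        // the final digit is not '.', so the lookbehind never matches and
        // the string comes back as a single element.
        System.out.println(Arrays.toString(
            splitPairs("6193439|0.0444035224643987|6186654|0.0444035224643987")));
    }
}
```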

Update 2:

OK, the best way is to split on every second |, regardless of what the values look like, as follows:

split(mapper, '(?<!\\G[^\\|]+)\\|')

e.g.

split('6193439|0.0444035224643987|6186654|0.0444035224643987', '(?<!\\G[^\\|]+)\\|')

results in

["6193439|0.0444035224643987","6186654|0.0444035224643987"]
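To see the every-second-pipe behaviour without a Hive session, here is a minimal Java sketch using the same pattern. It assumes the local JDK accepts the unbounded `[^\\|]+` inside the lookbehind the way Hive's runtime evidently does. The names SecondPipeSplitDemo and splitPairs are invented for this example.

```java
import java.util.Arrays;

class SecondPipeSplitDemo {
    // Regex from "Update 2" as a Java string literal. \G anchors at the end
    // of the previous match (or the start of the input), so the negative
    // lookbehind rejects a '|' that is the first one since that point.
    static final String SECOND_PIPE = "(?<!\\G[^\\|]+)\\|";

    // Hypothetical helper: splits a mapper string into "key|value" pairs.
    static String[] splitPairs(String mapper) {
        return mapper.split(SECOND_PIPE);
    }

    public static void main(String[] args) {
        // The input from the answer, plus the three-pair row from the question.
        System.out.println(Arrays.toString(
            splitPairs("6193439|0.0444035224643987|6186654|0.0444035224643987")));
        System.out.println(Arrays.toString(splitPairs("c|0.2|d|0.3|e|0.6")));
    }
}
```

Because the pattern counts pipes rather than inspecting the characters around them, it does not depend on whether the keys are letters or digits, or on how many decimal places the values have.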
serge_k
  • It works for alphanumerics, but how can I achieve this when the input is all numeric? For example, `split('10|0.2|20|0.3|30|0.6', '(?=\\w)\\|(?=\\w)')` results in `["10|0.2|20|0.3|30|0.6"]`. – kars89 Jul 02 '19 at 06:04
  • @kars89, I edited the answer; now it works for both cases. – serge_k Jul 02 '19 at 06:50
  • Thanks @serge_k, I get the idea of what you are saying. But I have more than one digit after the decimal marker, and somehow I am not able to produce the required output. My real-world input is `"6193439|0.0444035224643987|6186654|0.0444035224643987"`. Or can you give me a pointer to a regex guide? – kars89 Jul 02 '19 at 07:00
  • @kars89, I changed the regex so it splits on the second `|`. – serge_k Jul 02 '19 at 07:44
  • Perfect @serge_k, works great. Could you please explain the logic, if possible? I am not able to grasp the regex. – kars89 Jul 02 '19 at 07:59
  • The core idea is `\\G`, the end-of-previous-match anchor in Java regex. Here are some good examples: https://stackoverflow.com/questions/2708833/examples-of-regex-matcher-g-the-end-of-the-previous-match-in-java-would-be-ni . Basically, the negative lookbehind `(?<!\\G[^\\|]+)` rejects a `|` that has only non-pipe characters between it and the end of the previous match, so we ignore the first occurrence of `|` in each pair and split on the second. Honestly, I googled this regex once in the past and it works great. – serge_k Jul 02 '19 at 08:22