1

I have the following table1 with attribute co:

|-----------------------------------------
| co
|-----------------------------------------
| fsdsdf "This one" fdsfsd ghjhgj "sfdsf"
| Just This 
|-----------------------------------------

In case there are quotation mark - I would like to get the first occurrence content. If there is no quotation mark I would like to return the content as is. For the above example: For the first line - This one For the second line - Just This I have SQL code in Impala that solves the first case:

select regexp_extract (co, '"([^"]*")',1) from table1

How can I generalize it to detect and return the required results for the next case?

Mustofa Rizwan
  • 10,215
  • 2
  • 28
  • 43
Avi
  • 2,247
  • 4
  • 30
  • 52
  • I guess you can use `COALESCE` or `IF()` in your sql, to return the hole string if not matched. – Puggan Se Jan 18 '17 at 08:35
  • SQL isn't the best tool for doing this operation. How did you end up having to scrub this data inside Impala? – Tim Biegeleisen Jan 18 '17 at 08:38
  • The data is already in impala. I would like to use REGEX functionality and not the SQL functionality for the above since it is more efficient. I think the solution can use the OR expression in REGEX something like: http://stackoverflow.com/questions/8020848/and-or-operator-in-regular-expression – Avi Jan 18 '17 at 08:44

3 Answers3

2

You can not generalize it in impala. As far as the problem you are having it requires OR | implementation in your regex. With regex_extract you need to put capture group no. in the end . e.g.

select regexp_extract (co, '"([^"]*")',1) from table1

But with | operand in a regex, capture group will have to be different for both case. Which you can not define in your regex_extract method.

Say if (A)|(B) is your regex then for your first case capture group will be 1 and for your second case capture group will be 2 . But you can not put both 1 and 2 in your regex_extract syntax to date.

The Generic regex syntax would be (which i guess won't work in impala grouping):

^(?!.*")(.*)$|^[^"]*"(.*?)".*$

Watch out the capture groupings

In the link , you will see "This One" is captured as group 2 Where as Just this is captured as group 1

Mustofa Rizwan
  • 10,215
  • 2
  • 28
  • 43
1

Check This using union.

 select regexp_extract (co, '"([^"]*")',1) from table1
 union
 select co from table1 where co like '"%"'
Mr. Bhosale
  • 3,018
  • 1
  • 17
  • 34
1

You can use an if function and put RegEx functions inside for the arguments. So,

if(regexp_like(co,'"'),
   regexp_extract(co,'"([^"]*)',1), co)