
Is there a way in Hive or Impala to extract a string from a delimited string, but only where the string I want doesn't match one or more patterns?

For instance, I have a field with IPs (the number varies depending on the network adapters):

169.254.182.175,192.168.0.1,10.199.44.111

I would like to extract the IP that doesn't start with 169.254. (there could be many of these) and doesn't equal 192.168.0.1.

The IPs can be in any order as well.

I tried using substr with nested CASE expressions, but because the number of IPs in the string is unknown, it didn't work out.

Could this be accomplished with regexp_extract or something similar?

Thanks,

nbk
Whitts
  • As Impala has no split function, you should read https://stackoverflow.com/questions/3653462/is-storing-a-delimited-list-in-a-database-column-really-that-bad – nbk Jul 29 '22 at 18:35

1 Answer


You may use regexp_replace with a capturing group for each pattern that you do not want to keep, and reference only the group of interest in the replacement string.

See example below in Impala (impalad version 3.4.0):

select
  addr_list,
  /*Concat is used just for visualization*/
  rtrim(ltrim(regexp_replace(addr_list,concat(
    /*Group of 169.254.*.* that should be excluded*/
    '(169\\.254\\.\\d{1,3}\\.\\d{1,3})', '|',
    /*Another group for 192.168.0.1*/
    '(192\\.168\\.0\\.1)', '|',
    /*And the group that we need to keep*/
    '(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})'
    
    /*So keep the third group in the output.
      Other groups will be replaced with empty string*/
  ), '\\3'), ','), ',') as ip_whitelist
from(values
 ('169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2' as addr_list),
 ('10.58.3.142,169.254.2.12'),
 ('192.168.0.1,192.100.0.2,154.16.171.3')
) as t
addr_list                                                          | ip_whitelist
169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2 | 10.199.44.111
10.58.3.142,169.254.2.12                                           | 10.58.3.142
192.168.0.1,192.100.0.2,154.16.171.3                               | 192.100.0.2,154.16.171.3

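regexp_extract works differently: it appears to return the requested group from the first match only. With 3 as the return group, the same regex returns nothing for cases 1 and 3, where the first address in the list is matched by one of the excluded groups, so group 3 is empty.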

select
  t.addr_list,
  rtrim(ltrim(regexp_replace(addr_list, r.regex, '\\3'), ','), ',') as ip_whitelist,
  regexp_extract(addr_list, r.regex, 3) as ip_wl_extract
from(values
  ('169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2' as addr_list),
  ('10.58.3.142,169.254.2.12'),
  ('192.168.0.1,192.100.0.2,154.16.171.3')
) as t
  cross join (
    select concat(
      '(169\\.254\\.\\d{1,3}\\.\\d{1,3})', '|',
      '(192\\.168\\.0\\.1)', '|',
      '(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})'
    ) as regex
  ) as r
addr_list                                                          | ip_whitelist             | ip_wl_extract
169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2 | 10.199.44.111            |
10.58.3.142,169.254.2.12                                           | 10.58.3.142              | 10.58.3.142
192.168.0.1,192.100.0.2,154.16.171.3                               | 192.100.0.2,154.16.171.3 |
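
The difference can be isolated with a smaller pattern. The sketch below assumes that regexp_extract reports groups from the first match only; that is an inference from the results above, not something the answer verified. The literals and the column aliases excluded_first and wanted_first are made up for illustration.

```sql
/* Assumption: regexp_extract reports groups from the first match only. */
select
  /* The first match is the excluded 169.254.* address, so group 2 does not take part */
  regexp_extract('169.254.0.2,10.0.0.5', r.regex, 2) as excluded_first,
  /* The first match is the address we want, captured by group 2 */
  regexp_extract('10.0.0.5,169.254.0.2', r.regex, 2) as wanted_first
from (
  select concat(
    '(169\\.254\\.\\d{1,3}\\.\\d{1,3})', '|',
    '(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})'
  ) as regex
) as r
```

If the assumption holds, excluded_first should come back empty while wanted_first contains the address, mirroring rows 1 and 2 of the result above.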
astentx
  • Thank you very much for explaining both of them. I used regexp_replace as you suggested and just removed any strings that match 169.254.* or 192.168.0.1, along with the commas, and I was always left with the string I wanted no matter where it was: rtrim(ltrim(regexp_replace(d.ipaddress,'(169\\.254\\.\\d{1,3}\\.\\d{1,3})|(192\.168\.0\.1)|(,)', ''))) as RegIP – Whitts Aug 01 '22 at 13:43
  • @Whitts Yes, you are right. There's no need to capture the group we do want; it is enough to replace the other text. Thank you – astentx Aug 01 '22 at 15:45
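
The simplification discussed in the comments can be sketched as follows. This is an illustrative variant, not the exact expression from the thread: it removes each unwanted address together with its leading comma, so that several remaining addresses stay comma-separated, and it adds \\b (assumed to be supported by Impala's regex engine) so that 192.168.0.1 does not also match the start of 192.168.0.10. The sample rows are reused from the answer.

```sql
select
  addr_list,
  /* Strip the comma left behind when the first address in the list was removed */
  ltrim(regexp_replace(addr_list, concat(
    /* An unwanted address starts the string or follows a comma */
    '(^|,)(',
    '169\\.254\\.\\d{1,3}\\.\\d{1,3}', '|',
    '192\\.168\\.0\\.1',
    /* Word boundary: the match must end where the address ends */
    ')\\b'
  ), ''), ',') as ip_whitelist
from(values
  ('169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2' as addr_list),
  ('10.58.3.142,169.254.2.12'),
  ('192.168.0.1,192.100.0.2,154.16.171.3')
) as t
```

Because nothing is captured, the replacement string is empty, and only a possible leading comma needs trimming afterwards.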