1

I'm using Hive, and have a database of strings. I want the rows with certain words, but need word boundaries (if I'm searching for "transport", I don't want a row with "transporter"). How do I do this?

Currently, I'm doing something like the following:

hive> select * from myTable where (length(regexp_extract(column1, 'transport', 0)) > 0);

But obviously this is not incorporating any word boundaries.

Thanks in advance!

maia
  • 3,910
  • 4
  • 27
  • 34

2 Answers2

3

Hive uses regular expression syntax. You can try the following.

regexp_extract(column1, '\btransport\b')

If it fails, try double escaping your word boundaries \\b

hwnd
  • 69,796
  • 4
  • 95
  • 132
1

I know that in POSIX compliant regexes, word boundaries are like this: [[:<:]] and [[:>:]]:

[[:<:]]transport[[:>:]]

However, I don't know if it is supported by hiveQL.

Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125
  • [[:<:]]transport[[:>:]] works but only for Strings without other words in them (for some or other reason). I found something that DOES work though, since each word was either bordered by a space or "<" on the left side, or ">" on the right side, I went with: hive> select * from myTable where (column1 regexp '(.*)[ |<]'transport[ |>](.*)'); – maia Nov 18 '13 at 19:50
  • @user3005925: did you try `select * from myTable where column1 regexp '[[:<:]]transport[[:>:]]';`? Because it is strange that you must describe all the string by adding some `(.*)` after and before! – Casimir et Hippolyte Nov 18 '13 at 23:33