3

I have a text field. I need to identify the words between the pattern <a href and a>.

This pattern can be at the beginning/end/mid of the text.

with t as (
select '<a href Part of the technical Network Group www.tech.com/sites/ hh a>' as text from dual
union select '<a href www.tech.technical Network a>' as text from dual union
select 'www.tech.tech///technical <a href Network Group a>' as text from dual)
select * from t
WHERE REGEXP_LIKE(text,'(^|\W)<a href\S*','i') 

This gives me the first 2 rows results which is right. But I need to check for the word 'group' (case insensitive). How do we check the word 'group' and also the word should be with in the pattern. In this case, 1st and 3rd row should be returned.

Gary_W
  • 9,933
  • 1
  • 22
  • 40
ABY
  • 393
  • 2
  • 11

2 Answers2

1

Search for the complete pattern and then search for the word Group within the substring of that pattern. If there are be multiple matches within the text then you can use a recursive sub-query factoring clause to find them each:

Oracle Setup:

CREATE TABLE table_name ( id, text ) AS
select 1, '<a href Part of the technical Network Group www.tech.com/sites/ hh a>' from dual union all
select 2, '<a href www.tech.technical Network a>' from dual union all
select 3, 'www.tech.tech///technical <a href Network Group a>' from dual union all
select 4, '<a hrefgroup a>' FROM DUAL UNION ALL
select 5, '<a href groupa>' FROM DUAL UNION ALL
select 6, '<a href workgroup a>' FROM DUAL UNION ALL
select 7, '<a href test1 a> Group <a href test2 a>' FROM DUAL;

Query:

WITH positions ( id, text, match, position ) AS (
  SELECT id,
         text,
         REGEXP_SUBSTR(
           text,
           '(^|\W)<a href\s+.*?\s+a>(\W|$)',
           1,
           1,
           'i'
         ),
         REGEXP_INSTR(
           text,
           '(^|\W)<a href\s+.*?\s+a>(\W|$)',
           1,
           1,
           0,
           'i'
         )
  FROM   table_name
UNION ALL
  SELECT id,
         text,
         REGEXP_SUBSTR(
           text,
           '(^|\W)<a href\s+.*?\s+a>(\W|$)',
           position + 1,
           1,
           'i'
         ),
         REGEXP_INSTR(
           text,
           '(^|\W)<a href\s+.*?\s+a>(\W|$)',
           position + 1,
           1,
           0,
           'i'
         )
  FROM   positions
  WHERE  position > 0
)
SELECT id,
       text
FROM   positions
WHERE  REGEXP_LIKE( match, '\sGroup\s', 'i' );

Output:

ID | TEXT                                                                 
-: | :--------------------------------------------------------------------
 1 | <a href Part of the technical Network Group www.tech.com/sites/ hh a>
 3 | www.tech.tech///technical <a href Network Group a>                   

db<>fiddle here

MT0
  • 143,790
  • 11
  • 59
  • 117
  • This works. Thanks you. Is there a way to check this condition in a WHERE condition? I mean as part of REGEXP_LIKE. – ABY Sep 28 '19 at 05:36
1

You could just extend your regular expression like: <a href.*group.*a>.

Demo on DB Fiddle:

with t as (
    select '<a href Part of the technical Network Group www.tech.com/sites/ hh a>' as text from dual
    union all select '<a href www.tech.technical Network a>' as text from dual
    union all select 'www.tech.tech///technical <a href Network Group a>' as text from dual)
select * from t
WHERE REGEXP_LIKE(text,'<a href.*group.*a>','i') 
| TEXT                                                                  |
| :-------------------------------------------------------------------- |
| <a href Part of the technical Network Group www.tech.com/sites/ hh a> |
| www.tech.tech///technical <a href Network Group a>                    |

Note: This works as long as your text contains only one <a href ... a> pattern, which is the case in your sample data.


You can improve the regex to ensure that it matches only on the word 'group' (not on other words that contain 'group', like 'workgroup' or 'grouped'):

<a href.*\sgroup\s.*a>

This works as long as <a href is always followed by a space and a> is always preceeded by a space.

Demo on DB Fiddle

GMB
  • 216,147
  • 25
  • 84
  • 135