I created a simple word search in a web application that goes through all documents stored in our Oracle 12c database and displays links to the documents that contain a specific word. In addition, it orders the list of documents by the number of occurrences in each document.

The documents are simple HTML-formatted texts stored in an NCLOB column. Here are the first few lines of one document:

<div class="content3">
<div class="content_bg">
<div class="mainbar3">
<div class="article">
<h2>Welcome to Web<br>
</h2>          
<p>This web  is categorized into several sub-webs which are accessible from the main page:</p>
<p><strong>Teams</strong></p>          
<p>This sub web is accessible for everyone in o

The "web document" is displayed in the browser upon clicking the particular link in the search output.

In the search output (the list of links to the documents) I would like to include an excerpt of each document with the search word highlighted. Note that each document can contain one or more occurrences of the word, and the search is supposed to be case insensitive. The search output contains only links to documents where the word occurs at least once.

Here is where I have got so far:

select 'Documents', 
'open_document.aspx?id=' || doc_id, 
regexp_replace(regexp_replace(web_code, '<(.|\n)*?>', ''), '(word)', '<b><span style="background-color: #ffff00;">\1</span></b>', 1, 0, 'i') web_code, 
doc_name, 
regexp_count(regexp_replace(web_code, '<(.|\n)*?>', ''), 'word', 1, 'i') occurrences 
from web_knowledge_base 
where lower(web_code) like '%word%';

This displays the links to the documents that contain the search word and the number of occurrences, which is later used to order the list of links. It also displays the HTML documents with the search word highlighted (that is the regexp_replace with the style part).

Is there any way to limit what is displayed from the HTML document (the regexp_replace part):

a) To display only the sentences that contain the search word for each occurrence

b) To display 10 characters before and after each occurrence of the search word

while still having the search word highlighted and search case insensitive?

I'd like to do this as part of the select statement if possible.
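
To make option (b) concrete, here is a minimal, untested sketch of the kind of expression I have in mind. It relies on two assumptions that I have not verified on our 12c instance: that a backreference to a group that did not take part in the match is replaced with an empty string, and that the 'n' match parameter lets '.' match newlines. The CTE name docs, the column txt, the alias excerpt and the search word 'web' are all hypothetical placeholders; in the real query txt would be the tag-stripped web_code.

```sql
-- Hypothetical sample row standing in for the tag-stripped web_code column.
with docs as (
  select 'This web is categorized into several sub-webs which are accessible from the main page.' as txt
  from dual
)
select regexp_replace(
         -- Pass 1: keep up to 10 characters on each side of every match
         -- and drop every other character (second alternative, empty \1).
         regexp_replace(txt, '(.{0,10}web.{0,10})|.', '\1', 1, 0, 'in'),
         -- Pass 2: highlight the search word inside the kept excerpts.
         '(web)',
         '<b><span style="background-color: #ffff00;">\1</span></b>',
         1, 0, 'i') as excerpt
from docs;
```

Two limitations of this sketch are already visible: the excerpts are concatenated without any separator between them, and an occurrence that starts within the 10 trailing characters of the previous one can be cut off and lost. It also does not address option (a).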

Thanks a lot!

Lukas
  • Oh, dear. [Parsing HTML with a regex](http://stackoverflow.com/a/1732454/213136). C̷ţḩu̢l͡hu fh̛ta̛g͝n͟!!! – Bob Jarvis - Слава Україні Nov 27 '16 at 15:06
  • @BobJarvis I understand that the use of regex for parsing HTML is not recommended / best practice however for my case it's working perfectly fine, so why not? It's for intranet html documents that follow the same tags / formatting / template....And my question is not really about parsing HTML because I have got that part resolved. I need to limit what is displayed before and after the search word in the parsed out output. Anyway, thanks for posting the link, it was a good read. – Lukas Nov 28 '16 at 08:24