handling HTML data in Oracle query

Question

I have a table with ID,TEXT,etc columns Here TEXT is clob column which contains data in HTML FORMAT

SELECT ID,TEXT FROM TABLE WHERE ID=1000

I'm getting output for text column as below

<p>NAME:&nbsp;&nbsp;XXX<br />Company Name: &nbsp;YYYYY<br />Location:&nbsp;ZZZ, 22 Z1Z1Z1, Z2Z2Z2,Z3Z3Z3, 0000024, IND<br />Type:&nbsp;PrePA<br />
Team:&nbsp;Team1, Dues tamble <br />Date:&nbsp;January 25 &ndash; 26, 2016<br />Rating: &nbsp;Tr 2<br />Number:&nbsp;8554342</p>
<p><u>Observ: <br /></u>There were (6) major and (2) minor .<br />&nbsp;<br />MAJOR</p><ul>    <li>Sample Text_1.</li>    <li>Sample Text_2.</li>    
<li>Sample Text_33.</li>    <li>Sample Text_4.</li>    <li>Sample Text_5.</li></ul><p>MINOR</p><ul>    <li>Sample Text_7</li>    
<li>Sample Text_8<br />    &nbsp;</li></ul><p><b> Background</b><br />&nbsp;</p>

I need output like below

NAME:  XXX
Company Name:  YYYYY
Location: ZZZ, 22 Z1Z1Z1, Z2Z2Z2,Z3Z3Z3, 0000024, IND
Type: PrePA
Team: Team1, Dues tamble 
Date: January 25 – 26, 2016
Rating:  Tr 2
Number: 8554342

Observ: 
There were (6) major and (2) minor .

MAJOR

Sample Text_1. 
Sample Text_2. 
Sample Text_33. 
Sample Text_4. 
Sample Text_5.
MINOR

Sample Text_7 
Sample Text_8

Background

Which means the query should handle HTML data and should provide out like exact data like above.

I have a query beolw which handling somwhat okey but it's not handling the

exactly the HTML tags.
select ID,
trim(regexp_replace (trim
(regexp_replace( TEXT ,'<[^>]+/>|</[^>]+>|<p>',CHR(13)||CHR(10)))
,'<[^>]+>',''))
from TABLE where ID='1000'

Acually i need to handle all the html tags in the data not only the which i showed is sample.

score 3 · Answer 1 · answered Feb 17 '16 at 14:04

Searching for a 'perfect' query might be a lost cause; browser (including text browsers, as recommended elsewhere) have spent years working out all the kinks and edge cases.

If you really can't use an external resource you could walk the DOM via the dbms_xmldom package. Here's a demo using an anonymous block and your sample value, which prints to the screen (if you have that enabled); but you could easily adapt it in to a function that returns a CLOB, or whatever you need as an end result.

set serveroutput on
set define off

declare     
  l_element xmldom.domelement;
  l_document xmldom.domdocument;
  l_text clob;

  procedure print_node (p_node xmldom.domnode) is
    l_nodes dbms_xmldom.domnodelist;
  begin
    -- print out any plain text
    if dbms_xmldom.getnodetype(p_node) = 3
        and dbms_xmldom.getnodename(p_node) = '#text' then
      dbms_output.put_line(dbms_xmldom.getnodevalue(p_node));
    end if;
    -- just to match your expected output, add a blank line for p tags
    if dbms_xmldom.getnodetype(p_node) = 1
        and dbms_xmldom.getnodename(p_node) = 'p' then
      dbms_output.new_line;
    end if;
    -- get any child nodes
    l_nodes := dbms_xmldom.getchildnodes(p_node);
    -- process each node in turn, recursively
    for i in 0..dbms_xmldom.getlength(l_nodes) - 1 loop
      print_node(dbms_xmldom.item(l_nodes, i));
    end loop;
  end print_node;

begin           
  l_text := '<p>NAME:&nbsp;&nbsp;XXX<br />Company Name: &nbsp;YYYYY<br />Location:&nbsp;ZZZ, 22 Z1Z1Z1, Z2Z2Z2,Z3Z3Z3, 0000024, IND<br />Type:&nbsp;PrePA<br />Team:&nbsp;Team1, Dues tamble <br />Date:&nbsp;January 25 &ndash; 26, 2016<br />Rating: &nbsp;Tr 2<br />Number:&nbsp;8554342</p>
<p><u>Observ: <br /></u>There were (6) major and (2) minor .<br />&nbsp;<br />MAJOR</p><ul>    <li>Sample Text_1.</li>    <li>Sample Text_2.</li>    
<li>Sample Text_33.</li>    <li>Sample Text_4.</li>    <li>Sample Text_5.</li></ul><p>MINOR</p><ul>    <li>Sample Text_7</li>    
<li>Sample Text_8<br />    &nbsp;</li></ul><p><b> Background</b><br />&nbsp;</p>'; 

  -- wrap the fragment in a root node so it parses, and run through
  -- utl_i18n.unescape_reference to get rid of &nbsp; etc.
  l_document := dbms_xmldom.newdomdocument('<html>'
    || utl_i18n.unescape_reference(l_text) || '</html>');
  -- get the root element (which is now the added html)
  l_element := dbms_xmldom.getdocumentelement(l_document);
  -- call the recursive procedure to process this node
  print_node(dbms_xmldom.makenode(l_element));
end;
/

Which produces:

NAME:  XXX
Company Name:  YYYYY
Location: ZZZ, 22 Z1Z1Z1, Z2Z2Z2,Z3Z3Z3, 0000024, IND
Type: PrePA
Team: Team1, Dues tamble 
Date: January 25 – 26, 2016
Rating:  Tr 2
Number: 8554342

Observ: 
There were (6) major and (2) minor .
 
MAJOR
Sample Text_1.
Sample Text_2.
Sample Text_33.
Sample Text_4.
Sample Text_5.

MINOR
Sample Text_7
Sample Text_8
     

 Background

Good example, but +1 for the first line. Trying to write a regex to parse HTML warrants an obligatory link to one of the top posts - http://stackoverflow.com/a/1732454/4499055 — ruudvan, Feb 17 '16 at 14:15
Sorry my comment was actually for OP. It appears that he wants a regex query based on the example sql in the question and his comments on the other reply. I actually like your idea of using dbms_xmldom for this. — ruudvan, Feb 17 '16 at 14:29

score 3 · Answer 2 · answered Feb 17 '16 at 15:26

Create a (simple) HTML parser.

Oracle Setup:

SET DEFINE OFF;

CREATE OR REPLACE AND COMPILE JAVA SOURCE NAMED HTMLTOTEXT AS
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public class HTMLToText {
    private HTMLToText(){};

    private static class TextStripper extends ParserCallback {
        private final StringBuffer buffer = new StringBuffer();

        private static final HashMap<Tag, String> START_TAGS = new HashMap<Tag, String>();
        private static final HashMap<Tag, String> END_TAGS = new HashMap<Tag, String>();
        private static final String NEWLINE = "\r\n";

        static {
            START_TAGS.put(Tag.BR,    NEWLINE);
            START_TAGS.put(Tag.P,     NEWLINE);
            START_TAGS.put(Tag.LI,    NEWLINE);
            START_TAGS.put(Tag.DT,    NEWLINE);
            START_TAGS.put(Tag.DL,    NEWLINE);
            START_TAGS.put(Tag.OL,    NEWLINE);
            START_TAGS.put(Tag.UL,    NEWLINE);
            START_TAGS.put(Tag.TR,    NEWLINE);
            START_TAGS.put(Tag.TD,    "\t");
            START_TAGS.put(Tag.TH,    "\t");

            END_TAGS.put(Tag.P,     NEWLINE);
            END_TAGS.put(Tag.LI,    NEWLINE);
            END_TAGS.put(Tag.DD,    NEWLINE);
            END_TAGS.put(Tag.DL,    NEWLINE);
            END_TAGS.put(Tag.OL,    NEWLINE);
            END_TAGS.put(Tag.UL,    NEWLINE);
            END_TAGS.put(Tag.TR,    NEWLINE);
        }

        private boolean newline = true;

        @Override
        public void handleText( final char[] data, final int pos ){

            buffer.append(data);
            newline = false;
        }
        @Override
        public void handleStartTag( final Tag tag, final MutableAttributeSet attribute, final int pos ){
            if ( !newline && START_TAGS.containsKey( tag ) )
            {
                final String value = START_TAGS.get( tag );
                buffer.append( value );
                newline = value.equals(NEWLINE);
            }
        }
        @Override
        public void handleEndTag( final Tag tag, final int pos ){
            if ( !newline && END_TAGS.containsKey( tag ) )
            {
                final String value = END_TAGS.get( tag );
                buffer.append( value );
                newline = value.equals(NEWLINE);
            }
        }
        @Override
        public void handleSimpleTag( final Tag tag, final MutableAttributeSet attribute, final int pos ){
            handleStartTag( tag, attribute, pos );
            handleEndTag( tag, pos );
        }
        @Override
        public void handleComment( final char[] data, final int pos ){}
        @Override
        public void handleError( final String errMsg, final int pos ){}

        public String getText(){
            return buffer.toString();
        }
    }

    private static final ParserDelegator DELEGATOR = new ParserDelegator();

    public static String extractText( final String html ) throws IOException{
        TextStripper stripper = new TextStripper();
        DELEGATOR.parse( new StringReader( html ), stripper, true );
        return stripper.getText();
    }
};
/

CREATE OR REPLACE FUNCTION HTML_TO_TEXT(
  in_html  IN VARCHAR2
) RETURN VARCHAR2
AS LANGUAGE JAVA NAME 'HTMLToText.extractText( java.lang.String ) return java.lang.String';
/
SHOW ERRORS;
/

Query:

SELECT HTML_TO_TEXT( '<html><body>Text<p>para<br>graph</p><table><tr><th>R1</th><td>C1</td><td>C2</td></tr><tr><th>R2</th><td>C1</td><td>C2</td></tr></table></body></html>' ) FROM DUAL;

Output:

HTML_TO_TEXT('<HTML><BODY>TEXT<P>PARA<BR>GRAPH</P><TABLE><TR><TH>R1</TH><TD>C1</
--------------------------------------------------------------------------------
Text                                                                            
para                                                                            
graph                                                                           
R1  C1  C2                                                                          
R2  C1  C2

this is working good. but it's not accepting CLOB columns. I have CLOB as input. what change needs to be done on this code? — sureshhh, May 02 '16 at 07:08
I don't have time to test this but try changing the parameter and return types of the final function definition from `VARCHAR2` to `CLOB`. — MT0, May 02 '16 at 08:36
I tried those initla things. But receiving errors like "ORA-00932: inconsistent datatypes: expected a value at argument position 1 that is convertible to a Java java.lang.String got an Oracle CLOB".Since I'm not well in Java not able do changes in your code. — sureshhh, May 02 '16 at 09:13
how can implement unescapeHtml(writer, data); function in tha above answer? — sureshhh, May 04 '16 at 13:23

score 0 · Answer 3 · answered Feb 17 '16 at 12:55

0

Consider running the HTML through lynx -stdin to render it as plain text. Lynx is a web browser for text consoles.

answered Feb 17 '16 at 12:55

neuhaus

3,886
1
10
27

Actually I need a query in oracle to handle html tags and produce output without the html tags but with the meaning of the html tags. – sureshhh Feb 17 '16 at 13:04
What you need is something that can render HTML as text. Probably including stuff like tables, alignment etc. You need a text browser. Is there a way to pipe the query result into an external command that acts as a filter? That would do the trick. – neuhaus Feb 17 '16 at 13:07
I cannot use ant text browser. I need to handle the html tags in the query itself. – sureshhh Feb 17 '16 at 13:24
You can try a best effort solution I suppose. [Here](https://stackoverflow.com/questions/30756921/remove-simple-html-tags-from-string-in-oracle-via-regexp-explanation-needed)'s a question that deals with removing HTML tags. – neuhaus Feb 17 '16 at 13:30
That solution is not perfect. Even I have the better query than that. – sureshhh Feb 17 '16 at 13:50
I need a perfect query which can handle HTML tags with their meaning. – sureshhh Feb 17 '16 at 13:56
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/103737/discussion-between-sureshhh-and-neuhaus). – sureshhh Feb 17 '16 at 14:05

handling HTML data in Oracle query

3 Answers3

Linked