0

I have one query that returns the following HTML element:

<span CreatedFromTXTextControl="1" style="font-family:Tahoma;font-size:8pt;">
<p lang="en-US" style="text-indent:0pt;margin-left:0pt;margin-top:0pt;margin-bottom:6pt;margin-right:0pt;line-height:100%;" xmlns="http://www.w3.org/1999/xhtml"><span style="font-family:'Tahoma';font-size:8pt;">I am not sure how to extract this text using SQL.</span></p>
</span>

My query currently looks something like this:

SELECT text
FROM MyTable

How shall I change this query to return only the text inside the span element shown above?

In the example above the result should be the string: I am not sure how to extract this text using SQL..

DBMS implementation: MS SQL Server.

Zizzipupp
  • 1,301
  • 1
  • 11
  • 27
  • 2
    I wouldn't do this in SQL only. I would analyze it via a server-side language and process this as actual HTML to get the value because there are DOM processing libraries that would be able to handle this better than SQL. For example, DOM processing libraries will be able to handle/account for invalid HTML, where SQL would just be analyzing a string of text which may or may not be valid HTML. – WOUNDEDStevenJones Nov 02 '20 at 15:55
  • But I am 100% confident this is valid HTML – Zizzipupp Nov 02 '20 at 15:57
  • 2
    This might be a good place to remind of [this famous SO answer](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) from back in SO early days... – GMB Nov 02 '20 at 15:57
  • Can you provide other examples of row data? Are they all consistently `

    text

    `? Do they all have these extra style attributes? If there's a lot of consistency here you may be able to "cheat" parsing the HTML and split up strings. For instance, maybe you can find the location of ``, split the string, and replace `` with an empty string, and you'll be left with your text.
    – WOUNDEDStevenJones Nov 02 '20 at 16:00
  • Yep, they all have the same format and style attributes, the only thing changing being the length of the text string inside the span element. – Zizzipupp Nov 02 '20 at 16:01
  • Please tag your question with your specific DBMS implementation (e.g. MySQL, PostgreSQL, Oracle, MS SQL Server, IBM DB2, etc.) `Substring`, `Position`, `Charindex` etc are all useful functions to parse strings. If you search the web for `sql extract string after character` there are many examples that should get you there, but without knowing which SQL implementation we can't give you a direct answer. – WOUNDEDStevenJones Nov 02 '20 at 16:05
  • Sorry, good point. The implementation is MS SQL Server. – Zizzipupp Nov 02 '20 at 16:32

1 Answers1

1

Try this:

DECLARE @Data table ( html varchar(4000) );
INSERT INTO @Data VALUES ( 
    '<span CreatedFromTXTextControl="1" style="font-family:Tahoma;font-size:8pt;">
      <p lang="en-US" style="text-indent:0pt;margin-left:0pt;margin-top:0pt;margin-bottom:6pt;margin-right:0pt;line-height:100%;" xmlns="http://www.w3.org/1999/xhtml">
        <span style="font-family:Tahoma;font-size:8pt;">I am not sure how to extract this text using SQL.</span>
      </p>
     </span>'
);

SELECT
    t.f.value( '.', 'varchar(255)' ) AS span_value
FROM @Data AS d
CROSS APPLY (
    SELECT CAST ( d.html AS xml ) AS h
) AS x
CROSS APPLY x.h.nodes( '//span/*' ) t(f);

Returns

+---------------------------------------------------+
|                   span_value                      |
+---------------------------------------------------+
| I am not sure how to extract this text using SQL. |
+---------------------------------------------------+

Using SQL Server's XML datatype can get you what you need--assuming your HTML is valid.

critical_error
  • 6,306
  • 3
  • 14
  • 16