2

I have ploughed through the massive amount of queries on here relating to SUBSTRING and CHARINDEX but I cannot find one that answers my query.

I am extracting a single part of a long piece of text but it is returning 48 characters too much. E.g. in the first row of the results the length should be 44 characters rather than the 92 characters it is giving. I split out the various parts of the query into the last 3 columns to get the starting position and the length but when I put it together it doesn't work correctly. What am I doing wrong?

My code: (Updated to make consumable I hope)

    IF OBJECT_ID('tempdb.dbo.#WQuery', 'U') IS NOT NULL
DROP TABLE #WQuery
CREATE TABLE #WQuery
  ( 
    IncidentID VARCHAR(100),
    Description VARCHAR(250),
  ) 

INSERT INTO #WQuery
  (IncidentID, Description)
VALUES
('B209BBA0-9039-ED11-81AD-0050569FE3BD','<b>Please indicate your company''s export status:</b><br />New to export (less than 1 years experience)<br /><br /><b>What is the destination market for your goods/services?</b><br />United Kingdom<br /><br />'),
('13A75070-0F38-ED11-81AD-0050569FE3BD','<b>Please indicate your company''s export status:</b><br />Novice exporter (1-3 years experience)<br /><br /><b>What is the destination market for your goods/services?</b><br />Kazakhstan <br /><br />'),
('D2926CA8-EB28-ED11-81AC-0050569FE3BD','<b>Please indicate your company''s export status:</b><br />New to export (less than 1 years experience)<br /><br /><b>What is the destination market for your goods/services?</b><br />Zambia (central Africa)<br/'),
('7226B826-DF24-ED11-81AC-0050569FE3BD','<b>Please indicate your company''s export status:</b><br />Experienced exporter (3+ years experience in multiple markets)<br /><br /><b>What is the destination market for your goods/services?</b><br />Spain<br'),
('E636692C-3C22-ED11-81AC-0050569FE3BD','<b>Please indicate your company''s export status:</b><br />New to export (less than 1 years experience)<br /><br /><b>What is the destination market for your goods/services?</b><br />India<br /><br />'),
('C13937A0-EF16-ED11-81AC-0050569FE3BD','<b>Please indicate your company''s export status:</b><br />New to export (less than 1 years experience)<br /><br /><b>What is the destination market for your goods/services?</b><br />Rotterdam<br /><br /><b>')
;

select 
        i.incidentid
        , i.description
        , SUBSTRING(i.description, 
            CHARINDEX('export status:</b><br />', i.description)+LEN('export status:</b><br />'), 
                CHARINDEX('<br /><br /><b>What is the destination market',i.description) - (CHARINDEX('export status:</b><br />', i.description)-LEN('export status:</b><br />')) ) As ExportStatus 
--substring(string,start,length)--
--length is the number of characters to extract - must be positive--
--split the substring above to test the values--
                ,CHARINDEX('export status:</b><br />', i.description)+LEN('export status:</b><br />') as start
                ,CHARINDEX('<br /><br /><b>What is the destination market',i.description) as secondQStart
                ,CHARINDEX('<br /><br /><b>What is the destination market',i.description) - CHARINDEX('export status:</b><br />', i.description)-LEN('export status:</b><br />') as length
--charindex(substring,string,start)--       
from #WQuery i

DROP TABLE #WQuery;

The expected results for column 3 (export status) should be:

Expected Results

However I am getting this:

enter image description here

When I added the code from the column 'Length' into the parameter of the SUBSTRING query it gave an error.

Msg 537, Level 16, State 3, Line 3
Invalid length parameter passed to the LEFT or SUBSTRING function.

When I added the brackets around that part of the SUBSTRING the query ran but as you can see it is not returning the same number of characters and the 'length' in the last column.

I am struggling to understand why the length parameter in the SUBSTRING query is not working the same as the same code in the 'length' column. And I also do not understand why the code needs brackets in the parameter of SUBSTRING query to work or throws the error.

  • 4
    **Consumable** sample data ([Please do not upload images of code/data/errors when asking a question.](//meta.stackoverflow.com/q/285551)) and expected results will help us help you here. – Thom A Sep 28 '22 at 10:44
  • 1
    You can't parse HTML using SQL. SQL is a query language and *very* weak in text manipulation. If the HTML is well formed you *may* be able to use XML functions to extract what you want – Panagiotis Kanavos Sep 28 '22 at 10:48
  • 3
    BTW SQL Server 2008 R2 reached [end of life](https://learn.microsoft.com/en-us/troubleshoot/sql/general/end-support-sql-server-2008) 3 years ago. Why are you still using it? The earliest version in mainstream support is SQL Server 2017 which allows applying Python functions to query results. You could use a proper HTML parser to extract the data you want. In fact, any client language would be better than SQL in HTML parsing – Panagiotis Kanavos Sep 28 '22 at 10:51
  • If your text is actually valid XHTML then you can use XQuery to parse it. For example https://dbfiddle.uk/gy1NARXg – Charlieface Sep 28 '22 at 12:08
  • @Larnu I have posted the expected results in the query – Caroline Allen Sep 28 '22 at 14:03
  • 1
    @Panagiotis_Kanavos we are moving to 365 but in the meantime I have to query with what I have – Caroline Allen Sep 28 '22 at 14:04
  • We need *consumable* sample data, which means text not images – Charlieface Sep 28 '22 at 14:06
  • Your question hasn't been edited by yourself, @CarolineAllen , since I made my comment, so you can't have added the expected results and sample data in a consumable format. – Thom A Sep 28 '22 at 14:08
  • @Larnu - I did try to paste as text but it added as a picture - I will try again. And I will add some sample data - I apologise for that oversight – Caroline Allen Sep 28 '22 at 14:15
  • @Larnu - Code now edited to be consumable I hope. I am not able to use CREATE TABLE so have to use temp table. I hope you will be able to use this version. – Caroline Allen Sep 29 '22 at 06:49
  • @Charlieface - code now edited as above – Caroline Allen Sep 29 '22 at 06:49
  • @Charlieface - the text does come from a website and is saved as a text field in CRM but the XQuery does not seem to work on it. Maybe due to not being in the text. I have never used XQuery before but will look it up and learn from it. – Caroline Allen Sep 29 '22 at 07:00
  • I am having a problem adding the results as text because of the html in the text so have left as a picture for now. I am hoping that someone will be able to help me from the sample code and let me know what I am doing wrong. – Caroline Allen Sep 29 '22 at 07:17

1 Answers1

0

Your text is almost valid XHTML. You just need to fix up the ends of the lines in some cases, I assume this is a copy-paste error.

For example, <br />Spain<br should be <br />Spain<br />, and Rotterdam<br /><br /><b> should be Rotterdam<br /><br /><b />.

Now you can just use XQuery to get the right piece of text, which appears to be the first text not enclosed by any node.

SELECT ExportStatus = CAST(wq.Description AS xml).value('(/text())[1]', 'nvarchar(max)')
FROM #WQuery wq

db<>fiddle

Charlieface
  • 52,284
  • 6
  • 19
  • 43
  • Thank you so much @Charlieface. Very helpful. When I extract the field and then run it through this query it works beautifully. But running the proper query on the system throws errors. At least that tells me that the issue is in the data. Now learning about escaping/removing illegal characters and always good to learn – Caroline Allen Sep 30 '22 at 06:59
  • As I said, this only works with valid XHTML (not jus plain HTML, which allows unclosed tags) – Charlieface Sep 30 '22 at 10:15