1

i have an nvarchar(max) column and i need to extract everything between the opening a href tags and the closing a href tag. for example if the contents of my column where the following:

Here you can visit <a href="http://www.thisite.com">this link</a> or this
<a href="http://www.newsite.com">new link</a>. this is just a test to find the right answer.

then my results of my query should be:

"<a href="http://www.thisite.com">this link</a>"
"<a href="http://www.newsite.com">new link</a>"

any help would be greatly appreciated!

Jose Luis
  • 3,307
  • 3
  • 36
  • 53
james
  • 11
  • 2

2 Answers2

1

You have to use CLR User-defined function (supported in Sql Server 2005+):

Regular Expressions Make Pattern Matching And Data Extraction Easier

Artem Koshelev
  • 10,548
  • 4
  • 36
  • 68
  • I agree on the use of the CLR for this but regexes can't be used *reliably* to parse html. For a good read on that, take a look at [You can't parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)... *The
    cannot hold*
    – Lieven Keersmaekers Sep 30 '11 at 06:41
  • *HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions* – Lieven Keersmaekers Sep 30 '11 at 06:46
  • I agree, but author is not required to **parse HTML**. Since the problem is simple “to extract everything between the opening a href tags and the closing a href tag” - KISS principle should work just fine here. – Artem Koshelev Sep 30 '11 at 06:49
  • I don't know the difference between parsing and extracting but for this case, regexes might be sufficient. I just wanted to *warn* OP about possible problems he *might* encounter using regexes. Note that writing a decent regex can be quite tricky and if OP is going to use the CLR, afaik .NET already provides libraries that can parse HTML. – Lieven Keersmaekers Sep 30 '11 at 07:06
0
declare @a varchar(max) = 'Here you can visit <a href="http://www.thisite.com">this link</a> or this <a href="http://www.newsite.com">new link</a>. this is just a test to find the right answer. '

;with cte as
(
select cast(1 as bigint) f, cast(1 as bigint) t 
union all 
select charindex('<a href=', @a, t), charindex('</a>', @a, charindex('<a href=', @a, t)) 
from cte where charindex('<a href=', @a, t) > 0
)
select substring(@a, f, t-f)+'</a>' from cte
where t > 1
t-clausen.dk
  • 43,517
  • 12
  • 59
  • 92