I am using this regex:

\b(((\S+)?)(@|mailto\:|(news|(ht|f)tp(s?))\://)\S+)\b

to match this string of text (yes, it includes escaped HTML):

&lt; ahref="http://www.somesite.com/" target="_blank"&gt;

But when I run it in Expresso (or any other regex program), all I retrieve is:

ahref="http://www.somesite.com

I need the whole string, including &lt; and target="_blank"&gt;

What am I missing in my Regex to make this work?

Rowland Shaw
Isaiah Nelson
  • Don't use regexes to parse HTML code. – m0skit0 Nov 15 '11 at 17:13
  • Your question is incomplete and a candidate for closure. As it stands .* is correct, but I am 100% sure you don't want this. – FailedDev Nov 15 '11 at 17:16
  • If you want to use regex to parse html (which *is* possible), read this before you do: http://stackoverflow.com/questions/4284176/doubt-in-parsing-data-in-perl-where-am-i-going-wrong/4286326#4286326 – Łukasz Wiatrak Nov 15 '11 at 17:18
  • @Lucasus I am not worried about this being HTML. In .NET this is just a string, and I need the Regex to identify this exact string in a document so I can remove it from a file that I am consuming. But this Regex isn't matching the entire string. How do I add in the < and "_blank">? That's all I am asking. – Isaiah Nelson Nov 15 '11 at 17:23
  • This is a duplicate of http://stackoverflow.com/questions/8127532/regex-for-including-escaped-html-tags-with-other-regex/8130040#8130040 from yesterday. It had an answer in it, did you check that? –  Nov 15 '11 at 17:59
  • @sln The solution assumed I was using Perl. I am only asking for a solution that involves pure Regex. I will handle what I am doing with it in the language of my choice. My question is restructured in this post to be as deliberate as possible: what Regex do I need in order to include < and target="_blank"> in the match from the Regex I have been using? And FYI, the regex in this post is different from the other one. – Isaiah Nelson Nov 15 '11 at 18:17

2 Answers

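Reading your regex: it starts at a word boundary (\b, the position between a word character and a non-word character such as whitespace or punctuation, or the start/end of the input), then takes an optional run of non-whitespace (\S+), then something that looks like the start of a URI (@, mailto:, news://, http://, ftp:// and so on), then more non-whitespace up to the last word boundary it can reach. Every character it consumes has to be non-whitespace, so your pattern is explicitly looking for something that does not contain the spaces that sit inside the text you say you're after. That is why the match cannot reach back over the space to the leading bracket or forward over the space to target="_blank". The closing \b also forces the match to end on a word character, which is why the /" after .com is dropped.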
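
To make the truncation concrete, here is a minimal sketch in Python (chosen only for illustration; the question targets .NET, but \b and \S behave the same way for this input). The pattern is copied unchanged from the question. The sample text is an assumption: it is the string from the question with its angle brackets written as the entities &lt; and &gt;, since the question says the HTML is escaped.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r'\b(((\S+)?)(@|mailto\:|(news|(ht|f)tp(s?))\://)\S+)\b')

# Assumed input: the question's sample string with escaped angle brackets.
TEXT = '&lt; ahref="http://www.somesite.com/" target="_blank"&gt;'

match = PATTERN.search(TEXT)
matched_text = match.group(0)

# The match starts after the space that follows "&lt;" and stops at the last
# word boundary before the space that precedes "target".
print(matched_text)  # ahref="http://www.somesite.com
```

The printed result is the same fragment reported in the question: nothing in the pattern is allowed to step over a space.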

Rowland Shaw

"What am I missing in my Regex to make this work?"
&lt;[\s\S]*?\b(((\S+)?)(@|mailto\:|(news|(ht|f)tp(s?))\://)\S+)\b[\s\S]*?&gt;
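
As a rough check of the pattern above, the following Python sketch (Python is only an illustration language here; the question targets .NET) runs it against a made-up document and then strips the match, which is what the asker says the goal is. DOCUMENT and the surrounding words are hypothetical, and the tag is assumed to appear with its angle brackets escaped as &lt; and &gt;.

```python
import re

# The pattern from this answer, unchanged: the question's pattern wrapped in
# a lazy "&lt; ... &gt;" envelope so that spaces on either side are allowed.
TAG_PATTERN = re.compile(
    r'&lt;[\s\S]*?'
    r'\b(((\S+)?)(@|mailto\:|(news|(ht|f)tp(s?))\://)\S+)\b'
    r'[\s\S]*?&gt;'
)

# Hypothetical document text built around the question's sample string.
TAG = '&lt; ahref="http://www.somesite.com/" target="_blank"&gt;'
DOCUMENT = 'Visit ' + TAG + 'our site for details.'

found = TAG_PATTERN.search(DOCUMENT).group(0)   # the whole escaped tag
cleaned = TAG_PATTERN.sub('', DOCUMENT)         # document with the tag removed

print(found)
print(cleaned)
```

One caveat worth keeping in mind: [\s\S]*? is free to run past an earlier &gt;, so if some other escaped tag comes before the link in the same text, the match will start at that earlier &lt; and swallow everything up to the link's closing &gt;.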