I've loaded an HTML document into a string in .NET. I have a regex that matches URLs so that I can replace them, but I need it to match ONLY URLs that are NOT fully qualified.

If this is my string:

djdjdjdjdjdj src="www.example.com/images/x.gif" dkkdkdkdk src="/images/x.gif

My result would look like this:

djdjdjdjdjdj src="subdomain.example.com/images/x.gif" dkkdkdkdk src="http://www.example.com/images/x.gif

My thinking is that I need a regex that matches strings that start with src or href and that do not contain more than one period. The regex below matches any link that has at least one period, so it is not matching them correctly.

(src|href)\=(\"(.+?)[\.](.+?)\")
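
To see why the pattern above over-matches, here is a small check written in Python (chosen only for brevity; the pattern uses nothing Python-specific and should behave the same way under .NET's Regex class). The sample string is modeled on the one in the question, with a closing quote added to the last attribute as an assumption.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'(src|href)\=(\"(.+?)[\.](.+?)\")')

# Sample input modeled on the question; the final closing quote is an
# assumption added so that both attributes are well-formed.
html = 'djdj src="www.example.com/images/x.gif" dkdk src="/images/x.gif"'

# Collect the quoted attribute value of every match.
matches = [m.group(2) for m in pattern.finditer(html)]
print(matches)
```

Both attribute values are captured, because the pattern only requires at least one period somewhere before a closing quote.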

Thanks for any info. I'm coding this in C#, but I only need the regex.

Gumbo
JC.
  • Have you considered that there exist domains such as mydomain.co.uk? Could this be an issue for you? – Mark Byers Jan 05 '10 at 13:46
  • Don’t use regular expressions for a non-regular language like HTML. Use an HTML parser instead. – Gumbo Jan 05 '10 at 13:48
  • If the attribute contains the domain name, it *must* start with "http://" or similar, or else it will be treated as a local (relative) path. Unless you have a folder in your app named "www.example.com"? So that's what you really need to look for, not whether the URL contains more than 1 period. For that matter, what about an intranet scenario where it references http://myserver/ (no period at all in the domain)? – GalacticCowboy Jan 05 '10 at 15:39
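
The last comment's suggestion (test for a leading scheme rather than counting periods) could be sketched roughly as follows, again in Python for brevity. The base URL, the `qualify` helper, and the exact scheme pattern are assumptions made for illustration; the negative lookahead `(?!...)` is also available in .NET.

```python
import re
from urllib.parse import urljoin

# Hypothetical base URL, chosen for illustration only.
BASE = "http://www.example.com/"

# Match src/href values that do NOT start with a scheme ("http:", "mailto:", ...)
# and do NOT start with "//" (protocol-relative URLs).
pattern = re.compile(r'(src|href)="(?![a-zA-Z][a-zA-Z0-9+.\-]*:|//)([^"]*)"')

def qualify(html, base=BASE):
    """Rewrite relative src/href values into absolute URLs under `base`."""
    return pattern.sub(
        lambda m: f'{m.group(1)}="{urljoin(base, m.group(2))}"',
        html,
    )

print(qualify('<img src="/images/x.gif"> <a href="http://myserver/page">'))
```

Under this rule a value such as "www.example.com/images/x.gif" counts as relative, which mirrors how a browser would resolve it, and http://myserver/ is left alone even though its host contains no period.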

2 Answers

I would suggest you try something like the HTML Agility Pack parser, as recommended many times on this site: Looking for C# HTML parser

Also it wouldn't hurt to read this obscure blog entry by some Metallica fan before you start.
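
As a rough illustration of the parser-based approach, here is a sketch that uses Python's standard-library `html.parser` in place of the HTML Agility Pack (a deliberate substitution, not the library's API). The `LinkCollector` class name and the sample markup are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class LinkCollector(HTMLParser):
    """Collect src/href values that have neither a scheme nor a host."""

    def __init__(self):
        super().__init__()
        self.relative = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                parts = urlsplit(value)
                # A fully qualified URL has a scheme; "//host/path" has a netloc.
                if not parts.scheme and not parts.netloc:
                    self.relative.append(value)

collector = LinkCollector()
collector.feed(
    '<img src="/images/x.gif">'
    '<a href="http://www.example.com/a">x</a>'
    '<img src="//cdn.example.com/y.png">'
)
print(collector.relative)
```

The parser deals with quoting, attribute order, and tag boundaries, so the only remaining question is how to classify each URL.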

Cœur
Tj Kellie
Warning: HTML + regex = round peg + square hole

That being said, here's the hammer you requested

(src|href)\=(\"[^."]*\.?[^."]*\")
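
A sketch of how a pattern in this spirit might be applied, in Python for brevity. The variant used here allows any number of non-period characters after the optional period (so that an extension such as "gif" matches) and captures the value without its quotes. The base URL and the `prefix_relative` helper are hypothetical.

```python
import re

# Hypothetical base URL, chosen for illustration only.
BASE = "http://www.example.com"

# Attribute values containing at most one period; group 2 is the bare value.
pattern = re.compile(r'(src|href)\=\"([^."]*\.?[^."]*)\"')

def prefix_relative(html, base=BASE):
    """Prepend `base` to every src/href value that has at most one period."""
    return pattern.sub(lambda m: f'{m.group(1)}="{base}{m.group(2)}"', html)

html = 'djdj src="www.example.com/images/x.gif" dkdk src="/images/x.gif"'
result = prefix_relative(html)
print(result)
```

Only the second attribute is rewritten. The same rule also skips relative paths that happen to contain two or more periods, such as "../images/x.gif", which is a limitation of counting periods at all.
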
Zen