-1

All,

I need to write a regular expression that performs the following replacements:

(A)

src ="/folder/image.jpg"

or

src="http://www.mydomain.com/folder/image.jpg"

with

src="/cache/getCacheItem.aspx?source_url=http://www.mydomain.com/folder/image.jpg"

(B)

href="/folder/file.zip"

or

href="http://www.mydomain.com/folder/file.zip"

with

href="/cache/getCacheItem.aspx?source_url=http://www.mydomain.com/folder/file.zip"

I know I can use

(src|href).*?=['|\"](?<url>.*?)['|\"]

with a replace value of

$1="/legacy_integration/cache/getCacheItem.aspx?source_url=$2"

to catch the src=... and href=... attributes. However, I need to filter by file extension: src should match only image extensions such as jpg, png, and gif, and href should match only extensions such as zip and pdf.

Any suggestions? The problem can be summarized as: modify the above expression to match only certain file extensions, and insert the domain http://www.mydomain.com/ only if the original URL was relative, so that the output text contains the domain exactly once.

Do I need to perform this using two different regular expressions, one for source text including the domain and one without? Or can I somehow use a conditional match statement that, in combination with a replacement expression, will insert the domain or not based on whether the matched text contains the domain?

I know I can perform this using a custom match evaluator, but it seems that it may be faster/more efficient to do it within the regex itself.
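For reference, the match-evaluator approach mentioned above might look roughly like the sketch below. It is only an illustration under stated assumptions: the class name EvaluatorSketch, the method name Rewrite, and the two extension lists are hypothetical, and the cache path is the /cache/getCacheItem.aspx form used in the examples. The evaluator strips the domain if it is present and then adds it back, so the output contains it exactly once.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EvaluatorSketch
{
    private const string Domain = "http://www.mydomain.com";
    private const string CachePrefix = "/cache/getCacheItem.aspx?source_url=";

    // Assumed extension lists; adjust to the file types that should be cached.
    private static readonly string[] ImageExts = { ".jpg", ".png", ".gif" };
    private static readonly string[] FileExts = { ".zip", ".pdf" };

    // Captures the attribute name, the quote character, and the quoted URL.
    private static readonly Regex AttrRegex = new Regex(
        @"\b(?<attr>src|href)\s*=\s*(?<q>[""'])(?<url>[^""']*)\k<q>",
        RegexOptions.IgnoreCase);

    public static string Rewrite(string html)
    {
        return AttrRegex.Replace(html, m =>
        {
            string attr = m.Groups["attr"].Value;
            string url = m.Groups["url"].Value;
            string q = m.Groups["q"].Value;

            // Strip the domain if present so it can be added back exactly once.
            string path = url.StartsWith(Domain, StringComparison.OrdinalIgnoreCase)
                ? url.Substring(Domain.Length)
                : url;

            // src is checked against image extensions, href against file extensions.
            string[] allowed = attr.Equals("src", StringComparison.OrdinalIgnoreCase)
                ? ImageExts
                : FileExts;
            bool cacheable = path.StartsWith("/", StringComparison.Ordinal)
                && Array.Exists(allowed,
                    e => path.EndsWith(e, StringComparison.OrdinalIgnoreCase));

            // Leave anything else (other domains, other extensions) untouched.
            if (!cacheable) return m.Value;

            return attr + "=" + q + CachePrefix + Domain + path + q;
        });
    }
}
```

Calling EvaluatorSketch.Rewrite on the text leaves non-matching attributes as they were, and normalizes matched ones to the attr="..." form without whitespace around the equals sign.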

Suggestions/comments?

3Dave
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – AeroX Jul 30 '15 at 16:51

4 Answers

2

This comes up all the time. Regex is not an appropriate tool for parsing a non-regular grammar such as HTML. Use a real parser (like the HTML Agility Pack) to do this.

annakata
  • I don't need to parse ALL HTML, just the specified tags. I also have control over the input data and can guarantee that the input text matches the given format. Seems like overkill to involve yet another 3rd party tool here. – 3Dave Sep 14 '09 at 19:04
  • It's not overkill, it's reliability, and it doesn't matter if you parse all if you parse any. Try it, it'll help solve many problems, not just this one. – annakata Sep 14 '09 at 19:28
  • While I appreciate the utterly stable approach, this particular solution is a) working, b) a temporary solution that allows me to present a LOT of legacy ASP content in a new ASP.NET framework, and c) working. As I said, I have control over the input data and can guarantee that my regex works. If I have need of a more general solution in the future, I'll happily explore the agility pack. Thanks. =) – 3Dave Sep 15 '09 at 13:39
  • Okay, I take it back. The HtmlAgilityPack is sweet. – 3Dave Sep 16 '09 at 22:58
1

Does the following expression work?

// In a verbatim string, a literal double quote is written as "".
Regex.Replace(url,
    @"(src|href)\s*=\s*(?:'|"")((?:http://www\.mydomain\.com)?[^'""]*?\.(jpg|bmp|png))(?:'|"")",
    "$1=\"/cache/getCacheItem.aspx?source_url=$2\"");

The idea is to match the text http://www.mydomain.com optionally. Because it sits inside the second capture group, it becomes part of the $2 match text: if the domain was there originally, it makes its way into the replaced string.

David Andres
0

This pattern will match any path. If you want to constrain the path, you can add the constraint after the ?/.

(?<pre>(?:src|href)\W*=\W*(?:"|'))(?<url>(?:http://www\.mydomain\.com)?/(?<file>[^"']+))(?<post>"|')

Here's some sample code:

string pattern = "(?<pre>(?:src|href)\\W*=\\W*(?:\"|'))(?<url>(?:http://www\\.mydomain\\.com)?/(?<file>[^\"']+))(?<post>\"|')";

string test = "src =\"/folder/image.jpg\"\r\n"
            + "src=\"http://www.mydomain.com/folder/image.jpg\"\r\n"
            + "href=\"/folder/file.zip\"\r\n"
            + "href=\"http://www.mydomain.com/folder/file.zip\"";

string replacement = "${pre}/cache/getCacheItem.aspx?source_url=http://www.mydomain.com/${file}${post}";

test = Regex.Replace(test, pattern, replacement);
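
The pattern above does not yet filter by file extension. One possible way to add that in a single expression is a .NET conditional group, sketched below. This is an illustration, not a tested production pattern: the class name CacheRewriteSketch, the method name Rewrite, the group names attr, img, q and path, and the extension lists are all assumptions. The img group is set only when the attribute is src, and (?(img)...|...) then picks the image extensions for src and the file extensions for href. The optional domain is matched outside the path group, so the replacement can add it back exactly once.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CacheRewriteSketch
{
    // Assumed extension lists: jpg/png/gif for src, zip/pdf for href.
    private const string Pattern =
        @"\b(?<attr>(?<img>src)|href)\s*=\s*(?<q>[""'])" +          // attribute and opening quote
        @"(?:http://www\.mydomain\.com)?" +                          // optional domain, not captured
        @"(?<path>/[^""']*\.(?(img)(?:jpg|png|gif)|(?:zip|pdf)))" +  // path; extension depends on attribute
        @"\k<q>";                                                    // closing quote must match the opening one

    private const string Replacement =
        "${attr}=${q}/cache/getCacheItem.aspx?source_url=http://www.mydomain.com${path}${q}";

    public static string Rewrite(string html)
    {
        return Regex.Replace(html, Pattern, Replacement, RegexOptions.IgnoreCase);
    }
}
```

With this sketch, an attribute such as href="/folder/page.html" is left alone because html is not in the href extension list, and whitespace around the equals sign is dropped in the rewritten attribute.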
MyItchyChin
0

What about this?

var reg = new Regex("(/folder/[^\"]+)");
Match m = reg.Match("src=\"http://www.mydomain.com/folder/image.jpg\"");
var result = string.Format("src=\"/cache/getCacheItem.aspx?source_url=http://www.mydomain.com{0}\"", m.Groups[1].Value);
Esben Skov Pedersen
  • @Espen P: It looks like this results in URLs that always contain http://www.mydomain.com. From what I gather from the OP, David wants this domain included only if it was present in the original URL. – David Andres Sep 14 '09 at 18:42
  • I probably wasn't clear - I want the domain included whether or not it was part of the original URL. – 3Dave Sep 14 '09 at 19:05