Ahoy,

I have a problem, see; I have strings like:

<img width="594" height="392" src="/sites/it_kb/SiteAssets/Pages/exploding%20the%20VDI%20vDesktop/VDI3.PNG" alt="" style="margin:5px;width:619px;height:232px" />

They are not consistently formatted.

I need to parse strings like this, and return the following:

<img width="594" height="392" src="/exploding%20the%20VDI%20vDesktop-VDI3.PNG" alt="" style="margin:5px;width:619px;height:232px" />

Changes:

  1. Remove everything except the immediate directory in which the image file lies.
  2. Instead of that directory being a subdirectory, prepend it onto the file name.

So if the file is currently in /blabla/bla/blaaaaah/pickles/pickle.png

then I want the IMG SRC attribute to say pickles-pickle.png

Now, I've been trying to do this with regex, but after 3 hours, I've discovered something about myself... I am awful at regex. I could be at this for weeks, and I'd never get anywhere.

Thus, I am asking this wonderful community for two things:

  1. How would you do this? Is regex even the right answer? I need to be able to parse any SRC attributes inside IMG tags (whether or not they have height/width or other attributes).
  2. What resources would you recommend for me to learn regex with .NET?

Now for the problem at hand, I suppose I could do a string.replace where I....

  1. Find the IMG tag, and get indexes of the surrounding '<' and '>'
  2. Find index of 'SRC=' and ' ' (space) between those two instances
  3. Find last index of '/' between the src and space indexes
  4. Find second to last index of '/' between src and space indexes
  5. Replace... er no, remove... everything before the second to last instance of '/'...
  6. ...String.Replace remaining '/' with '-'.
  7. ....I.. I think that'd do it?
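
Steps 3 to 6 above can be sketched roughly as follows, assuming the src value has already been pulled out of the tag (steps 1 and 2). `SrcFlattener` and `Flatten` are hypothetical names used only for illustration, and leaving paths with fewer than two slashes untouched is an assumption, since that case isn't specified:

```csharp
using System;

static class SrcFlattener
{
    // Keeps only the last directory and the file name, joined by '-'.
    // Assumption: paths with fewer than two '/' are returned unchanged.
    public static string Flatten(string src)
    {
        int last = src.LastIndexOf('/');
        if (last <= 0) return src;

        // Search backwards for the slash that precedes the last directory.
        int secondLast = src.LastIndexOf('/', last - 1);
        if (secondLast < 0) return src;

        string dir = src.Substring(secondLast + 1, last - secondLast - 1);
        string file = src.Substring(last + 1);
        return "/" + dir + "-" + file;
    }

    static void Main()
    {
        // Prints: /pickles-pickle.png
        Console.WriteLine(Flatten("/blabla/bla/blaaaaah/pickles/pickle.png"));
    }
}
```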

But DAMN that is ugly. A regex would be so much prettier, don't you think?

Any advice?

Note: I tagged this as 'homework', but it's not homework. I'm volunteering for work after-hours to save the company like 200k. This is literally the last piece of an incredibly convoluted (to me) puzzle. Of course, I don't see a penny of that 200k, but I look good doing it.

user2328093
  • You can parse HTML using regex, but it's not a best practice. See [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) for why you shouldn't do it. – Ivan Zub Oct 17 '14 at 05:30

3 Answers

To get the tag, I suggest using HtmlAgilityPack. It's safer than running a regex over an entire HTML page.

Use something like this to get the image nodes:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var imgs = doc.DocumentNode.SelectNodes("//img");

Use something like this to get/set the attributes:

foreach (var img in imgs)
{
    string orig = img.Attributes["src"].Value;
    // do replacements on orig to produce a new string, newsrc
    img.SetAttributeValue("src", newsrc);
}

So, what kind of replacements should you do? I do agree that using a Regex is much more elegant. Things like these are what it's for after all!

Something like this should do the trick:

string s = @"/sites/it_kb/SiteAssets/Pages/exploding%20the%20VDI%20vDesktop/VDI3.PNG";
string n = Regex.Replace(s,@"(.*?)\/([^\/]*?)\/([^\/]*?)$",@"/$2-$3");
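
As a quick check, here is the same pattern and replacement wrapped in a small console program and run against the paths from the question. `RegexDemo` and `Flatten` are hypothetical names for illustration only; treat this as a sketch rather than a finished implementation:

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexDemo
{
    // Same pattern and replacement string as in the snippet above.
    public static string Flatten(string src)
    {
        return Regex.Replace(src, @"(.*?)\/([^\/]*?)\/([^\/]*?)$", @"/$2-$3");
    }

    static void Main()
    {
        // Prints: /exploding%20the%20VDI%20vDesktop-VDI3.PNG
        Console.WriteLine(Flatten("/sites/it_kb/SiteAssets/Pages/exploding%20the%20VDI%20vDesktop/VDI3.PNG"));

        // Prints: /pickles-pickle.png
        Console.WriteLine(Flatten("/blabla/bla/blaaaaah/pickles/pickle.png"));

        // Fewer than two slashes: the pattern does not match, so the
        // input comes back unchanged. Prints: pickle.png
        Console.WriteLine(Flatten("pickle.png"));
    }
}
```

The pattern needs at least two slashes to match, so a bare file name or a path with a single slash passes through untouched.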

Some resources that you can use to learn C# Regexing:

dotnetperls Regex.Match

MSDN: Regex.Match method

MSDN Regex Cheat Sheet

Tyress
(?<=src=)"[^" ]*\/(?=[^\/"]*\/)

Try this. Replace the match with an empty string.

http://regex101.com/r/dZ1vT6/50

I must warn you, it's kind of a hack. HTML should not be parsed with regex.

vks

Replace matches of this pattern:

(?i)(?<=<img\s[\s\S]*?src=")(?:[^"]*\/)+(?=[^"]*\/)([^\/]*)\/([^"]+)

with:

/$1-$2
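
A minimal sketch of applying this to a whole tag in C#. `ImgTagDemo` and `Rewrite` are hypothetical names. It assumes the src value is double-quoted, and it relies on .NET allowing variable-length lookbehind. Because the lookbehind only requires some `<img` to appear earlier in the input, it may also touch a `src="..."` on a later non-img tag, so treat it as a starting point:

```csharp
using System;
using System.Text.RegularExpressions;

static class ImgTagDemo
{
    // The pattern from above, written as a verbatim string
    // (each " is doubled inside @"...").
    const string Pattern =
        @"(?i)(?<=<img\s[\s\S]*?src="")(?:[^""]*\/)+(?=[^""]*\/)([^\/]*)\/([^""]+)";

    public static string Rewrite(string html)
    {
        return Regex.Replace(html, Pattern, "/$1-$2");
    }

    static void Main()
    {
        string tag = "<img width=\"594\" height=\"392\" src=\"/sites/it_kb/SiteAssets/Pages/exploding%20the%20VDI%20vDesktop/VDI3.PNG\" alt=\"\" />";

        // Prints the tag with src="/exploding%20the%20VDI%20vDesktop-VDI3.PNG"
        Console.WriteLine(Rewrite(tag));
    }
}
```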
walid toumi