0

I have a string that looks like this:

<p class="thumbnail"><img src="/media/2905/260x150.gif" alt="260x150"
                          width="260" height="150" rel="260,150" /></p>

The image and its attributes (src, alt, width, and so on) are variable. They could all change.

I'm trying to write a regex that checks whether there is a <p> tag with a CSS class of thumbnail and a child <img> node and, if so, rewrites the string to:

<p><img class="thumbnail" src="/media/2905/260x150.gif" alt="260x150"
        width="260" height="150" rel="260,150" /></p>

Quite simply, I am hopelessly lost with the regex! Can anyone provide any pointers, or even a solution?

Linus Caldwell
higgsy
  • 2
    I think this post sums up why you should avoid using regexes to do that: http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – Laurent S. Jun 10 '13 at 14:37
  • You stand no chance of accomplishing this with regex. HTML is not a regular language. See the link @Bartdude added. – xxbbcc Jun 10 '13 at 14:39
  • When you wish to use *less than* and *greater than* signs in a question, use `<` and `>`, respectively. Otherwise SO parses whatever is between them as HTML. Edited to fix that. – Geeky Guy Jun 10 '13 at 14:40
  • Every time you use Regex on HTML, little animals die. – spender Jun 10 '13 at 14:41
  • @spender Let's hope the OP doesn't hate them little animals ;-) – Nolonar Jun 10 '13 at 14:42

3 Answers

1

Try using HTML Agility Pack to parse the HTML, then rearrange the attributes when you find a match. As I wrote in my comment under your question, you stand no chance of doing this with regex if you plan to handle any kind of real-world HTML. Browsers tolerate broken HTML (missing closing tags, invalid tags, etc.) that a regex would choke on.
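To illustrate the parse-then-rearrange idea, here is a minimal sketch in Python using the standard library's html.parser rather than HTML Agility Pack itself. The names ThumbnailMover and move_thumbnail_class are made up for this example. It assumes the class attribute is exactly "thumbnail" and that the <img> is the very first node inside the <p>; anything else is emitted unchanged.

```python
from html import escape
from html.parser import HTMLParser


class ThumbnailMover(HTMLParser):
    """Re-emit HTML, moving class="thumbnail" from a <p> to its first <img> child."""

    def __init__(self):
        super().__init__()
        self.out = []     # pieces of the rewritten document
        self.held = None  # attributes of a <p class="thumbnail"> awaiting its first child

    @staticmethod
    def render(tag, attrs, self_closing=False):
        # Rebuild a start tag from the parsed (name, value) pairs.
        parts = [tag]
        for name, value in attrs:
            parts.append(name if value is None else f'{name}="{escape(value)}"')
        return "<" + " ".join(parts) + (" />" if self_closing else ">")

    def release(self):
        # The held <p> did not turn out to wrap an <img>; emit it unchanged.
        if self.held is not None:
            self.out.append(self.render("p", self.held))
            self.held = None

    def emit_start(self, tag, attrs, self_closing):
        if tag == "img" and self.held is not None:
            # Match: drop the class from <p> and put it first on <img>.
            # Simplification: any class the <img> already had is replaced.
            rest = [(n, v) for n, v in self.held if n != "class"]
            self.out.append(self.render("p", rest))
            attrs = [("class", "thumbnail")] + [(n, v) for n, v in attrs if n != "class"]
            self.held = None
        else:
            self.release()
            if tag == "p" and ("class", "thumbnail") in attrs:
                self.held = attrs  # wait to see what the first child is
                return
        self.out.append(self.render(tag, attrs, self_closing))

    def handle_starttag(self, tag, attrs):
        self.emit_start(tag, attrs, False)

    def handle_startendtag(self, tag, attrs):
        self.emit_start(tag, attrs, True)

    def handle_endtag(self, tag):
        self.release()
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.release()
        self.out.append(escape(data, quote=False))


def move_thumbnail_class(html):
    parser = ThumbnailMover()
    parser.feed(html)
    parser.close()
    parser.release()  # flush a <p> that was still held at end of input
    return "".join(parser.out)


if __name__ == "__main__":
    print(move_thumbnail_class(
        '<p class="thumbnail"><img src="/media/2905/260x150.gif" alt="260x150" '
        'width="260" height="150" rel="260,150" /></p>'
    ))
```

Because the parser works on tags and attributes rather than on raw characters, attribute order, extra whitespace inside a tag, and a missing closing tag do not affect the match. Comments and doctype declarations are not re-emitted in this sketch, so it is only suitable for small fragments like the one in the question.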

xxbbcc
0

Though it is highly recommended that you do not use regex to match HTML, I'm going to give you one that will work if the HTML you're working with is extremely consistent. Here is a Rubular demo that confirms the results below.

This regex <p><img.+class.+?\"thumbnail\".+?<\/p> will match the first and third strings below:

<p><img class="thumbnail" src="/media/2905/260x150.gif" alt="260x150" width="260"
        height="150" rel="260,150" /></p>
<p><img class="test" src="/media/2905/260x150.gif" alt="260x150" width="260"
        height="150" rel="260,150" /></p>
<p><img class = "thumbnail" src="/media/2905/260x150.gif" alt="260x150"
        width="260" height="150" rel="260,150" /></p>
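The claim above can be checked outside Rubular as well. The following is a small sketch in Python, not the answer's original demo. It assumes each sample is tested separately and uses re.DOTALL so that "." can cross the line breaks in the wrapped samples; the escapes \" and \/ from the pattern are unnecessary in a Python raw string and are dropped. The helper name matches is made up for this example.

```python
import re

# The pattern from the answer, written as a Python raw string.
# re.DOTALL is an assumption made here so "." also matches the newline
# inside each wrapped sample string.
PATTERN = re.compile(r'<p><img.+class.+?"thumbnail".+?</p>', re.DOTALL)

SAMPLES = [
    '<p><img class="thumbnail" src="/media/2905/260x150.gif" alt="260x150" width="260"\n'
    '        height="150" rel="260,150" /></p>',
    '<p><img class="test" src="/media/2905/260x150.gif" alt="260x150" width="260"\n'
    '        height="150" rel="260,150" /></p>',
    '<p><img class = "thumbnail" src="/media/2905/260x150.gif" alt="260x150"\n'
    '        width="260" height="150" rel="260,150" /></p>',
]


def matches(samples):
    """Return one boolean per sample: does the pattern match it?"""
    return [PATTERN.search(sample) is not None for sample in samples]


if __name__ == "__main__":
    print(matches(SAMPLES))  # [True, False, True]
```

The same pattern stops matching as soon as the markup drifts, for example when the attribute is written with single quotes (class='thumbnail'), which is the kind of inconsistency the "extremely consistent" caveat is about.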

Let me clarify the community's position against regex and HTML. The problem with HTML is that it is, by definition, not a regular language, so its definition alone puts it out of reach of regular expressions. Consider the following HTML:

<img src="some source"></img>
<img src="some source" />

Browsers render both lines properly (strictly speaking, HTML treats img as a void element, so the closing tag in the first line is simply tolerated), but as you can see, a regex written for one form would not match the other.
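A quick way to see this with the two lines above is to try a pattern written with only one of the forms in mind. This is an illustrative sketch in Python; the pattern and the helper name finds_img are assumptions made for the example, not something from the answer.

```python
import re

# Hypothetical pattern written with only the self-closing form in mind.
SELF_CLOSING_IMG = re.compile(r'<img[^>]*/>')


def finds_img(html):
    """Return True if the self-closing pattern finds an <img> in the string."""
    return SELF_CLOSING_IMG.search(html) is not None


if __name__ == "__main__":
    print(finds_img('<img src="some source" />'))      # True
    print(finds_img('<img src="some source"></img>'))  # False
```

Covering the second form means extending the pattern, and every further variation (attribute order, quoting style, whitespace) needs yet another extension.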

Linus Caldwell
Mike Perrenoud
  • 1
    I've yet to see a single HTML file that's _extremely consistent_. :) – xxbbcc Jun 10 '13 at 14:43
  • @xxbbcc, please see my edit, which you probably had not seen yet. I clarified the accepted position. However, if the HTML is generated by some application it would be highly consistent, so to make that assumption would likely be incorrect. – Mike Perrenoud Jun 10 '13 at 14:45
  • I know what you meant. :) I simply implied that any real-world HTML is - usually - FUBAR. – xxbbcc Jun 10 '13 at 14:46
-1

The short answer is that you can't. The long answer is in Bartdude's comment. See this SO question for the theory behind it:

Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms

You may try some things that work in a very specific scope. But if you follow that path, the solution gets costlier (in wasted effort) as your project grows, until you finally hit a wall that you can't get past.

Without having seen the rest of your code, my only suggestion is that you make those images and other tags server controls whenever possible. That way, you have them as variables in your C# code, and you can apply OOP logic to your tags. Not ideal, but closer to a proper solution.

Community
Geeky Guy