
I have some text that I know, and I need to find the first enclosing tag around that text that has a class defined.

Example:

<table>
  <td class="foo">
    <p>...</p>
  </td>
  <td class="bar">
    <p>Text i dont know</p>
    <p>Text i know</p>
    <p>Text i dont know</p>
  </td>
</table>

I tried a lot of things. I know how to find the closing tag, but when I try to find the opening tag, my regex returns the td with class "foo" instead of the one with class "bar".

I will really appreciate your help.

Edit: I want to do this in Python. My original specification was too weak: the enclosing tag doesn't have to be a td, it can be any tag with a class specified. I don't want to "parse" HTML with regex, but I don't see any other way to do this without regular expressions. What I need is to find the first tag around that text which has a class specified.

Meph
  • What language? And what are the regexes that you have tried? – Marcus Oct 26 '11 at 09:14
  • Aaaah... parsing HTML with Regex is [THE EVIL](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)!! – xanatos Oct 26 '11 at 09:15
  • It sounds like you don't know the name of the element you want to match or the value of the `class` attribute, is that right? What if the `table` tag also had a `class` attribute? Would you want to match the whole `table` element or just the `td` element? What if the `p` tag had one? And what regex flavor are you using? That is, what programming language, editor, or other tool? – Alan Moore Oct 26 '11 at 09:36
  • @AlanMoore exactly. I want first occurrence of ANY tag with ANY class specified. – Meph Oct 26 '11 at 10:09

5 Answers


Okay, here we go!

(?s)<(\w+)[^>]*\sclass="[^"]*"[^>]*>(?:(?!</?\1\b|<\w+[^>]*\sclass="[^"]*"[^>]*>)(?:Text i know()|.))*</\1>\2

The usual way to match just one element whose name is not known in advance is (we'll assume the (?s) from here on):

<(\w+)[^>]*>(?:(?!</?\1\b).)*</\1>

The lookahead - (?!</?\1\b) - prevents the dot from matching if it happens to be the first character of a tag (opening or closing) with the same name as the element you're currently matching. In this case a class attribute is required too, so the first part becomes:

<(\w+)[^>]*\sclass="[^"]*"[^>]*>
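A small Python sketch may help show what the lookahead does. It runs the base pattern (without the class requirement) against a made-up nested snippet; the sample string and the variable names are illustrative assumptions, not something from the question.

```python
import re

# Base pattern from above; re.DOTALL plays the role of (?s).
tempered = re.compile(r'<(\w+)[^>]*>(?:(?!</?\1\b).)*</\1>', re.DOTALL)

# Same shape without the lookahead, for comparison only.
greedy = re.compile(r'<(\w+)[^>]*>.*</\1>', re.DOTALL)

sample = '<div>a<div>b</div></div>'  # hypothetical input

# The lookahead stops the dot at any <div or </div, so only the
# innermost element can match.
inner = tempered.search(sample).group(0)

# Without it, the dot runs across the nested tags.
outer = greedy.search(sample).group(0)

print(inner)
print(outer)
```

With the lookahead the match is the inner element; without it the match spans the whole string.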

The question wasn't perfectly clear on this, but I'm assuming you want to match the most immediate enclosing element with a class attribute. That is, in the following text, you want to match the td.yes-me element, not the table element.

<table class="not-me">
  <td class="not-me-either">
    <p>Text i dont know</p>
  </td>
  <td class="yes-me">
    <p>Text i dont know</p>
    <p>Text i know</p>
    <p>Text i dont know</p>
  </td>
  <td>
    <p>Text i dont know</p>
  </td>
  <td>
    <p>Text i dont know</p>
    <p>Text i know</p>
    <p>Text i dont know</p>
  </td>
</table>

That means the lookahead also has to exclude any opening tag with a class attribute. It now grows into this:

(?!</?\1\b|<\w+[^>]*\sclass="[^"]*"[^>]*>)

And finally, the element's content should include your target text (Text i know). After the lookahead succeeds, we try to match that; if we succeed, the empty capturing group following it captures an empty string. Otherwise the dot consumes the next character and the process repeats.

When it's all done matching and the closing tag has been matched, the backreference \2 confirms that the target text was seen. Since that group didn't consume any characters, the backreference doesn't either, but it still reports success if the group participated in the match.

Back-assertions (as I like to call them) don't work in all flavors, and aren't officially supported in any of them, but they work in most of the Perl-derived flavors, including Python. (The most notable exceptions are JavaScript and other ECMAScript implementations.)
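As a rough check of the Python claim, here is a minimal sketch that wraps the full regex in a helper. The function name `find_enclosing` and the shortened sample markup are assumptions made for illustration; the pattern itself is the one from the top of this answer, with the target text passed through `re.escape`.

```python
import re

def find_enclosing(html, known_text):
    """Return a match for the nearest class-bearing element around known_text, or None."""
    pattern = (
        r'(?s)<(\w+)[^>]*\sclass="[^"]*"[^>]*>'           # opening tag with a class
        r'(?:(?!</?\1\b|<\w+[^>]*\sclass="[^"]*"[^>]*>)'  # not at a same-name tag or a class-bearing tag
        r'(?:' + re.escape(known_text) + r'()|.))*'       # target text (sets group 2) or any one char
        r'</\1>\2'                                        # closing tag, then the back-assertion
    )
    return re.search(pattern, html)

html = """<table class="not-me">
  <td class="not-me-either"><p>Text i dont know</p></td>
  <td class="yes-me">
    <p>Text i dont know</p>
    <p>Text i know</p>
  </td>
</table>"""

m = find_enclosing(html, "Text i know")
if m:
    print(m.group(0))
```

On this sample the match starts at the td.yes-me opening tag and ends at its closing tag; a text that never occurs yields `None`. The lookahead is retried at every character, so this is a sketch for small inputs rather than something tuned for large documents.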

If your reaction to this answer is abject horror, don't worry, I'm not offended. ;) Inspiring you to search harder for a solution that doesn't involve regexes is a successful outcome, too. (But it does work!)
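For what it's worth, one shape a regex-free solution could take is sketched below using only `html.parser` from Python's standard library. This is not part of the regex approach above: the class name `EnclosingClassFinder`, its attributes, and the sample markup are all hypothetical, and the sketch assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser

class EnclosingClassFinder(HTMLParser):
    """Hypothetical helper: remembers the nearest open element with a class
    attribute at the first point where the known text appears."""

    def __init__(self, known_text):
        super().__init__()
        self.known_text = known_text
        self.stack = []    # (tag, class value or None) for each open element
        self.found = None  # (tag, class value) once the text has been seen

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, discarding unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.found is None and self.known_text in data:
            # Walk outward from the innermost open element.
            for tag, cls in reversed(self.stack):
                if cls is not None:
                    self.found = (tag, cls)
                    break

html = """<table class="not-me">
  <td class="not-me-either"><p>Text i dont know</p></td>
  <td class="yes-me">
    <p>Text i know</p>
  </td>
</table>"""

finder = EnclosingClassFinder("Text i know")
finder.feed(html)
print(finder.found)
```

The stack of open elements makes "most immediate enclosing element with a class" a simple walk outward, which is the part the regex has to emulate with lookaheads.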

Alan Moore
  • Alan, thanks for the perfect explanation. I wanted the most immediate tag. Sorry that I did not answer your question, but I was busy trying all these solutions out. Thanks to all of you guys, and especially to you, Alan. – Meph Oct 26 '11 at 12:22
  • I am just wondering: if I wanted to get the whole XPath, would a regular expression be a good way to do it, or is there another solution? – Meph Oct 26 '11 at 12:27
  • I don't know if this is possible with XPath, but I have very little experience with it. If I were you, I'd look into Python's [DOM API](http://docs.python.org/library/xml.dom.html), but I have even *less* experience with that. :-/ – Alan Moore Oct 26 '11 at 12:45
  • If you start to use XPath then you won't use regexes. XPath is more suited to that kind of search. – Ludovic Kuty Oct 26 '11 at 12:45
  • For example: `//*[@class = "bar" and .//text()[. = "Text i know"]]` – Ludovic Kuty Oct 26 '11 at 12:52

You have to do this in dotall mode (which Ruby, and therefore Rubular, calls multiline mode), where the dot . matches a newline.

<(\w+)[^>]*class="([^"]*)"[^>]*>(?:(?!<\/\1).)*Text i know(?:(?!<\/\1).)*<\/\1>

The construct (?!<\/\1). is used to match any character except one that is starting an end tag which has the name matched before.

Note that I have escaped slashes / here and that I didn't escape the double quotes. You might have to escape things differently. I tested it with rubular.
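A hedged Python sketch of this pattern follows, with `re.DOTALL` supplying the dot-matches-newline behaviour. The two sample strings and the variable names are made up for illustration. The second sample shows what happens when an outer element also carries a class attribute.

```python
import re

# The pattern from this answer; the escaped slashes are harmless in Python.
pattern = re.compile(
    r'<(\w+)[^>]*class="([^"]*)"[^>]*>'
    r'(?:(?!<\/\1).)*Text i know(?:(?!<\/\1).)*<\/\1>',
    re.DOTALL,
)

# Hypothetical inputs: only the td has a class / the table has one too.
plain_table = '<table>\n<td class="bar">\n<p>Text i know</p>\n</td>\n</table>'
classy_table = '<table class="outer">\n<td class="bar">\n<p>Text i know</p>\n</td>\n</table>'

first = pattern.search(plain_table)
second = pattern.search(classy_table)

print(first.group(1), first.group(2))
print(second.group(1), second.group(2))
```

In the first case the match is the td element with class "bar"; in the second the search starts matching at the table, because nothing in the pattern stops it from running across nested elements that have their own class.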

Ludovic Kuty
  • +1, but what if the `table` tag has a `class` attribute, too? (You never answered me on that, @Meph.) See my answer for details. – Alan Moore Oct 26 '11 at 11:50
  • If it has a `class` attribute with the same value as the `class` attribute of the wanted element, then my expression doesn't work. I didn't take this into consideration. Also, I used a generic class value, but I think the class value is well defined when he does his search. – Ludovic Kuty Oct 26 '11 at 12:55

Use XDocument if you can; it will make your HTML easier to navigate.

(Don't forget to wrap your HTML in a root element.)

You may have to clean up the HTML a little before it parses, but with HTML this simple that shouldn't be an issue.

Aelios
<td class=\"(\w+)\">.*?TEXT_YOU_KNOW.*?</td>

It has to be in DOTALL mode if you are using Java, or the equivalent in whatever language you use, because the dots need to match line terminators.
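In Python the equivalent flag is `re.DOTALL`. The sketch below runs the pattern above against the markup from the question, with the known text substituted for the placeholder; the variable names are illustrative assumptions. It also shows where the match begins on that input.

```python
import re

html = """<table>
  <td class="foo">
    <p>...</p>
  </td>
  <td class="bar">
    <p>Text i dont know</p>
    <p>Text i know</p>
    <p>Text i dont know</p>
  </td>
</table>"""

# re.DOTALL lets the dots cross line breaks.
pattern = re.compile(r'<td class="(\w+)">.*?Text i know.*?</td>', re.DOTALL)

m = pattern.search(html)
print(m.group(1))
```

Because the search starts at the leftmost possible opening tag and the lazy `.*?` can still expand across the first closing tag, the captured class on this input is "foo", and the match only ends at the close of the "bar" cell.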

EDIT: to match any tag, how about:

class=\"(\w+)\".*?TEXT_YOU_KNOW
kgautron
  • You may need to make your matches non-greedy `*?` – Bohemian Oct 26 '11 at 09:20
  • Making it non-greedy won't help. It will still start matching at the beginning of the `td.foo` element and stop at the end of the `td.bar` element. – Alan Moore Oct 26 '11 at 10:00
  • @KayKay thanks for your answer. I edited the question. I am interested in any tag, not just `td`: the first tag around the text I know that has a class specified. – Meph Oct 26 '11 at 10:08
  • @KayKay Thanks, this was one of my first attempts, but it does not work. I don't know why, because from my point of view it should. – Meph Oct 26 '11 at 10:58

How about:

<(?<tagname>\w+)\s*class="[^"]*"[^>]*>Text which is known</\1>
jens_profile
  • If you're going to use named groups, you should use named backreferences, too (i.e., `\k<tagname>`, not `\1`). – Alan Moore Oct 26 '11 at 09:50
  • @jens_profile Thanks for your answer, i edited my question. I am not interested only in tag, but any tag with class specified. – Meph Oct 26 '11 at 10:06
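Since the question asks for Python: Python's `re` module spells named groups as `(?P<name>...)` and named backreferences as `(?P=name)`. A minimal sketch of this answer's pattern in that syntax follows; the sample strings and variable names are made up for illustration.

```python
import re

# Same shape as the pattern in this answer, using Python's named-group syntax.
pattern = re.compile(
    r'<(?P<tagname>\w+)\s*class="[^"]*"[^>]*>Text which is known</(?P=tagname)>'
)

good = pattern.search('<p class="x">Text which is known</p>')
bad = pattern.search('<p class="x">Text which is known</div>')

print(good.group('tagname'))
print(bad)
```

Like the original, this only matches when the known text is the element's entire content, so it does not cover the layout in the question, where the text sits in a nested element.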