1

How to find all "cat"s with a regular expression?

"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems!" (c) Jamie Zawinski

Please help me find all the "cat"s inside divs with a single query :)

cat
<div>let's try to find this cat and this cat</div>
cat
<div>let's try to find this cat and this cat</div>
cat

I tried this, but it doesn't work:

(?<=<div>)((?!<\/div>)(cat|(?:.|\n))+)(?=<\/div>)

I ran into this problem while using Sublime Text, where only a single query can be made. Is it possible? If you can answer using any programming language (Python, PHP, JavaScript), I'll be glad too. Thank you!

I can find the last cat, or the first one, but I need to find all the cats that sit inside DIVs. I suppose it may be possible with other language features, but I want a single query (one line); that is what interests me most. If it's not possible, sorry for my post :)

Thanks to @revo! Very nice variant that works in Sublime Text. Let me add a second question on this topic... Can we do it for divs with class "cats", but not for divs with class "dogs"?

cat
<div class="cats">black cat, white cat</div>
cat
<div class="dogs">black cat, white cat</div>
cat
  • Do you literally just want the word "cat" or the whole tag? – Olga Jan 22 '14 at 19:28
  • 7
    [My favourite SO answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – mplungjan Jan 22 '14 at 19:29
  • olgash, yes, all cats are hidden in the divs :) – DopustimVladimir Jan 22 '14 at 19:35
  • what language do you use? – Casimir et Hippolyte Jan 22 '14 at 19:40
  • 3
    The simplest regex to find occurrences of “cat” is ……… `cat`. Unless you specify additional requirements, there’s no reason to make it more complicated. – Holger Jan 22 '14 at 19:53
  • Holger, the problem happened when I used the code editor "Sublime Text". Thank you for your answer. – DopustimVladimir Jan 22 '14 at 20:22
  • @Holger the extra requirement is that it needs to only match cats that are inside divs. – Joeytje50 Jan 22 '14 at 20:40
  • 5
    When it comes to parsing any XML document (HTML or otherwise), regex is usually the wrong tool for the job. There is practically no way to write a regex that matches all possible arrangements of `cat`s and `<div>`s (for instance, @casimir-et-hippolyte's answer below will fail this test: `… cat …`). – asgallant Jan 22 '14 at 20:42
  • Regex + HTML = Bad Bad Bad. – JNYRanger Jan 22 '14 at 21:40
  • I can find the last cat, or the first one, but I need to find all the cats that sit inside DIVs. I suppose it may be possible with other language features, but I want a single query (one line); that is what interests me most. If it's not possible, sorry for my post – DopustimVladimir Jan 22 '14 at 21:52
  • What about `…`? There are just too many corner cases a simple regex can’t handle. – Holger Jan 23 '14 at 08:06
  • @Holger, you're right, but that cat is "disabled", so it doesn't need to be found – DopustimVladimir Jan 23 '14 at 08:24
  • So you wanna *not* match “cat” inside comments? Then what about `…`? Your accepted answer will match that “cat” inside the comment while not matching “cat” outside the comment here: `… cat …`. As already said, simple regex for handling HTML/XML doesn’t work. – Holger Jan 23 '14 at 08:36

4 Answers

1

PHP pattern:

$pattern = '~(?><div\b[^>]*+>|\G(?<!^))(?>[^c<]++|\Bc|c(?!at\b)|<(?!/div>))*+\Kcat~';
preg_match_all($pattern, $subject, $matches);
print_r($matches);

Pattern details:

~                  # pattern delimiter
(?>                # atomic group: possible anchor
    <div\b[^>]*+>  # an opening div tag 
  |                # OR
    \G(?<!^)       # the position right after the previous match (but not the start of the string)
)
(?>                # atomic group: all content between tags that is not "cat"
    [^c<]++        # all characters except "c" or "<"
  |                # OR
    \Bc            # "c" not preceded by a word boundary
  |                # OR
    c(?!at\b)      # "c" not followed by "at" and a word boundary
  |                # OR
    <(?!/div>)     # "<" not followed by "/div>"
)*+                # repeat the group zero or more times
\K                 # discard everything matched so far from the reported match
cat                # literal: cat
~

Using the DOM:

$dom = new DOMDocument();
@$dom->loadHTML($yourHtml);
$divs = $dom->getElementsByTagName('div');
foreach($divs as $div) {
    preg_match_all('~\bcat\b~', $div->textContent, $matches);
    print_r($matches);
}
Casimir et Hippolyte
1

This works in Sublime Text:

(?s)(cat)(?=[^>]*?</div>)
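
For anyone who wants to try the same pattern outside the editor, here is a minimal Python sketch. It assumes Python's `re` module handles this pattern the same way Sublime Text's engine does for this input; the names `CAT_BEFORE_DIV_CLOSE` and `sample` are only illustrative, and `sample` is the text from the question. The lookahead requires `</div>` to appear before any `>`, so a cat outside a div is rejected because it reaches the `>` of the next opening tag first.

```python
import re

# The pattern from this answer, unchanged.
CAT_BEFORE_DIV_CLOSE = re.compile(r"(?s)(cat)(?=[^>]*?</div>)")

# The sample text from the question.
sample = (
    "cat\n"
    "<div>let's try to find this cat and this cat</div>\n"
    "cat\n"
    "<div>let's try to find this cat and this cat</div>\n"
    "cat\n"
)

# Start offset of every "cat" that is followed by "</div>" with no ">" in between.
offsets = [m.start() for m in CAT_BEFORE_DIV_CLOSE.finditer(sample)]

print(len(offsets))  # 4 matches, all of them inside the divs
```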

revo
0

Considering you didn't specify which language this needs to be in, I'm going to use JavaScript for this solution.

You could do it with a simple trick that first removes everything outside the divs:

var string = "<div>let's try to find this cat and this cat</div>\n<div>let's try to find this cat and this cat</div>\nanother cat";
var str = string.replace(/(^|<\/div>)[\w\W]*?(<div>|$)/g,''); //filters out anything outside divs
console.log(str.match(/cat/g)); // ["cat", "cat", "cat", "cat"]

In a single line, this would be:

console.log("<div>let's try to find this cat and this cat</div>\n<div>let's try to find this cat and this cat</div>\nanother cat".replace(/(^|<\/div>)[\w\W]*?(<div>|$)/g,'').match(/cat/g)); // ["cat", "cat", "cat", "cat"]
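
Since the question also mentions Python, here is a rough port of the same strip-then-match trick. It is a sketch under the same assumption as the JavaScript version (only plain `<div>` tags without attributes and without nesting); the function name `cats_inside_plain_divs` is made up for this example.

```python
import re
from typing import List

# Same regex as the JavaScript above; [\w\W] matches any character, newlines included.
OUTSIDE_DIVS = re.compile(r"(^|</div>)[\w\W]*?(<div>|$)")


def cats_inside_plain_divs(html: str) -> List[str]:
    """Remove everything outside <div>...</div>, then collect every "cat" that is left."""
    inside_only = OUTSIDE_DIVS.sub("", html)
    return re.findall(r"cat", inside_only)


text = (
    "<div>let's try to find this cat and this cat</div>\n"
    "<div>let's try to find this cat and this cat</div>\n"
    "another cat"
)
print(cats_inside_plain_divs(text))  # ['cat', 'cat', 'cat', 'cat']
```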

To make this work even when you need to match things such as:

<div class="foo"><div></div>cat</div>

I'd use the following:

var str = "<div>let's try to find this cat and this cat</div>\n<div>let's try to find this cat and this cat</div>\nanother cat\n<div class=\"foo\"><div></div>and a cat</div>";
var openCounter = 0;
var result = [];
for (var i=0;i<str.length;i++) {
    if (str.substr(i,4) == '<div') openCounter++;
    else if (str.substr(i,6) == '</div>') openCounter = Math.max(0,openCounter-1); //don't go lower than 0
    if (openCounter > 0 && str.substr(i,3) == 'cat') result.push([str.substr(i,3), i]);
}
console.log(JSON.stringify(result)); //[["cat",28],["cat",41],["cat",79],["cat",92],["cat",148]]

That also gets the index at which a cat was found in the string and stores it along with the cat in the result variable.

Joeytje50
0

This can't be done reliably using regular expressions (as others have mentioned).

The reason is that HTML can contain nested tags, but a plain regular expression can't "count" how many levels deep you are, so it is always possible to construct some HTML for which a given regular expression won't find all the cats.
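
To make that concrete, here is a small Python check of two single-regex approaches suggested in other answers on this page (the lookahead pattern and the strip-everything-outside pattern). The variable names are only illustrative, and the second input is a hypothetical nested case written for this sketch. In both inputs the cat is inside a div, yet neither approach finds it once a nested div is involved.

```python
import re

# Patterns taken from two other answers on this page.
LOOKAHEAD = re.compile(r"(?s)(cat)(?=[^>]*?</div>)")
OUTSIDE_DIVS = re.compile(r"(^|</div>)[\w\W]*?(<div>|$)")

# "cat" sits directly inside the outer div and is followed by a nested div.
nested_after = "<body><div>cat<div>another text</div></div></body>"
found_by_lookahead = LOOKAHEAD.findall(nested_after)
print(found_by_lookahead)  # [] because the ">" of the inner <div> blocks the lookahead

# "cat" sits inside the outer div and is preceded by a nested div.
nested_before = "<div><div>x</div>cat</div>"
stripped = OUTSIDE_DIVS.sub("", nested_before)
found_after_strip = re.findall(r"cat", stripped)
print(found_after_strip)  # [] because the first </div> is treated as the end of the div
```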

For parsing HTML you need to use a stack to keep track of how deep you are within the tags. In this Python example I'm using a list (self.tags) as a stack:

from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser
import re

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []  # stack of the tags that are currently open

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        # guard against a stray closing tag when the stack is empty
        if self.tags:
            self.tags.pop()

    def handle_data(self, data):
        if self.tags and self.tags[-1] == 'div':
            # now we are dealing with a single string.
            # use a regular expression to find all cats
            num = len(re.findall('cat', data))
            if num:
                print('found %d cats at %s' % (num, '.'.join(self.tags)))

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('''
cat
<div>let's try to find this cat and this cat</div>
cat
<div>let's try to find this cat and this cat</div>
cat
''')

# now try a trickier example
parser.feed('''<body><div>cat<div>another text</div></div></body>''')

Output:

found 2 cats at div
found 2 cats at div
found 1 cats at body.div

This also extends easily to matching only particular divs based on their class attribute (see the attrs argument to handle_starttag).
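
A minimal sketch of that extension, written for Python 3 and applied to the "cats"/"dogs" example from the question. The names `ClassCatCounter` and `count_cats` are made up for this illustration, and one design choice is an assumption: a piece of text is attributed to its nearest enclosing div, so only divs are kept on the stack.

```python
import re
from html.parser import HTMLParser


class ClassCatCounter(HTMLParser):
    """Counts "cat" in text whose nearest enclosing div carries the wanted class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.div_stack = []  # one bool per open div: does it carry wanted_class?
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # attrs is a list of (name, value) pairs; value is None for bare attributes
            classes = (dict(attrs).get("class") or "").split()
            self.div_stack.append(self.wanted_class in classes)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()

    def handle_data(self, data):
        if self.div_stack and self.div_stack[-1]:
            self.count += len(re.findall(r"cat", data))


def count_cats(html, wanted_class="cats"):
    parser = ClassCatCounter(wanted_class)
    parser.feed(html)
    parser.close()  # flush any text that is still buffered
    return parser.count


sample = '''cat
<div class="cats">black cat, white cat</div>
cat
<div class="dogs">black cat, white cat</div>
cat
'''
print(count_cats(sample))  # 2
```

Tracking only div elements sidesteps void tags such as `<br>`, which never get a closing tag and would otherwise stay on the stack.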

jdhildeb