How to find all cats with a regular expression

Question

How can I find all "cat"s with a regular expression?

"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems!" (c) Jamie Zawinski

Please help me find all "cat"s inside divs with a single query :)

cat
<div>let's try to find this cat and this cat</div>
cat
<div>let's try to find this cat and this cat</div>
cat

I tried this, but it doesn't work:

(?<=<div>)((?!<\/div>)(cat|(?:.|\n))+)(?=<\/div>)

I ran into this problem while using Sublime Text, where we can run only one query. Is it possible? If you can answer using any programming language (Python, PHP, JavaScript), I'll be glad too. Thank you!

I can find the last cat, or the first one, but I need to find all the cats that sit inside DIVs. I suppose it may be possible with other language features, but I want only one query (one line); that's what interests me most. If it's not possible, sorry for my post :)

Thanks to @revo! Very nice variant that works in Sublime Text. Let me add a second question on this topic: can we do it for divs with class "cats", but not for divs with class "dogs"?

cat
<div class="cats">black cat, white cat</div>
cat
<div class="dogs">black cat, white cat</div>
cat

[My favourite SO answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — mplungjan, Jan 22 '14 at 19:29
The simplest regex to find occurrences of “cat” is ……… `cat`. Unless you specify additional requirements, there’s no reason to do it more complicated. — Holger, Jan 22 '14 at 19:53
Holger, the problem happened when I used the code editor "Sublime Text" Thank you for your answer — DopustimVladimir, Jan 22 '14 at 20:22
@Holger the extra requirement is that it needs to only match cats that are inside divs. — Joeytje50, Jan 22 '14 at 20:40
When it comes to parsing any XML document (HTML or otherwise), regex is usually the wrong tool for the job. There is practically no way to write a regex that matches all possible arrangements of `cat`s and `<div>`s (for instance, @casimir-et-hippolyte's answer below will fail this test: `
cat
`). — asgallant, Jan 22 '14 at 20:42
I can find the last cat, or the first one, but I need to find all the cats that sit inside DIVs. I suppose it may be possible with other language features, but I want only one query (one line); that's what interests me most. If it's not possible, sorry for my post — DopustimVladimir, Jan 22 '14 at 21:52
What about `
`? There are just too many corner cases a simple regex can’t handle. — Holger, Jan 23 '14 at 08:06
@Holger, you're right, but this cat is "disabled"; it doesn't necessarily need to be found — DopustimVladimir, Jan 23 '14 at 08:24
So you wanna *not* match “cat” inside comments? Then what about ``? Your accepted answer will match that “cat” inside the comment while not matching “cat” outside the comment here: `
cat
`. As already said, simple regex for handling Html/Xml doesn’t work. — Holger, Jan 23 '14 at 08:36

Casimir et Hippolyte · Answer 1 · 2014-01-22T20:10:44.873

PHP pattern:

$pattern = '~(?><div\b[^>]*+>|\G(?<!^))(?>[^c<]++|\Bc|c(?!at\b)|<(?!/div>))*+\Kcat~';
preg_match_all($pattern, $subject, $matches);
print_r($matches);

Pattern details:

~                  # pattern delimiter
(?>                # atomic group: possible anchor
    <div\b[^>]*+>  # an opening div tag 
  |                # OR
    \G(?<!^)       # the position right after the previous match (but not the start of the string)
)
(?>                # atomic group: all content between tags that is not "cat"
    [^c<]++        # all characters except "c" or "<"
  |                # OR
    \Bc            # "c" not preceded by a word boundary
  |                # OR
    c(?!at\b)      # "c" not followed by "at" and a word boundary
  |                # OR
    <(?!/div>)     # "<" not followed by "/div>"
)*+                # repeat the group zero or more times
\K                 # discard everything matched so far from the match result
cat                # literal: cat
~

Using the DOM:

$dom = new DOMDocument();
@$dom->loadHTML($yourHtml);
$divs = $dom->getElementsByTagName('div');
foreach($divs as $div) {
    preg_match_all('~\bcat\b~', $div->textContent, $matches);
    print_r($matches);
}

Thank you for your pattern! But I get nothing: Array ( [0] => Array ( ) ) — DopustimVladimir, Jan 22 '14 at 19:57

revo · Accepted Answer · 2014-01-22T21:49:59.603

This works on Sublime Text:

(?s)(cat)(?=[^>]*?</div>)
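
For readers who want to try this pattern outside Sublime Text, here is a minimal sketch using Python's `re` module. The helper name `find_cats` and the sample strings are assumptions made for illustration, not part of the answer. The lookahead accepts a `cat` only when no `>` occurs between it and the next `</div>`, which is also why it misses a `cat` that is followed by a nested tag.

```python
import re

# Same pattern as above: "cat" followed by text containing no ">" up to "</div>".
PATTERN = re.compile(r"(?s)(cat)(?=[^>]*?</div>)")

def find_cats(text):
    """Return the start index of every 'cat' the pattern accepts (hypothetical helper)."""
    return [m.start() for m in PATTERN.finditer(text)]

# Two cats inside the div, one before it and one after it.
sample = "cat\n<div>let's try to find this cat and this cat</div>\ncat\n"
print(find_cats(sample))   # [32, 45]

# A nested tag after the cat puts a ">" in the way, so nothing is found.
nested = "<div>cat<div>another text</div></div>"
print(find_cats(nested))   # []
```

The second call shows the kind of input on which a lookahead-only approach breaks down.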

edited Jan 22 '14 at 21:49

answered Jan 22 '14 at 21:27

let's try to find this cat and this cat hello, i'm a cat
let's try to find this cat and this cat
– DopustimVladimir Jan 22 '14 at 21:31
Yes, thank you very much! This is now the most useful answer for me, and it really works – DopustimVladimir Jan 22 '14 at 22:00
Then let me ask another question! Can we do it for divs with some class, but not for other divs? cat
cat
cat
cat
cat – DopustimVladimir Jan 22 '14 at 22:09
This regex fails for `
cat
another text
`. Did someone mention that parsing recursive structures like Html/XML with regex doesn’t work? – Holger Jan 23 '14 at 08:30

Joeytje50 · Answer 3 · 2014-01-22T20:58:53.547

Considering you didn't specify which language this needs to be in, I'm going to use JavaScript for this solution.

You could do it with a simple trick that first removes everything outside the divs:

var string = "<div>let's try to find this cat and this cat</div>\n<div>let's try to find this cat and this cat</div>\nanother cat";
var str = string.replace(/(^|<\/div>)[\w\W]*?(<div>|$)/g,''); //filters out anything outside divs
console.log(str.match(/cat/g)); // ["cat", "cat", "cat", "cat"]

In a single line, this would be:

console.log("<div>let's try to find this cat and this cat</div>\n<div>let's try to find this cat and this cat</div>\nanother cat".replace(/(^|<\/div>)[\w\W]*?(<div>|$)/g,'').match(/cat/g)); // ["cat", "cat", "cat", "cat"]

To make this work even when you need to match things such as:

<div class="foo"><div></div>cat</div>

I'd use the following:

var str = "<div>let's try to find this cat and this cat</div>\n<div>let's try to find this cat and this cat</div>\nanother cat\n<div class=\"foo\"><div></div>and a cat</div>";
var openCounter = 0;
var result = [];
for (var i=0;i<str.length;i++) {
    if (str.substr(i,4) == '<div') openCounter++;
    else if (str.substr(i,6) == '</div>') openCounter = Math.max(0,openCounter-1); //don't go lower than 0
    if (openCounter > 0 && str.substr(i,3) == 'cat') result.push([str.substr(i,3), i]);
}
console.log(JSON.stringify(result)); //[["cat",28],["cat",41],["cat",79],["cat",92],["cat",148]]

That also gets the index at which a cat was found in the string and stores it along with the cat in the result variable.
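
The same counting idea can be sketched in Python, letting one small regex tokenize the string while a counter tracks the nesting depth. This is only an illustration: the names `TOKEN` and `cats_inside_divs` are made up here, and like the loop above it treats anything starting with `<div` as an opening tag and would also count a `cat` that appears inside an attribute value.

```python
import re

# One token per opening div, closing div, or cat; everything else is skipped.
TOKEN = re.compile(r"<div\b|</div>|cat")

def cats_inside_divs(text):
    """Return the index of every 'cat' seen while at least one div is open."""
    depth = 0
    hits = []
    for m in TOKEN.finditer(text):
        token = m.group()
        if token == "<div":
            depth += 1
        elif token == "</div>":
            depth = max(0, depth - 1)  # don't go lower than 0
        elif depth > 0:                # token is "cat" and we are inside a div
            hits.append(m.start())
    return hits

sample = ("<div>let's try to find this cat and this cat</div>\n"
          "<div>let's try to find this cat and this cat</div>\n"
          "another cat\n"
          '<div class="foo"><div></div>and a cat</div>')
print(cats_inside_divs(sample))  # [28, 41, 79, 92, 148]
```

The regex here only splits the input into tokens; the nesting is handled by the counter, not by the pattern.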

edited Jan 22 '14 at 20:58

answered Jan 22 '14 at 20:39

Thank you! Can i do this with a single query? – DopustimVladimir Jan 22 '14 at 20:45
@DopustimVladimir is this what you were looking for? – Joeytje50 Jan 22 '14 at 20:47
This fails the test `
cat
` as well. – asgallant Jan 22 '14 at 20:48
joeytje50, thank you again! It's not quite what I wanted, but it's a cool variant. Let me edit my post – DopustimVladimir Jan 22 '14 at 20:58
@asgallant, yes, it returns null – DopustimVladimir Jan 22 '14 at 21:01
@asgallant this new code does work for that extra requirement. – Joeytje50 Jan 22 '14 at 21:03
@DopustimVladimir if you run this new code, it will grab basically any cat in a box you'll throw at it. – Joeytje50 Jan 22 '14 at 21:03
The point was regex is the wrong tool for the job >;o) A DOM parser (like the one jQuery uses) can do the job even faster with less code. Ex: http://jsfiddle.net/asgallant/k98EF/ – asgallant Jan 22 '14 at 21:28

score 0 · Answer 4 · answered Jan 26 '14 at 05:06

This can't be done reliably using regular expressions (as others have mentioned).

The reason is that HTML can contain nested tags, but regular expressions can't "count" how many levels deep you are, so it is always possible to construct some HTML for which your regular expression won't find all the cats.

For parsing HTML you need to use a STACK to keep track of how deep you are within the tags. In this Python example I'm using a list (self.tags) as a stack:

from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser
import re

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []  # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        # guard against a stray closing tag when the stack is already empty
        if self.tags:
            self.tags.pop()

    def handle_data(self, data):
        if self.tags and self.tags[-1] == 'div':
            # now we are dealing with a single string.
            # use a regular expression to find all cats
            num = len(re.findall('cat', data))
            if num:
                print('found %d cats at %s' % (num, '.'.join(self.tags)))

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('''
cat
<div>let's try to find this cat and this cat</div>
cat
<div>let's try to find this cat and this cat</div>
cat
''')

# now try a trickier example
parser.feed('''<body><div>cat<div>another text</div></div></body>''')

Output:

found 2 cats at div
found 2 cats at div
found 1 cats at body.div

This will also easily extend to matching only particular divs based on the class attribute. (see the attrs argument to handle_starttag).
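
As a rough sketch of that extension, assuming Python 3's `html.parser`: the class name `CatFinder`, its `wanted_class` parameter, and the `found` list are hypothetical names chosen for illustration. Each stack entry remembers whether the element is a div carrying the wanted class, and text is searched only while at least one such div is open.

```python
from html.parser import HTMLParser
import re

class CatFinder(HTMLParser):
    """Collect 'cat' matches found inside divs that carry a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.stack = []   # one (tag, is_wanted_div) pair per open element
        self.found = []   # every 'cat' matched so far

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or have no value, hence "or ''".
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((tag, tag == "div" and self.wanted_class in classes))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closing tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if any(wanted for _, wanted in self.stack):
            self.found.extend(re.findall(r"\bcat\b", data))

finder = CatFinder("cats")
finder.feed('cat\n<div class="cats">black cat, white cat</div>\n'
            'cat\n<div class="dogs">black cat, white cat</div>\ncat\n')
finder.close()            # flush any buffered trailing text
print(len(finder.found))  # 2
```

Using `any(...)` over the whole stack means a cat nested deeper inside a wanted div (for example inside a `<b>`) still counts; checking only the top of the stack would restrict matches to text directly inside the div.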
