
Given an HTML file, how can I find out whether any id value is repeated, using a regular expression? I need this for searching in Sublime Text.

For example, using id=("[^"]*").*id=\1 I can find duplicated ids on the same line:

<img id="key"><img id="key">

But what I need is to perform the same in multiple lines and with different pairs of keys. In this case for example key and key2 are repeated ids.

<img id="key">
<img id="key2">
<img id="key">
<img id="key3">
<img id="key2">
<img id="key">

Note: I'm using the img tag only as an example; the HTML file is more complex.

Andy Lester
fuxes
  • Perhaps you can send your HTML into a tool that validates HTML and will warn you of duplicated IDs. – Andy Lester Apr 16 '15 at 17:37
  • [You can't parse \[X\]HTML with regex](http://stackoverflow.com/a/1732454/1529630). You should use a DOM parser (not sure if sublimetext has that). – Oriol Apr 16 '15 at 17:39
  • Also remember these ids: id=abc id='abc' ID="abc" in your regex – mosh Feb 18 '18 at 04:43

4 Answers

2

For whatever reason, Sublime's . matcher doesn't include line breaks, so you'll need to do something like this: id=("[^"]+")(.|\n)*id=\1

Honestly though, I'd rather use Unix utilities:

grep -Eo 'id="[^"]+"' filename | sort | uniq -c

  3 id="key"
  2 id="key2"
  1 id="key3"

If these are complete HTML documents, you could use the W3C's HTML validator to catch duplicates along with other errors.
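If running a validator isn't convenient, a similar duplicate-id check can be sketched locally with a real HTML parser rather than a regex. The following is a minimal sketch in Python using the standard library's `html.parser`; the names `IdCollector` and `duplicate_ids` are made up for illustration, and it only looks at `id` attributes on start tags.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Counts every id attribute value seen on a start tag."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names, so ID= and iD= arrive as "id".
        for name, value in attrs:
            if name == "id" and value is not None:
                self.counts[value] += 1


def duplicate_ids(html):
    """Return {id: count} for id values that appear more than once."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return {value: n for value, n in collector.counts.items() if n > 1}


sample = """<img id="key">
<img id="key2">
<img id="key">
<img id="key3">
<img id="key2">
<img id="key">"""

print(duplicate_ids(sample))
```

Because the parser tokenizes the markup, ids that only appear inside HTML comments are not counted, and quoted, single-quoted and unquoted attribute values are all handled.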

fny
0

If all you're trying to do is find duplicated IDs, then here's a little Perl program I threw together that will do it:

use strict;
use warnings;

my %ids;
while ( <> ) {
    while ( /id="([^"]+)"/g ) {
        ++$ids{$1};
    }
}

while ( my ($id,$count) = each %ids ) {
    print "$id shows up $count times\n" if $count > 1;
}

Call it "dupes.pl". Then invoke it like this:

perl dupes.pl file.html

If I run it on your sample, it tells me:

key shows up 3 times
key2 shows up 2 times

It has some limitations (it won't find id=foo or id='foo', for example), but it will probably help you down the road.

Andy Lester
  • Quotes are optional in html5, also id can be ID, iD; so while ( /id="([^"]+)"/g ) { should be while ( /\bid=\S+/ig ) { and then remove (double/single) quotes if any from id. – mosh Feb 18 '18 at 04:40
0

Sublime Text's regex search does not appear to enable single-line (dot-all) mode by default, which means the . won't match line breaks. You can use a mode modifier to switch on single-line mode and make . match newlines:

(?s)id=("[^"]+").*id=\1

The (?s) is the single line mode modifier.

However, this regex does a poor job of finding all duplicate ids: because .* is greedy, a single match runs from the first id="key" to the last one in your sample HTML and swallows everything in between. You probably need a multi-step process to find all ids, which could be programmed. As others have shown, you'll need to (1) pull all the ids out first, then (2) group and count them to determine which are dupes.
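The two steps can be sketched in a few lines of Python. This is only an illustration, not part of any answer here: the names `ID_PATTERN` and `find_duplicate_ids` are hypothetical, and the pattern is a guess at a reasonable way to also cover the id=abc, id='abc' and ID="abc" forms mentioned in the comments. Being regex-based, it can still be fooled by ids that appear inside comments or scripts.

```python
import re
from collections import Counter

# Matches id="x", id='x' and unquoted id=x, case-insensitively (ID=, iD=, ...).
# The look-behind keeps attributes such as data-id= from being counted.
ID_PATTERN = re.compile(
    r"""(?<![\w-])id\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""",
    re.IGNORECASE,
)


def find_duplicate_ids(html):
    """Return {id: count} for every id value that occurs more than once."""
    # Step 1: pull every id value out of the text.
    ids = [a or b or c for a, b, c in ID_PATTERN.findall(html)]
    # Step 2: group and count, keeping only values seen more than once.
    return {value: n for value, n in Counter(ids).items() if n > 1}


sample = """<img id="key">
<img id="key2">
<img id="key">
<img id="key3">
<img id="key2">
<img id="key">"""

print(find_duplicate_ids(sample))  # {'key': 3, 'key2': 2}
```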

Alternatively, for a manual approach, change the pattern to use a look-ahead for duplicate ids, then step through the matches with Find Next in Sublime Text:

(?s)id=("[^"]+")(?=.*id=\1)

With the above pattern, and your sample HTML, you'll see the following matches highlighted:

<img id="key">  <-- highlighted (dupe found on 3rd line)
<img id="key2"> <-- highlighted (dupe found on 5th line)
<img id="key">  <-- highlighted (next dupe found on last line)
<img id="key3">
<img id="key2">
<img id="key">

Notice that the look-ahead doesn't highlight the actual dupes later in the file. It matches only the earlier occurrence, which tells you that a dupe exists somewhere further down.
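To check this behaviour outside the editor, the look-ahead pattern above can be tried in Python, whose `re` module supports the same constructs; `re.DOTALL` stands in for the (?s) modifier. This is a small sketch on the sample from the question, and the variable names are made up for illustration.

```python
import re

# Look-ahead pattern from above; re.DOTALL plays the role of (?s).
pattern = re.compile(r'id=("[^"]+")(?=.*id=\1)', re.DOTALL)

sample = """<img id="key">
<img id="key2">
<img id="key">
<img id="key3">
<img id="key2">
<img id="key">"""

# Line number (1-based) of each match: count the newlines before its start.
lines_with_later_dupe = [
    sample.count("\n", 0, m.start()) + 1 for m in pattern.finditer(sample)
]

print(lines_with_later_dupe)  # [1, 2, 3]
```

Lines 5 and 6 are not reported even though they hold duplicated ids, because nothing after them repeats the value.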

Ahmad Mageed
0

Here is an AWK script that looks for duplicated img id values:

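awk '{
        # Lowercase the second field so ID= and id= are treated alike
        # (this also lowercases the id value itself).
        $2 = tolower($2);
        # Strip the leading id= and the surrounding quotes and ">".
        sub(/^id=/, "", $2);
        gsub(/[">]/, "", $2);
        # Only count lines of the form <img id="...">.
        if (NF == 2)
            imgs[$2]++;
    }

    END {
        # Report only the ids that appear more than once.
        for (img in imgs)
            if (imgs[img] > 1)
                printf "Img ID: %s\t appears %d times\n", img, imgs[img]
    }' file.txt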
Eder