
I'm using an application to search a website that I don't control at the moment, and I was wondering whether there is a way to ignore duplicate matches using only regex.

Right now I use this pattern to match the image sources in the page's source code:

<span> <img id="imgProduct.*? src="/(.*?)" alt="

from this

<span> <img id="imgProduct_1" class="SmPrdImg selected"     
onclick="(some javascript);" src="the_src_I_want1.jpg" alt="woohee"> </span>
<span> <img id="imgProduct_2" class="SmPrdImg selected"     
onclick="(some javascript);" src="the_src_I_want2.jpg" alt="woohee"> </span>
<span> <img id="imgProduct_3" class="SmPrdImg selected"     
onclick="(some javascript);" src="the_src_I_want3.jpg" alt="woohee"> </span>

The only problem is that the exact same markup listed above is duplicated much further down in the source. Is there a way to ignore or delete the duplicates using only regex?

Travis Crum
  • Welcome to Stack Overflow! Please refrain from parsing HTML with RegEx as it will [drive you į̷̷͚̤̤̖̱̦͍͗̒̈̅̄̎n̨͖͓̹͍͎͔͈̝̲͐ͪ͛̃̄͛ṣ̷̵̞̦ͤ̅̉̋ͪ͑͛ͥ͜a̷̘͖̮͔͎͛̇̏̒͆̆͘n͇͔̤̼͙̩͖̭ͤ͋̉͌͟eͥ͒͆ͧͨ̽͞҉̹͍̳̻͢](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Use an [HTML parser](http://stackoverflow.com/questions/292926/robust-mature-html-parser-for-php) instead. – Madara's Ghost Aug 20 '12 at 21:01
  • @Truth: He's not actually parsing HTML, though, he just wants the `src` attribute. Regex can handle that much, since there's no need to do bracket balancing. – KRyan Aug 20 '12 at 21:01
  • Parsing HTML using regex has been covered extensively on SO. The consensus is that it should not be done. Related: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Abe Miessler Aug 20 '12 at 21:02
  • @AbeMiessler: See above re:Truth's comment. This is not the same as that. I love that page and agree with every word, but this question is not the same. – KRyan Aug 20 '12 at 21:03
  • Use XPath instead, it should be very easy to extract the set of nodes you want without explicitly parsing the document. I'd have to know a bit more about the structure and the exact nodes you want to keep to provide a query. – toniedzwiedz Aug 20 '12 at 21:05
  • @DragoonWraith: I disagree. If his source changes (more spaces, a change of quotes, anything like that), the regex breaks, while a parser does not. So no, your comment has little merit. – Madara's Ghost Aug 20 '12 at 21:11
  • @Truth: You are correct that regex is a bad idea. You are incorrect in including that link, because that link does not address *why* regex is a bad idea *in this case.* Moreover, a more robust pattern that doesn't break with minor source changes is quite possible; see my answer. – KRyan Aug 20 '12 at 21:12
  • Well, let's give both options (I'm with DragoonWraith: whilst DOM is a more elegant approach, if you know exactly what you need to match then why bother using DOM!). Use //img[@src] to get all images with a source attribute. I don't, however, think you'll be able to ensure they are unique, so you'll need a dedupe step added. – williamvicary Aug 20 '12 at 21:36
  • @truth the only bad thing is I don't actually have access to the site till the handover to our company is complete so the source won't change until we change it ^_^. Using this search app that uses regex is the only option we have right now unless you know something I don't (which you probably do) – Travis Crum Aug 21 '12 at 13:49
  • @williamvicary your way would work great if I only had access to the website in question. The webmaster on their side is being a pain with getting it handed over. I wish I could work with the DOM but I can't. I do know exactly where to find it! just need to get rid of these duplicate matches to make the gather go more smoothly X-P – Travis Crum Aug 21 '12 at 13:56

2 Answers


Your pattern's not very good; it's way too specific to your exact source code as it currently exists. As @Truth commented, if that changes, you'll break your pattern. I'd recommend something more like this:

<img[^>]*src=['"]([^'"]*)['"]

That will match the contents of any quoted src attribute inside any <img> tag, even if the surrounding markup (ids, classes, attribute order, spacing) changes.
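
As a rough illustration of what that pattern returns, here is a minimal sketch in JavaScript. The language is an assumption (the question never says which engine the search app uses), and the sample markup and the variable names are made up for the example. Note that the repeated tag shows up twice in the result, which is the duplicate problem from the question.

```javascript
// Sample markup modeled on the question; the third tag repeats the first src.
const html = `
<span> <img id="imgProduct_1" class="SmPrdImg selected"
onclick="(some javascript);" src="the_src_I_want1.jpg" alt="woohee"> </span>
<span> <img id="imgProduct_2" class="SmPrdImg selected"
onclick="(some javascript);" src="the_src_I_want2.jpg" alt="woohee"> </span>
<span> <img id="imgProduct_1" class="SmPrdImg selected"
onclick="(some javascript);" src="the_src_I_want1.jpg" alt="woohee"> </span>
`;

// The "g" flag makes matchAll walk every <img> tag in the input.
const srcPattern = /<img[^>]*src=['"]([^'"]*)['"]/g;

// Capture group 1 holds the text between the quotes.
const allSrcs = Array.from(html.matchAll(srcPattern), (m) => m[1]);

console.log(allSrcs);
```

Because `[^>]` also matches line breaks, the tag can be split across lines (as it is in the question) without changing the pattern.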

To prevent duplicates with regex, you'll need lookahead, and this is likely to be very slow. I do not recommend using regex for this; this is just to show that you could, if you had to. The pattern you would need is something like this (I tested it using Notepad++'s regex search, which uses a Perl-style syntax and is more capable than JavaScript's, but I'm reasonably sure that JavaScript's regex engine can handle it):

<img[^>]*src=['"]([^'"]*)['"](?!(?:.|\s)*<img[^>]*src=['"]\1['"])

You'll then get a match for the last instance of every src.
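
To see the "last instance only" behaviour in action, here is a small sketch, again assuming a JavaScript engine. The four-tag input and the names `lastOnly` and `uniqueSrcs` are invented for the example, and the input is kept deliberately tiny because this pattern backtracks heavily on larger documents.

```javascript
// Two distinct images, each appearing twice (made-up sample data).
const html = [
  '<img id="imgProduct_1" src="a.jpg" alt="x">',
  '<img id="imgProduct_2" src="b.jpg" alt="x">',
  '<img id="imgProduct_1" src="a.jpg" alt="x">',
  '<img id="imgProduct_2" src="b.jpg" alt="x">',
].join("\n");

// Same pattern as above: match an <img> src only when no later <img>
// in the input has the same src (\1 refers back to capture group 1).
const lastOnly =
  /<img[^>]*src=['"]([^'"]*)['"](?!(?:.|\s)*<img[^>]*src=['"]\1['"])/g;

const uniqueSrcs = Array.from(html.matchAll(lastOnly), (m) => m[1]);

console.log(uniqueSrcs);
// Each match also records where it was found; here both fall in the
// second half of the input, i.e. on the repeated tags.
console.log(Array.from(html.matchAll(lastOnly), (m) => m.index));
```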

The Breakdown

For illustration, here's how the pattern works:

<img[^>]*src=['"]([^'"]*)['"]

This makes sure that we are inside a <img> tag when src comes up, and then makes sure we match only what is inside the quotes (which can be either single or double quotes; this assumes the src value itself contains neither quote character, as is the case for the filenames here, so we don't have to worry about mixing quote types or escaped quotes).

(?!
    (?:
        .
    |
        \s
    )*
    <img[^>]*src=['"]\1['"]
)

The (?! starts a negative lookahead: we are requiring that the following pattern cannot be matched after this point.

Then (?:.|\s)* matches any character or any whitespace. This is because JavaScript's . will not match a newline, while \s will. Mostly, I was lazy and didn't want to write out a pattern for any possible line ending, so I just used \s. The *, of course, means we can have any number of these. That means that the following (still part of the negative lookahead) cannot be found anywhere in the rest of the file. The (?: instead of ( means that this parenthetical isn't going to be remembered for backreferences.
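
The newline behaviour described above can be checked with a short sketch (JavaScript assumed; the sample string is made up). It also shows `[\s\S]`, a commonly used single character class with the same "any character at all" meaning. Since `.` and `\s` both match an ordinary space, the alternation gives the engine two ways to consume each one, which is probably part of why the full pattern is slow; `[\s\S]` avoids that overlap.

```javascript
const text = "first line\nsecond line";

// "." stops at the line break, so this only reaches the end of line one.
const dotOnly = text.match(/.*/)[0];

// Adding \s as an alternative lets the repetition cross the "\n".
const dotOrSpace = text.match(/(?:.|\s)*/)[0];

// [\s\S] means "whitespace or non-whitespace", i.e. any character.
const anyChar = text.match(/[\s\S]*/)[0];

console.log(JSON.stringify(dotOnly));
console.log(dotOrSpace === text, anyChar === text);
```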

That bit is <img[^>]*src=['"]\1['"]. This is very similar to the initial pattern, but instead of capturing the src with ([^'"]*), we're referencing the previously-captured src with \1.

Thus the pattern is saying "match any src in an img that does not have any img with the same src anywhere in the rest of the file," which means you only get the last instance of each src and no duplicates.

If you want to remove all instances of any img whose src appears more than once, I think you're out of luck, by the way. That would need a lookbehind to check the text before each match; JavaScript did not support lookbehind at the time of writing, and the overwhelming majority of regex engines that do wouldn't allow such a complicated lookbehind anyway.
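
If the matches can be post-processed in code rather than inside the regex itself, that case is easy to handle by counting. This is only a sketch under that assumption (it does not help with a regex-only search tool); JavaScript, the sample input, and the names `counts` and `seenOnce` are all choices made for the example.

```javascript
// Made-up sample: a.jpg appears twice, b.jpg and c.jpg once each.
const html = [
  '<img src="a.jpg" alt="x">',
  '<img src="b.jpg" alt="x">',
  '<img src="a.jpg" alt="x">',
  '<img src="c.jpg" alt="x">',
].join("\n");

const srcPattern = /<img[^>]*src=['"]([^'"]*)['"]/g;
const srcs = Array.from(html.matchAll(srcPattern), (m) => m[1]);

// Tally how many times each src occurs.
const counts = new Map();
for (const src of srcs) {
  counts.set(src, (counts.get(src) || 0) + 1);
}

// Keep only the srcs that were seen exactly once.
const seenOnce = srcs.filter((src) => counts.get(src) === 1);

console.log(seenOnce);
```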

KRyan
  • WOW this is an amazing answer!! If I could up-vote your answer I would but sadly I'm still a noob. Also, thank you very much for explaining it the way you did since I'm extremely newb with using regex. And sadly I will have to use a look ahead since we don't have access to the site yet :( (handover issues) I have to use regex. – Travis Crum Aug 21 '12 at 13:16
  • also I forgot to ask this, my bad, but from what I understand of this code, it will only give up a match if there is a duplicate, or am I misunderstanding that? – Travis Crum Aug 21 '12 at 13:29
  • @TravisDtfsuCrum: No, it matches every case where the same `src` is not found anywhere later in the document, i.e. the last one. If there's only one, then that one is the last one. – KRyan Aug 21 '12 at 15:54
  • @TravisDtfsuCrum: Also, you can upvote; just click the up-arrow. If this worked for you, you can also click the checkmark to mark it as the accepted answer. – KRyan Aug 21 '12 at 15:54
  • oh yeah thanks @DragoonWraith ! I can't upvote yet because I only have a reputation of 6 and you have to have at least 15... you could help by up voting my question :) – Travis Crum Aug 21 '12 at 18:55
  • @TravisDtfsuCrum: Yeah, sure, why not. The only reason I hadn't to begin with is because I didn't think regex was the right approach to the problem, but you've explained that you don't have a choice, so whatever. – KRyan Aug 21 '12 at 19:05
  • THANK YOU! YEAH I CAN UPVOTE NOW! *giggles* – Travis Crum Aug 21 '12 at 19:48

I wouldn't work too hard to make them unique in the regex. Just do that in the PHP that follows the preg match, using array_unique:

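// "s" lets "." span newlines; "i" makes the match case-insensitive.
$pattern = '~<span> <img id="imgProduct.*? src="/(.*?)" alt="~is';
$count   = preg_match_all($pattern, $html, $matches);

// $matches[1] holds the captured src values; array_unique drops the repeats.
$sources = $count ? array_unique($matches[1]) : array();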

If you are using JavaScript, you'd need another function in place of array_unique. Check PHPJS: http://phpjs.org/functions/array_unique:346
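
For the JavaScript case, one option that needs no extra library is the built-in `Set`, which keeps the first occurrence of each value in insertion order. A minimal sketch, assuming an engine that supports `Set` and `matchAll`; it uses the more general `<img ... src=...>` pattern from the other answer, and the sample markup and variable names are made up:

```javascript
// Made-up sample markup with one repeated src.
const html = [
  '<img id="imgProduct_1" src="the_src_I_want1.jpg" alt="woohee">',
  '<img id="imgProduct_2" src="the_src_I_want2.jpg" alt="woohee">',
  '<img id="imgProduct_1" src="the_src_I_want1.jpg" alt="woohee">',
].join("\n");

// Collect every src first, duplicates included.
const srcPattern = /<img[^>]*src=['"]([^'"]*)['"]/g;
const srcs = Array.from(html.matchAll(srcPattern), (m) => m[1]);

// A Set stores each value once; spreading it gives back a plain array.
const uniqueSrcs = [...new Set(srcs)];

console.log(uniqueSrcs);
```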

williamvicary
  • This is a massively better approach than a pure-regex solution. I do recommend a more robust pattern for the initial matching, such as `<img[^>]*src=['"]([^'"]*)['"]`. Also, there's no indication that he's actually using PHP here; I think JavaScript is what's going on. Still, this is the *idea* he wants. – KRyan Aug 20 '12 at 21:27
  • Totally agree, the pattern needed some love, but if he's matching data that is strict then there isn't much need to make it better; if it matches then it matches. But you're right, it's not a pattern that'll work consistently across multiple websites. – williamvicary Aug 20 '12 at 21:32
  • I think I may have ninja-edited you: there's also a concern that I don't see any indication that he's using PHP. Does JavaScript have a function similar to `array_unique` here? – KRyan Aug 20 '12 at 21:48
  • @DragoonWraith you're right about it being very specific. I only need it to run on one website, which is why I made it so specific. Thanks again for your input! I'm going to try out your regex code right now. – Travis Crum Aug 21 '12 at 13:36
  • Thank you all for your input. The only problem is I don't have access to the website at this moment because of handover issues, but we need to gather info from it ASAP, so we're using a browser-based app that uses regex to search and gather that info. Should have been more specific. @williamvicary your idea is great, but I can't use PHP, only regex searches. – Travis Crum Aug 21 '12 at 13:36