I'm trying to find logos on websites.

XPath (//img[contains(@src,"logo")]/@src) works if the logo is inside an <img> tag, but some websites define their logo in a style rule instead:

<html>
   <head>
      <style>
         .someclass {
            background-image: url("/css/images/logo2.jpg");
            background-color: #cccccc;
         }
      </style>
   </head>
   
   <body>
      <h1>Hello World!</h1>
   </body>
</html>

So I'm trying to build a regex for such cases:

[\"\']([\a-zA-Z0-9-_]*logo[a-zA-Z0-9\-_]*\.(?:png|jpg|jpeg)).*?"

For example, this captures "/e/logo_adsada.jpg?size=400", but the match also includes the characters that follow it.

Here is the example:

https://regex101.com/r/rV3oP8/160

Do you know what is wrong?

Milano

2 Answers

I believe your issue is greediness. Many regex engines let you switch this behavior with a flag; on the website you linked, you can enable the "Ungreedy" flag.

Quoting another question, which in turn quotes "Regular expression":

The standard quantifiers in regular expressions are greedy, meaning they match as much as they can, only giving back as necessary to match the remainder of the regex.

By using a lazy quantifier, the expression tries the minimal match first.
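To make the difference concrete, here is a minimal Python sketch. The sample CSS line and the variable names are made up for illustration; only the behavior of `.*` versus `.*?` is the point.

```python
import re

# Hypothetical one-line CSS sample containing two quoted strings.
css = 'a { background: url("/e/logo_adsada.jpg?size=400"); content: "x"; }'

# Greedy: .* grabs as much as it can, so the match runs to the LAST double quote.
greedy = re.search(r'"(.*)"', css).group(1)

# Lazy: .*? tries the shortest match first, so it stops at the FIRST closing quote.
lazy = re.search(r'"(.*?)"', css).group(1)

print(greedy)
print(lazy)
```

The greedy capture spills over into the second quoted string, while the lazy one ends at the first closing quote.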

likle

The regex below should help:

["']([\a-zA-Z0-9-_]*?logo[a-zA-Z0-9\-_]*?\.(?:png|jpg|jpeg)).*?['"]

Regex demo
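As a rough sketch of how such a pattern might be applied in Python: the version below is not the exact regex above but a variation that swaps the lazy quantifiers for negated character classes (`[^"']`), so the match cannot run past the closing quote. The function name `find_logo_urls` and the sample markup are assumptions made for illustration.

```python
import re

# Variation on the pattern above (an assumption, not the answer's exact regex):
#   ["']            opening quote
#   [^"']*logo      any non-quote characters, then the literal "logo"
#   [^"'?]*         rest of the file name, stopping before a query string
#   \.(?:png|jpe?g) the image extension
#   [^"']*["']      anything else up to the closing quote (e.g. ?size=400)
LOGO_RE = re.compile(r"""["']([^"']*logo[^"'?]*\.(?:png|jpe?g))[^"']*["']""")

def find_logo_urls(text):
    """Return every quoted path in text that contains 'logo' and ends in an image extension."""
    return LOGO_RE.findall(text)

# Hypothetical page fragment modeled on the example in the question.
sample = """
<style>
   .someclass {
      background-image: url("/css/images/logo2.jpg");
      background-color: #cccccc;
   }
   .other {
      background-image: url('/e/logo_adsada.jpg?size=400');
   }
</style>
"""

urls = find_logo_urls(sample)
print(urls)
```

Because the character classes exclude quotes, the capture group ends at the file extension and the query string is consumed outside the group. This is still a text-level heuristic; it does not parse CSS, so unusual quoting may need a different approach.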

Liju