
So, I know this sounds a bit odd, but basically here is my HTML example:

$400 + free shipping</title>
   <link>https://www.dealnews.com/Samsung-50-4-K-HDR-LED-Smart-TV-for-400-free-shipping/17336849.html?iref=rss-dealnews-editors-choice</link>
   <description>&lt;img src='http://c.dlnws.com/image/upload/f_auto,t_large,q_auto/content/vdiy8a75wg8v7bo92dhq'

I only want to capture the image URL for items that have a dollar sign somewhere before it, i.e. a $ amount first and then, further along, the URL. At the moment my regex is this:

img src='([^']+)'.*

This grabs EVERY img src. However, as I said, I only want the images that have a "$" sign before them. Essentially, I don't want any images that aren't related to a product on this HTML page.

AAM111
  • Read here first: https://stackoverflow.com/a/1732454/ – gahooa May 21 '18 at 03:59
  • What you are really looking for is called a **parser** (e.g. `lxml`, `Beautifulsoup`) in combination with **xpath** expressions. While it surely is possible to get the image URLs in question with regular expressions, that approach is prone to errors. – Jan May 21 '18 at 04:09
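
To illustrate the parser suggestion in the comment above, here is a minimal sketch. It swaps in Python's standard-library `xml.etree.ElementTree` rather than `lxml` or `Beautifulsoup`, and it assumes the snippet comes from an RSS feed whose entries are `<item>` elements with the price in `<title>`. The feed text, URLs, and the function name `priced_item_images` are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical feed modeled on the snippet in the question (assumed RSS layout).
rss = """<rss version="2.0"><channel>
<item>
  <title>Example TV for $400 + free shipping</title>
  <link>https://www.example.com/deal-1.html</link>
  <description>&lt;img src='http://example.com/img/tv.png'/&gt; A deal.</description>
</item>
<item>
  <title>Site news roundup</title>
  <link>https://www.example.com/news.html</link>
  <description>&lt;img src='http://example.com/img/banner.png'/&gt; News.</description>
</item>
</channel></rss>"""

# The XML parser decodes &lt; and &gt;, so the description text holds a literal <img ...>.
IMG_SRC = re.compile(r"<img src='([^']+)'")

def priced_item_images(xml_text):
    """Return the first image URL of every item whose title contains a '$'."""
    root = ET.fromstring(xml_text)
    urls = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        if "$" not in title:
            continue  # skip items without a price in the title
        description = item.findtext("description", default="")
        match = IMG_SRC.search(description)
        if match:
            urls.append(match.group(1))
    return urls

print(priced_item_images(rss))
```

Checking the title per item ties the "$" test to the same entry as the image, which a single regex over the whole page cannot guarantee.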

1 Answer


Looking at the HTML example you provided, it seems your product images are directly preceded by a <description> tag. It takes less processing power (and time) to match a non-capturing group directly before the desired URL than to look back all the way to a potential (but not guaranteed) $ sign. If you use the <description> tag exclusively for products, then this regular expression will suit your needs: (?:<description>&lt;img src=')([^']+)
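
As a rough sketch of how that expression behaves, here it is applied in Python. The sample text is hypothetical (modeled on the snippet in the question, with made-up URLs), and `product_image_urls` is just an illustrative helper name.

```python
import re

# Pattern from the answer: anchor on the escaped <img> that directly follows <description>.
PATTERN = re.compile(r"(?:<description>&lt;img src=')([^']+)")

# Hypothetical sample; the last line is an image that does not belong to a product.
sample = (
    "<title>Example TV for $400 + free shipping</title>\n"
    "<link>https://www.example.com/deal-1.html</link>\n"
    "<description>&lt;img src='http://example.com/img/tv.png'/&gt;</description>\n"
    "<img src='http://example.com/img/unrelated-banner.png'>\n"
)

def product_image_urls(text):
    """Return every URL captured by PATTERN, in document order."""
    return PATTERN.findall(text)

urls = product_image_urls(sample)
print(urls)
```

Only the image inside `<description>` is captured; the bare `<img>` on the last line is skipped because nothing matching the prefix comes before it.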

Other things to consider:

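  • Make sure to add the Global modifier if you need every match in your HTML code rather than just the first one. The Multiline modifier only changes how ^ and $ behave, so this pattern, which uses neither, does not need it.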
  • If the markup may appear both entity-encoded and as literal HTML, use alternation in your regex to accept either form. For example, to allow both < and &lt; before the img tag, use: (?:<description>(?:&lt;|<)img src=')([^']+) If you also allow for entity-encoded opening and closing brackets on the description tag itself, you end up with: (?:(?:&lt;|<)description(?:&gt;|>)(?:&lt;|<)img src=')([^']+)
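
The entity-tolerant variant can be sketched in Python as follows. Python has no Global modifier, so `findall` is used here to collect every match. The sample lines and URLs are hypothetical, and the pattern is built from two helper fragments only for readability; it drops the outer non-capturing group, which does not change what is matched.

```python
import re

# Either the literal bracket or its HTML entity.
ANGLE_OPEN = r"(?:&lt;|<)"
ANGLE_CLOSE = r"(?:&gt;|>)"

# <description> followed directly by <img src='...', each bracket in either form.
PATTERN = re.compile(
    ANGLE_OPEN + r"description" + ANGLE_CLOSE + ANGLE_OPEN + r"img src='([^']+)"
)

# Hypothetical inputs covering each combination, plus one that should not match.
samples = [
    "<description>&lt;img src='http://example.com/a.png'",
    "<description><img src='http://example.com/b.png'",
    "&lt;description&gt;&lt;img src='http://example.com/c.png'",
    "<p><img src='http://example.com/not-a-product.png'",
]

found = PATTERN.findall("\n".join(samples))
print(found)
```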
Nadav