
I am trying to return a heap of image URLs and want my findall pattern to match every character, including newlines. However, when I use the DOTALL flag and put .* in my regex, I go from having plenty of results to only one. If anything, using .* in the regex should give more results, not fewer, because I am saying 'I will allow zero or more of any character here'.

The code below is without the .* in the pattern. It can be run in IDLE or another Python editor, and it returns a heap of image URLs.

from urllib import urlopen
from re import findall
import re

dennisov_url = 'https://denissov.ru/en/'
dennisov_html = urlopen(dennisov_url).read()

watch_image_urls = findall('<img src="([^"]*)', dennisov_html, flags=re.DOTALL)
print watch_image_urls

The code below is WITH the .* between 'img' and 'src', which should have no real effect, yet it only returns one URL this time.

from urllib import urlopen
from re import findall
import re

dennisov_url = 'https://denissov.ru/en/'
dennisov_html = urlopen(dennisov_url).read()

watch_image_urls = findall('<img.* src="([^"]*)', dennisov_html, flags=re.DOTALL)
print watch_image_urls
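
The difference between the two patterns can be reproduced without fetching the page. The following is a minimal sketch, assuming a made-up HTML string (`sample_html` and its file names are hypothetical, not taken from the real site). It compares the greedy `.*` above with the lazy `.*?` and the `[^>]*` variants suggested in the comments.

```python
import re

# Made-up HTML standing in for the real page (an assumption for illustration).
sample_html = '''<div>
<img class="a" src="one.jpg">
<img class="b" src="two.jpg">
<img class="c" src="three.jpg">
</div>'''

# Greedy .* with DOTALL: after the first "<img", .* runs to the end of the
# text and backtracks to the LAST ' src="', so the whole page is one match.
greedy = re.findall(r'<img.* src="([^"]*)', sample_html, flags=re.DOTALL)

# Lazy .*? stops at the NEAREST ' src="', so each tag matches separately.
lazy = re.findall(r'<img.*? src="([^"]*)', sample_html, flags=re.DOTALL)

# [^>]* cannot cross the closing ">" of the tag, so no DOTALL is needed.
in_tag = re.findall(r'<img[^>]* src="([^"]*)', sample_html)

print(greedy)
print(lazy)
print(in_tag)
```

On this sample, the greedy pattern returns a single URL (the last one), while the other two return all three.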

Can someone tell me why it is doing this and how I fix it?

EDIT: The above is just example code that I simplified to explain my situation. The code below is my actual code, with comments detailing what I want it to do. If you open the main URL and inspect the source, you will see that there are many images between div class="grid" and div class="orderplacebut".

from urllib import urlopen
from re import findall
import re

dennisov_url = 'https://denissov.ru/en/'
dennisov_html = urlopen(dennisov_url).read()

# Print all images between div class="grid" and div class="orderplacebut"
# Because the regex spans over several lines, use DOTALL flag to include
# every character between, including new lines
watch_image_urls = findall('<div class="grid"*?<img src="([^"]*)*?<div class="orderplacebut"', dennisov_html, flags=re.DOTALL)
print watch_image_urls

EDIT: This is the response I got from my professor: "Apart from this, the strategy you're using to try to match the image URLs can't succeed because the "grid" div class appears just once on the web page (at least when I view it in Firefox) and you seem to be trying to (a) match the start of that class and then (b) get all the images appearing inside it. This is a very hard (I suspect impossible!) thing to do with a single regular expression because you're "anchoring" the pattern to the start of the class, which prevents you from separately matching each of the figures inside it. Because the start of your pattern above only appears once in the web page, only one pattern can ever be returned! (The problem is not to do with "greedy" matching.) Instead you want to match just the figure URLs and as little of the surrounding HTML as possible. For instance, it's easy to get all the patterns that end with ".jpg"."

So as you can see, this question was not about greedy vs non-greedy regex, and was not the same as the duplicate question that was flagged.
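
The anchoring point in the professor's response can be illustrated with a minimal sketch, again assuming a made-up HTML string (only the two class names come from the question; everything else is hypothetical). It shows the anchored pattern returning one result, the ".jpg" idea from the response, and a two-step alternative (isolate the section first, then search inside it). The two-step approach is my own assumption of one possible workaround, not something the professor proposed.

```python
import re

# Made-up HTML (an assumption); only the class names come from the question.
sample_html = '''<img src="logo.png">
<div class="grid">
<img src="watch1.jpg">
<img src="watch2.jpg">
</div>
<div class="orderplacebut">
<img src="button.png">
</div>'''

# Anchored pattern: 'class="grid"' occurs once, so at most one match exists.
anchored = re.findall(r'<div class="grid".*?<img src="([^"]*)',
                      sample_html, flags=re.DOTALL)

# The idea from the response: match only the URLs that end with ".jpg".
jpg_urls = re.findall(r'src="([^"]*\.jpg)"', sample_html)

# Two-step alternative: cut out the section, then search inside it.
section = re.search(r'<div class="grid"(.*?)<div class="orderplacebut"',
                    sample_html, flags=re.DOTALL)
section_images = (re.findall(r'<img src="([^"]*)', section.group(1))
                  if section else [])

print(anchored)
print(jpg_urls)
print(section_images)
```

On this sample, the anchored pattern finds only the first watch image, while the other two approaches find both and skip the images outside the section.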

user88720
  • This question has been marked as a duplicate but the duplicate reference is completely different – user88720 May 18 '17 at 15:43
  • It is a dupe. You just need to replace `.*` with either `.*?` or better with `[^>]*?`. – Wiktor Stribiżew May 18 '17 at 16:02
  • Just stating that it's not a duplicate doesn't help anything; [edit] to *explain why*. – jonrsharpe May 18 '17 at 21:22
  • Yes, it just so happens that if I put a ? after the .* in this specific example, I get plenty of results. HOWEVER, this was just a simple example (there is no reason I would want to find anything and everything between img and src). I have updated the question with my actual code at the end, along with what I want it to do in the comments. Analysis of the HTML code will show you there are many photos in the section I have chosen to search, yet I only get one result when I use .*? to search – user88720 May 19 '17 at 02:20
