I am trying to return a heap of image URLs and want my regex to match every character, including new lines, in my findall call. However, when I use the DOTALL flag and put .* in my regex, I go from having plenty of results to only one. If anything, using .* in the regex should give more results, not fewer, because I am saying 'I will allow zero or more of any character here'.
The code below is without the .* and can be run in IDLE or another Python 2 editor; it returns a heap of image URLs.
from urllib import urlopen
from re import findall
import re
dennisov_url = 'https://denissov.ru/en/'
dennisov_html = urlopen(dennisov_url).read()
watch_image_urls = findall('<img src="([^"]*)', dennisov_html, flags=re.DOTALL)
print watch_image_urls
The code below is WITH the .* between '<img' and ' src', which should have no real effect, yet it returns only one URL this time.
from urllib import urlopen
from re import findall
import re
dennisov_url = 'https://denissov.ru/en/'
dennisov_html = urlopen(dennisov_url).read()
watch_image_urls = findall('<img.* src="([^"]*)', dennisov_html, flags=re.DOTALL)
print watch_image_urls
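To make this reproducible without hitting the site, here is a minimal sketch on a made-up HTML string. The `sample_html` content and the file names in it are my own assumption, not taken from the real page. It compares the pattern without .*, the pattern with .* and DOTALL, and the pattern with .* but no flag.

```python
import re

# Hypothetical stand-in for the downloaded page (not the real site HTML)
sample_html = (
    '<img src="a.jpg">\n'
    '<img src="b.jpg">\n'
    '<img src="c.jpg">\n'
)

# Without .* every <img src="..."> is matched separately
without_dot_star = re.findall(r'<img src="([^"]*)', sample_html, flags=re.DOTALL)

# With .* and DOTALL, the dot may also match new lines, so .* can run
# from the first <img up to the last ' src="' in the whole string
with_dot_star = re.findall(r'<img.* src="([^"]*)', sample_html, flags=re.DOTALL)

# Same pattern without DOTALL: .* cannot cross a new line
with_dot_star_no_flag = re.findall(r'<img.* src="([^"]*)', sample_html)

print(without_dot_star)
print(with_dot_star)
print(with_dot_star_no_flag)
```

On this sample, the first and third calls return all three file names, while the DOTALL version with .* returns only the last one, which looks like the same behaviour I see on the real page.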
Can someone tell me why it is doing this and how I fix it?
EDIT: The above is just example code that I simplified to explain my situation. The code below is my actual code, along with comments detailing what I want it to do. If you open the main URL and inspect the source, you will see that there are many images between the div with class "grid" and the div with class "orderplacebut".
from urllib import urlopen
from re import findall
import re
dennisov_url = 'https://denissov.ru/en/'
dennisov_html = urlopen(dennisov_url).read()
# Print all images between div class="grid" and div class="orderplacebut"
# Because the regex spans over several lines, use DOTALL flag to include
# every character between, including new lines
watch_image_urls = findall('<div class="grid"*?<img src="([^"]*)*?<div class="orderplacebut"', dennisov_html, flags=re.DOTALL)
print watch_image_urls
EDIT: This is the response I got from my professor. "Apart from this, the strategy you're using to try to match the image URLs can't succeed because the "grid" div class appears just once on the web page (at least when I view it in Firefox) and you seem to be trying to (a) match the start of that class and then (b) get all the images appearing inside it. This is a very hard (I suspect impossible!) thing to do with a single regular expression because you're "anchoring" the pattern to the start of the class, which prevents you from separately matching each of the figures inside it. Because the start of your pattern above only appears once in the web page, only one pattern can ever be returned! (The problem is not to do with "greedy" matching.) Instead you want to match just the figure URLs and as little of the surrounding HTML as possible. For instance, it's easy to get all the patterns that end with ".jpg"."
So as you can see, this question is not about greedy vs non-greedy regex, and it is not the same as the question it was flagged as a duplicate of.
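Following the professor's suggestion, here is a rough sketch of what I understand the alternatives to be. It runs on a made-up HTML string, so the `sample_html` markup and image paths are assumptions and the real page may be structured differently. The anchored pattern is a simplified variant in the spirit of mine, not my exact regex, and the two-step slicing is just one possible way to limit the search to the "grid" section.

```python
import re

# Hypothetical stand-in for the page (the real markup may differ)
sample_html = (
    '<div class="grid">\n'
    '<img src="/img/watch1.jpg">\n'
    '<img src="/img/watch2.jpg">\n'
    '</div>\n'
    '<div class="orderplacebut">\n'
    '<img src="/img/button.jpg">\n'
    '</div>\n'
)

# Anchored pattern: '<div class="grid"' occurs once, so findall can
# return at most one match, however many images follow it
anchored = re.findall(
    r'<div class="grid".*?<img src="([^"]*)', sample_html, flags=re.DOTALL
)

# Match only the URLs themselves: any quoted value ending in .jpg
all_jpgs = re.findall(r'"([^"]*\.jpg)"', sample_html)

# Two-step alternative: cut out the region between the two divs first,
# then search inside that slice (str.find returns -1 if a marker is
# missing, which this sketch does not handle)
start = sample_html.find('<div class="grid"')
end = sample_html.find('<div class="orderplacebut"', start)
grid_jpgs = re.findall(r'<img src="([^"]*)', sample_html[start:end])

print(anchored)
print(all_jpgs)
print(grid_jpgs)
```

On this sample the anchored pattern yields a single URL, the ".jpg" pattern yields every image on the page, and the sliced search yields only the images inside the "grid" div.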